Vision models can describe what’s in an image, but they struggle with spatial references. Point at an object and ask “What color is this car?” and the model has no idea what you’re referring to. In this post we’ll explore Set-of-Mark prompting and how it lets vision models see what you’re seeing 👀
Building a Production-Ready Text-to-Speech API
Set yourself apart from other MLEs by learning how to work with audio and serve text-to-speech models.
Self-Hosting LLMs with vLLM
Get ready to save some money 💰. In this post, you’ll learn how to set up your own LLM server with vLLM, choose the right models, and design an architecture that fits your use case.
