Vision models describe what’s in an image, but they can’t handle spatial references. Point at an object and ask “What color is this car?” and the model doesn’t know what you’re talking about. In this post we’ll learn about Set-of-Mark prompting and how vision models can see what you’re seeing 👀
