Vision models can describe what’s in an image, but they struggle with spatial references. Point at an object and ask “What color is this car?” and the model doesn’t know which car you mean. In this post we’ll learn about Set-of-Mark prompting and how it lets vision models see what you’re seeing 👀
When State Space Models Learned to See Globally 👓
Mamba promised linear time complexity. Vision benchmarks said “prove it”, and pure Mamba models fumbled badly. Transformers kept winning until MambaVision showed up and rewrote the rules entirely 🐍👓
