Showcase

Announcing Gaze Detection.

January 6, 2025

Vision-Language Models (VLMs) are “foundational” as they can be adapted for many different tasks. Since Moondream’s launch we’ve released several capabilities such as Object Detection and Pointing. With a new Moondream launch scheduled later this week, we’re excited to pre-announce a new capability: Gaze Detection.

This capability does what it says: it determines what the people in an image are looking at. This is useful for captioning videos, understanding social dynamics, and for specific applications such as sports analytics or detecting when drivers and operators are distracted. It's likely useful for even more use cases we haven't thought of yet. That's why it's so exciting for us to keep Moondream open source: it makes it easier (and cheaper) for everyone to build together.

Moondream's results are promising. It currently scores 0.103 on the GazeFollow benchmark's Avg L2 metric (lower is better), which is close to the state of the art: Gaze-LLE, a specialized model, reaches 0.099, and a human annotator scores about 0.096. In other words, Moondream is roughly as accurate as asking a person to do it.
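For context on those numbers: GazeFollow's Avg L2 metric is the average Euclidean distance between predicted and annotated gaze points, expressed in normalized [0, 1] image coordinates. A minimal sketch of how such a score is computed (the input points below are illustrative, not actual benchmark data):

```python
import math

def avg_l2(preds, targets):
    """Average L2 (Euclidean) distance between predicted and
    ground-truth gaze points in normalized [0, 1] coordinates.
    Lower is better; 0.0 means every prediction was exact."""
    assert len(preds) == len(targets)
    total = 0.0
    for (px, py), (tx, ty) in zip(preds, targets):
        total += math.hypot(px - tx, py - ty)
    return total / len(preds)

# Illustrative predicted vs. annotated gaze points.
preds = [(0.50, 0.40), (0.20, 0.75)]
targets = [(0.55, 0.43), (0.25, 0.70)]
print(round(avg_l2(preds, targets), 4))  # → 0.0645
```

On this scale, the gap between Moondream (0.103) and a human annotator (0.096) amounts to well under one percent of the image diagonal.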

This capability will be part of our upcoming Moondream release, scheduled for later this week. Meanwhile, you can see it in action and try it out here. We're excited to see gaze detection used in next-generation Vision AI apps. Let us know if you have plans to use it, or if you have any questions.

We have more release announcements lined up this week. Keep Moondreaming, y’all.


VISION AI THAT RUNS EVERYWHERE