Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection
January 9, 2025
Today, we’re announcing a new release of Moondream 1.9B. It brings improvements across a number of areas and adds a new capability: Gaze Detection. This release marks the first time we’ve focused on industry benchmarks, and we’re excited to share some results. Despite these upgrades, the model is still just 1.9B parameters, so it’s fast and can run everywhere. Try it out in our playground or download it now.
1. Structured Output
Building with Moondream is easier than ever with our new support for structured output formats such as JSON, XML, Markdown, and CSV. Here are some examples, followed by a quick sketch of requesting JSON:
Example 1: JSON Structured Output
Example 2: XML Structured Output
Example 3: Markdown Structured Output
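Here’s a minimal sketch of requesting JSON through the query capability using the moondream Python client. Treat the client constructor, the query call, and the prompt wording as assumptions to adapt to your setup:

```python
# Sketch: asking Moondream for structured JSON output via a prompt.
# Assumes the `moondream` Python client (pip install moondream) and a
# placeholder API key / image path -- adjust for your environment.
import json

import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")  # cloud inference client
image = Image.open("street.jpg")       # any local image

prompt = (
    "List every vehicle in this image as a JSON array of objects "
    'with the keys "type" and "color". Respond with JSON only.'
)
answer = model.query(image, prompt)["answer"]
vehicles = json.loads(answer)  # parse the structured response
print(vehicles)
```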
2. New Capability: Gaze Detection
Traditional Vision AI consists of specialized models built for different tasks, like “object detection” (outline a specified object’s region in an image) or “captioning” (create a caption for an image). Moondream supports several of these common Vision AI tasks as “capabilities”, all within a single model. Moondream already supports object detection and captioning, as well as “visual querying” (ask any question about a photo) and “pointing” (get the x, y coordinates of any element within a photo).
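To make that concrete, here’s a rough sketch of those capabilities exercised through a single model via the moondream Python client. The exact method signatures and return shapes shown are assumptions; check the docs for your version:

```python
# Sketch: four vision tasks, one model. The return-value shapes in
# the comments are assumptions -- verify against the client docs.
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")
image = Image.open("kitchen.jpg")

caption = model.caption(image)["caption"]                          # captioning
answer = model.query(image, "How many mugs are there?")["answer"]  # visual querying
objects = model.detect(image, "mug")["objects"]                    # object detection (boxes)
points = model.point(image, "mug")["points"]                       # pointing ((x, y) coords)
print(caption, answer, objects, points, sep="\n")
```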
Today, we are excited to launch a new capability: “Gaze Detection”. It detects where the people in an image are looking. Note that this capability is experimental; we’re releasing it to get feedback from developers so we can improve it over time. A code sketch follows the examples below.
Example 1: Driver Gaze Detection
Example 2: Sport Gaze Detection
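If you want to experiment with it, here’s a rough sketch. The detect_gaze call and the field names below are illustrative assumptions, written to show the intended flow; check the current docs before relying on them:

```python
# Experimental sketch of gaze detection: find a face, then ask where
# that person is looking. The detect_gaze signature and the box /
# gaze-point field names are assumptions -- verify against the docs.
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")
image = Image.open("driver.jpg")

# Locate a face with the existing detection capability.
face = model.detect(image, "face")["objects"][0]
face_center = (
    (face["x_min"] + face["x_max"]) / 2,
    (face["y_min"] + face["y_max"]) / 2,
)

# Ask where that person is looking (assumed call).
gaze = model.detect_gaze(image, face_center)
print(gaze)  # e.g. a normalized (x, y) target point in the image
```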
3. Benchmarks
We’ve always been a bit iffy about benchmarks. Some focus on problems we don’t think are relevant to Moondream (e.g., solving math equations). Others include weird questions and wrong answers (at least to us – see the Weird Benchmarks appendix below). And focusing too much on benchmarks can create perverse incentives, with allegations that some models “cheat” by training on the benchmarks themselves.
Despite this, we decided to improve our scores because we don’t want anyone sleeping on Moondream because of low results. We benchmarked ourselves along with the top small vision language models.
You can find our individual benchmark results below.
4. Better OCR
We made changes to Moondream’s vision layer that significantly improve text reading / OCR. We’ve also trained it on a lot more document querying and understanding data. Here are some examples, plus a short code sketch:
Example 1: OCR Example
Example 2: Chart OCR Example
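As with the other capabilities, OCR runs through plain visual queries. Here’s a short sketch (the file name and prompts are placeholders):

```python
# Sketch: pulling text and specific fields out of a document image.
# The image path and prompt wording are placeholders.
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")
invoice = Image.open("invoice.jpg")

full_text = model.query(invoice, "Transcribe all text in this document.")["answer"]
total = model.query(invoice, "What is the total amount due?")["answer"]
print(full_text)
print("Total:", total)
```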
Looking Ahead
As pumped as we are about this release, the best part for us is seeing what you build with it. VLMs are making it faster, cheaper, and easier than ever to build next-generation vision-enabled apps. Getting set up takes minutes, or you can try out Moondream on our playground. We offer cloud inference with a generous free tier, or just download the model and run it yourself (see the sketch below). Check out our docs for a getting started guide and lots of sample code.
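If you’d rather run locally, a quick-start might look like this; the model filename is a placeholder for whichever weights file you download:

```python
# Sketch: local inference instead of the cloud API. Assumes
# `pip install moondream` and a downloaded model file; the filename
# below is a placeholder.
import moondream as md
from PIL import Image

model = md.vl(model="./moondream-latest.mf")  # path to downloaded weights
print(model.caption(Image.open("photo.jpg"))["caption"])
```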
Happy Moondreaming!
Appendix 1: Weird Benchmark Questions
Here are a few examples of weird benchmark questions…
Example 1: Confusing Benchmark Question
In GQA, the following image comes with the question “Is the traffic signal on the right side or the left?” If you look closely, you can see there are traffic lights on both sides of the street. However, GQA expects the answer to be “Left”.
Example 2: Nonsensical Benchmark Question
In the following image, GQA asks “What animal sits in the bench that is on the right side?” It expects the answer to be “bird” – 🤯.