Model Release

Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection

January 9, 2025

Today, we’re announcing a new release of Moondream 1.9B. It brings improvements across a bunch of areas and adds a new capability: Gaze Detection. This release marks the first time we’ve focused on industry benchmarks, and we’re excited to share some results. Despite these upgrades, the model is still just 1.9B parameters, so it’s fast and can run everywhere. Try it out in our playground or download it now.

1. Structured Output

Building with Moondream is easier than ever with our new support for structured output formats such as JSON, XML, Markdown, and CSV. Here are some examples:

Example 1: JSON structured output

Example 2: XML structured output

Example 3: Markdown structured output
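
Structured output works through plain prompting: you describe the format you want, and the model replies in it. Here’s a minimal sketch using the Moondream Python client; the `md.vl` and `query` calls follow the client library, while the filename and prompt wording are just illustrative:

```python
import moondream as md
from PIL import Image

# Hosted inference; a locally downloaded model file also works.
model = md.vl(api_key="your-api-key")

image = Image.open("bookshelf.jpg")  # illustrative filename

# Ask for the format you want directly in the prompt.
result = model.query(
    image,
    'List every book visible on the shelf as a JSON array of '
    'objects with "title" and "author" keys.',
)
print(result["answer"])  # e.g. [{"title": "...", "author": "..."}, ...]
```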

2. New Capability: Gaze Detection

Traditional Vision AI consists of specialized models built for different tasks, like “object detection” (outline a specified object’s region in an image) or “captioning” (create a caption for an image). Moondream supports several of these common Vision AI tasks as “capabilities”, all within a single model. Moondream already supports object detection and captioning, as well as “visual querying” (ask any question about a photo) and “pointing” (get the x,y coordinates of any element within a photo).
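
In the Python client, each capability is its own method call. Here’s a rough sketch; the method names match the client library, but treat the exact return shapes (and the normalized 0–1 coordinates) as assumptions to verify against the docs:

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")
image = Image.open("street.jpg")  # illustrative filename

# Captioning: generate a caption for the image.
caption = model.caption(image)["caption"]

# Visual querying: ask any question about the image.
answer = model.query(image, "How many cars are in this photo?")["answer"]

# Object detection: bounding boxes for a named object.
cars = model.detect(image, "car")["objects"]

# Pointing: x,y coordinates of matching elements.
points = model.point(image, "car")["points"]
```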

Today, we are excited to launch a new capability: “Gaze Detection”. It detects where a person in an image is looking. Note that this capability is experimental; we’re releasing it now to get feedback from developers so we can improve it over time.

Example 1: Driver Gaze Detection

Example 2: Sport Gaze Detection
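
Because the capability is experimental, the client interface may shift. The sketch below chains face detection into gaze estimation; the `detect_gaze` call and its argument shape are illustrative assumptions, so check the docs for the current signature:

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")
image = Image.open("driver.jpg")  # illustrative filename

# Find faces first, then estimate where each person is looking.
faces = model.detect(image, "face")["objects"]

for face in faces:
    # Center of the detected face box (coordinates normalized to 0-1).
    cx = (face["x_min"] + face["x_max"]) / 2
    cy = (face["y_min"] + face["y_max"]) / 2

    # Experimental: detect_gaze is an assumed method name; it returns a
    # gaze target point when the model finds one.
    result = model.detect_gaze(image, face={"x": cx, "y": cy})
    if result.get("gaze"):
        gaze = result["gaze"]
        print(f"Face at ({cx:.2f}, {cy:.2f}) is looking at "
              f"({gaze['x']:.2f}, {gaze['y']:.2f})")
```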

3. Benchmarks

We’ve always been a bit iffy about benchmarks. Some focus on problems we don’t think are relevant to Moondream (e.g., solving math equations). Others include weird questions and wrong answers (at least to us; see the Weird Benchmark Questions appendix below). And focusing too much on benchmarks can lead to weird behaviors, with allegations that some models “cheat” by training on the benchmarks themselves.

Despite this, we decided to improve our scores because we don’t want anyone sleeping on Moondream because of low results. We benchmarked ourselves along with the top small vision language models.

You can find our individual benchmark results below.

4. Better OCR

We made changes to Moondream’s vision layer that have significantly improved text reading / OCR. We’ve also trained it on a lot more document querying and understanding. Here are some examples:

Example 1: OCR Example

Example 2: Chart OCR Example
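
OCR and document understanding run through the same visual querying capability, so there’s no separate API to learn. A minimal sketch (prompts and filenames illustrative):

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key")

# Plain transcription of the text in an image.
receipt = Image.open("receipt.jpg")
text = model.query(receipt, "Transcribe all of the text in this image.")["answer"]

# Querying a chart or document directly.
chart = Image.open("quarterly_revenue.png")
answer = model.query(
    chart, "According to this chart, which quarter had the highest revenue?"
)["answer"]
```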

Looking Ahead

As pumped as we are about this release, the best part for us is seeing what you build with it. VLMs are making it faster, cheaper, and easier than ever to build next-generation vision-enabled apps. Getting set up takes minutes, or you can try out Moondream in our playground. We offer cloud inference with a generous free tier, or just go ahead and download the model and run it yourself. Check out our docs for a getting started guide and lots of sample code.

Happy Moondreaming!

Appendix 1: Weird Benchmark Questions

Here are a few examples of weird benchmark questions…

Example 1: Confusing Benchmark Question

In GQA, the following image has a question that asks “Is the traffic signal on the right side or the left?”. If you look closely, you can see there are traffic lights on both sides of the street. However, GQA expects the answer to be “Left”.

Example 2: Nonsensical Benchmark Question

In the following image, GQA asks “What animal sits in the bench that is on the right side?”. It expects the answer to be “bird” 🤯.
