We introduced segmenting as a Moondream skill in September 2025 with Moondream 3 Preview. It launched with state-of-the-art scores on segmenting benchmarks. Despite the launch of several segmenting vision models since then, Moondream remains top dog.
Today, we're excited to announce that we've raised the bar even further with an improvement now live on Moondream Cloud. This new version produces higher-quality segmentation results, posts better benchmark scores, and runs 40% faster than before.
Examples
Before diving into the details, let's take a look at some examples.
Moondream Segmenting Recap
Put simply, what makes Moondream segmenting different is that it:
- produces native SVG masks (vectors, not bitmasks)
- achieves state-of-the-art scores on segmentation benchmarks
- offers quick inference speeds, even though raw speed is not the only thing we optimize for
- supports deep, native referring capabilities such as "the person touching the door"
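Because the masks come back as SVG path data rather than bitmasks, downstream code that expects pixel masks needs a rasterization step. As a rough illustration (this is not Moondream's client code, and real masks use richer paths than the straight-line polygon handled here), a minimal even-odd scanline fill for a simple `M/L/Z` path might look like:

```python
import re

def svg_path_to_mask(path: str, width: int, height: int) -> list[list[int]]:
    """Rasterize a simple polygonal SVG path ("M x y L x y ... Z")
    into a binary mask via even-odd point-in-polygon tests.
    Illustrative only: curves (C/Q/A commands) are not handled."""
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", path)]
    pts = list(zip(nums[0::2], nums[1::2]))
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Test the pixel center against every polygon edge.
            px, py = x + 0.5, y + 0.5
            inside = False
            for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
                if (y1 > py) != (y2 > py):
                    # x-coordinate where this edge crosses the scanline
                    if px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
                        inside = not inside
            mask[y][x] = int(inside)
    return mask

# Triangle covering roughly the lower-left half of a 4x4 grid
mask = svg_path_to_mask("M 0 0 L 0 4 L 4 4 Z", 4, 4)
```

In practice you'd reach for an SVG library with curve support, but the sketch shows why vector masks are convenient: they stay resolution-independent until the moment you rasterize at whatever size you need.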
Benchmark Improvements
This latest segmentation model delivers a significant leap in performance across all major referring expression benchmarks. On RefCOCO+ Val, which tests attribute-based reasoning without positional cues, we achieve 79.1 mIoU, a 4.4-point improvement over the previous state-of-the-art (which was also Moondream!). RefCOCOg Val, which evaluates complex natural language descriptions, sees similar gains at 80.7 mIoU. We also report 88.2 mIoU on RefCOCO-M, our high-fidelity benchmark with pixel-accurate masks, underscoring that these gains translate to real-world precision, not just benchmark optimization.
| Metric: mIoU | RefCOCO Val | RefCOCO+ Val | RefCOCOg Val | RefCOCO-M |
|---|---|---|---|---|
| Old | 81.8 | 74.7 | 76.4 | 86.9 |
| New | 83.2 (+1.4) | 79.1 (+4.4) | 80.7 (+4.3) | 88.2 (+1.3) |
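For context on the metric: mIoU averages the per-example intersection-over-union between each predicted mask and its ground-truth mask. A minimal sketch over pixel coordinate sets (illustrative only, not the benchmarks' evaluation code):

```python
def iou(pred: set, gt: set) -> float:
    """Intersection-over-union of two masks represented as sets
    of (x, y) pixel coordinates."""
    if not pred and not gt:
        return 1.0  # both empty: perfect agreement by convention
    return len(pred & gt) / len(pred | gt)

def mean_iou(pairs) -> float:
    """mIoU: average the per-example IoU across a dataset of
    (predicted_mask, ground_truth_mask) pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

# Toy example: masks overlap on 2 of 4 total pixels -> IoU 0.5
pred = {(0, 0), (0, 1), (1, 0)}
gt = {(0, 0), (0, 1), (1, 1)}
```

A one-point gain on this metric requires improving masks across the whole validation set, which is why the +4.4 jump on RefCOCO+ is significant.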
How We Compare
Most segmentation-capable VLMs are either accurate or fast, but not both. Large multimodal models with bolted-on segmentation decoders can handle complex queries, but they are slow and expensive to run at scale. Lightweight models are fast, but they choke on anything beyond simple noun phrases. Moondream closes this gap: state-of-the-art accuracy at speeds that make latency-sensitive and high-throughput applications practical.
Moondream vs. SAM 3
SAM 3 can segment generic concepts like "car" or "person", but it can't natively resolve referring expressions. For prompts like "the person touching the door" or "laundry on the floor," you need to pair it with a larger reasoning model that adds tens of seconds of latency and drives up cost. Moondream handles complex prompts natively and returns crisp, higher-quality SVG masks at a 5x lower price point.
Conclusion
This update is live now on Moondream Cloud. If you're already using segmentation, you get better quality and lower latency immediately. Later this week, we'll also be releasing the model for local inference, along with a technical whitepaper for those who want to go deeper. Learn more about Moondream's segmentation skill at /skills/segment.