Models
Open vision models made for production environments, with built-in grounding skills. Fast enough for realtime. Small enough for edge. Accurate enough to beat frontier models on key benchmarks.
5M+
Monthly downloads
3
Model variants
6
Built-in skills
32K
Context window
Why Moondream
Not just another VLM
Moondream models are purpose-built for production vision AI. They are not general-purpose chatbots that happen to see images. Every architectural decision optimizes for real-world vision tasks at scale.
Built-in grounded skills
Object detection, pointing, captioning, visual Q&A, and segmentation are native model capabilities, not prompt hacks on top of a chatbot. The model outputs bounding boxes, coordinates, and masks directly.
Realtime inference
Moondream + Photon delivers realtime speeds on server GPUs and responsive performance on edge devices. Purpose-built CUDA kernels, automatic batching, and prefix caching keep latency low at any scale.
Runs everywhere
With Photon, Moondream runs on everything from H100 servers to Jetson Orin Nanos. Same model, same API, same results. Cloud, desktop, or embedded in a camera on a factory floor.
Frontier-level benchmarks
Moondream 3 Preview matches or beats models orders of magnitude larger on key grounding benchmarks. SOTA on multiple segmentation tasks. All with 2B active parameters.
Open weights
All Moondream models are available on HuggingFace with permissive licensing. Free for personal, research, and most commercial use. No restrictions on internal production deployments.
Fine-tuning ready
Moondream models are designed to be trained. Use Lens to fine-tune with supervised learning or reinforcement learning and deploy instantly with Photon. Good out of the box, great when customized.
Built-in skills
Grounded vision capabilities, not prompt tricks
These skills are trained into the model architecture. They output structured spatial data (bounding boxes, coordinates, masks) directly, not text descriptions of where things are. A short usage sketch follows the skill list below.
Detect
Open-vocabulary object detection. Describe what you're looking for in natural language and get bounding boxes back.
Caption
Generate rich image descriptions in short, normal, or long formats. Sub-400ms latency.
Query
Ask natural language questions about images and get accurate, detailed answers.
Point
Return exact (x, y) coordinates for every instance of an object you describe. Two tokens per point.
Segment
Turn text prompts into pixel-accurate SVG masks. State-of-the-art on RefCOCO+ segmentation.
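The skills above share one call pattern and return structured Python objects rather than free text. Here is a minimal sketch, assuming the Hugging Face `transformers` interface published with the open `vikhyatk/moondream2` weights (the `caption`, `query`, `detect`, and `point` helpers loaded via `trust_remote_code`). The image path and prompts are placeholders, and return keys can shift between releases, so check the model card before relying on exact shapes; segmentation is a Moondream 3 Preview skill and is omitted here.

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the open weights. trust_remote_code pulls in the skill helpers
# (caption / query / detect / point) that ship with the repository.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    device_map={"": "cuda"},  # drop this to run on CPU
)

image = Image.open("example.jpg")  # placeholder image path

# Caption: short, normal, or long descriptions.
print(model.caption(image, length="short")["caption"])

# Query: natural-language question in, natural-language answer out.
print(model.query(image, "How many people are in the image?")["answer"])

# Detect: open-vocabulary detection returns bounding boxes, not prose.
print(model.detect(image, "hard hat")["objects"])  # list of box dicts

# Point: one (x, y) coordinate pair per matching instance.
print(model.point(image, "person")["points"])  # list of point dicts
```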
Model lineup
Choose your model
Three models, one API. Moondream 3 Preview is the default for new projects. Moondream 2 is production-stable. Moondream 2 0.5B is a distillation target for extreme edge.
Moondream 3 Preview
The latest Moondream architecture. 9B total parameters with a sparse mixture-of-experts design that activates only 2B parameters per token. Frontier-level visual reasoning, grounded thinking, and native segmentation, at inference speeds comparable to a 2B dense model. Trained on ~450B tokens with reinforcement learning across 55+ vision-language tasks. Context length extended to 32K tokens.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 9B MoE | 2B active params |
| Experts | 64 total / 8 active per token | |
| Context | 32K tokens | |
| Vision Encoder | SigLIP | Multi-crop channel concat |
| Tokenizer | SuperBPE | 20-40% faster generation |
| Reasoning | Yes | Grounded visual reasoning |
Benchmarks
| Benchmark | Score |
|---|---|
| ScreenSpot F1@0.5 | 80.4 |
| CountBenchQA | 86.4 |
| COCO mAP | 51.2 |
| DocVQA | 79.3 |
| ChartQA | 77.5 |
| TextVQA | 76.3 |
| OCRBench | 61.2 |
| RefCOCO Val (Seg) | 83.2 mIoU |
| RefCOCO+ Val (Seg) | 79.1 mIoU |
| RefCOCOg Val (Seg) | 80.7 mIoU |
Moondream produces answers in a fraction of the time of the frontier models it competes with on these benchmarks.
Moondream 2
The workhorse. A 2B dense model that punches well above its weight. Over 5 million monthly downloads on HuggingFace. Runs on GPUs, CPUs, mobile devices, and Raspberry Pis. Supports fp16, int8, and int4 quantization with quantization-aware training. The int4 variant achieves a 42% memory reduction with only a 0.6% accuracy drop. Continuously updated since March 2024.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 2B Dense | 1.9B parameters |
| Context | 2K tokens | |
| Quantization | fp16 / int8 / int4 | QAT-trained |
| VRAM (int8) | 2,624 MiB | Runtime memory |
| VRAM (int4) | 2,002 MiB | Runtime memory |
| Speed (RTX 3090) | 184 tok/s | int4, with compile() |
Benchmarks (2025-06-21 release)
| Benchmark | Score |
|---|---|
| ScreenSpot F1@0.5 | 80.4 |
| CountBenchQA | 86.4 |
| COCO mAP | 51.2 |
| DocVQA | 79.3 |
| ChartQA | 77.5 |
| TextVQA | 76.3 |
| OCRBench | 61.2 |
These are scores for the latest (2025-06-21) release. Moondream 2 is updated regularly with benchmark improvements.
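Because the weights are refreshed regularly, production deployments usually pin the dated release they validated against. A minimal sketch, assuming the standard `revision` argument to `from_pretrained` and the dated release tags used on the `vikhyatk/moondream2` repository:

```python
from transformers import AutoModelForCausalLM

# Pin the exact release you benchmarked against; upgrade deliberately
# rather than picking up new weights on every fresh download.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",  # dated release tag matching the scores above
    trust_remote_code=True,
)
```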
Moondream 2 0.5B
The smallest vision language model available. 500 million parameters. Designed primarily as a distillation target for extreme edge deployments where every megabyte matters. The int4 variant downloads at just 375 MiB and runs in 816 MiB of memory. Not recommended out of the box for most use cases. Its real value is as a starting point: fine-tune it with Lens to create a specialized, tiny model that fits your exact hardware constraints.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 0.5B Dense | 500M parameters |
| Quantization | int8 / int4 | QAT-trained |
| Download (int8) | 479 MiB | Compressed |
| RAM (int8) | 996 MiB | Runtime memory |
| Download (int4) | 375 MiB | Compressed |
| RAM (int4) | 816 MiB | Runtime memory |
Best used as a fine-tuning base
Out-of-the-box accuracy is limited at this size. The 0.5B model is designed to be fine-tuned with Lens for a specific task, then deployed on extremely constrained hardware like mobile devices, Raspberry Pis, or embedded cameras. When fine-tuned for a narrow use case, accuracy improves dramatically.
FAQ