Models

Open vision models made for production environments, with built-in grounding skills. Fast enough for realtime. Small enough for edge. Accurate enough to beat frontier models on key benchmarks.

5M+ monthly downloads · 3 model variants · 6 built-in skills · 32K context window

Why Moondream

Not just another VLM

Moondream models are purpose-built for production vision AI. They are not general-purpose chatbots that happen to see images. Every architectural decision is made to optimize for real-world vision tasks at scale.

Built-in grounded skills

Object detection, pointing, captioning, visual Q&A, and segmentation are native model capabilities, not prompt hacks on top of a chatbot. The model outputs bounding boxes, coordinates, and masks directly.
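As a minimal sketch of what that looks like in practice, assuming the moondream2 HuggingFace interface (the revision pin and image path below are illustrative, and method names can shift between releases):

```python
# Minimal sketch: Moondream's grounded skills are first-class methods,
# not prompt templates. Assumes the moondream2 HuggingFace interface;
# the revision and image path are illustrative.
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    # device_map={"": "cuda"},  # uncomment to run on GPU
)

image = Image.open("warehouse.jpg")  # placeholder image

answer = model.query(image, "How many forklifts are visible?")["answer"]
boxes = model.detect(image, "forklift")["objects"]   # bounding boxes
points = model.point(image, "forklift")["points"]    # center coordinates
caption = model.caption(image, length="short")["caption"]
```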

Realtime inference

Moondream + Photon delivers realtime speeds on server GPUs and responsive performance on edge devices. Purpose-built CUDA kernels, automatic batching, and prefix caching keep latency low at any scale.

Runs everywhere

With Photon, Moondream runs on everything from H100 servers to Jetson Orin Nanos. Same model, same API, same results. Cloud, desktop, or embedded in a camera on a factory floor.
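A sketch of what "same API" means with the `moondream` Python client; the API key and the local .mf path are placeholders:

```python
# Sketch: identical client code against the cloud or a local quantized
# build. Assumes the `moondream` Python client; the api_key and .mf
# path are placeholders.
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY")            # cloud-hosted
# model = md.vl(model="./moondream-2b-int8.mf")  # ...or the same calls locally

image = Image.open("dock_camera.jpg")  # placeholder image
print(model.query(image, "Is loading bay 3 occupied?")["answer"])
```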

Frontier-level benchmarks

Moondream 3 Preview matches or beats models orders of magnitude larger on key grounding benchmarks. SOTA on multiple segmentation tasks. All with 2B active parameters.

Open weights

All Moondream models are available on HuggingFace with permissive licensing. Free for personal, research, and most commercial use. No restrictions on internal production deployments.

Fine-tuning ready

Moondream models are designed to be trained. Use Lens to fine-tune with supervised learning or reinforcement learning and deploy instantly with Photon. Good out of the box, great when customized.

Built-in skills

Grounded vision capabilities, not prompt tricks

These skills are trained into the model architecture. They output structured spatial data (bounding boxes, coordinates, masks) directly, not text descriptions of where things are.
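Downstream code consumes these outputs directly. A short sketch that scales detection results to pixel boxes, assuming each object carries normalized x_min/y_min/x_max/y_max values in [0, 1] as in the moondream2 docs (verify the schema for your release; `model` is loaded as in the earlier sketch):

```python
# Sketch: converting Moondream's structured detect output to pixel-space
# boxes. Assumes normalized x_min/y_min/x_max/y_max fields in [0, 1];
# verify the schema for your release. `model` as loaded above.
from PIL import Image, ImageDraw

image = Image.open("frame.png")  # placeholder image
result = model.detect(image, "person")

draw = ImageDraw.Draw(image)
w, h = image.size
for obj in result["objects"]:
    # Scale normalized coordinates to this image's pixel dimensions.
    box = (obj["x_min"] * w, obj["y_min"] * h,
           obj["x_max"] * w, obj["y_max"] * h)
    draw.rectangle(box, outline="red", width=3)
image.save("frame_annotated.png")
```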

Model lineup

Choose your model

Three models, one API. Moondream 3 Preview is the default for new projects. Moondream 2 is production-stable. Moondream 2 0.5B is a distillation target for the extreme edge.

Moondream 3 Preview

Preview · Recommended

The latest Moondream architecture. 9B total parameters with a sparse mixture-of-experts design that activates only 2B parameters per token. Frontier-level visual reasoning, grounded thinking, and native segmentation, at inference speeds comparable to a 2B dense model. Trained on ~450B tokens with reinforcement learning across 55+ vision-language tasks. Context length extended to 32K tokens.

Architecture: 9B MoE, 2B active params
Experts: 64 total / 8 active per token
Context: 32K tokens
Vision encoder: SigLIP, multi-crop channel concatenation
Tokenizer: SuperBPE, 20-40% faster generation
Reasoning: yes, grounded visual reasoning
Skills: Query, Detect, Point, Caption, Segment, OCR, Structured Output, Reasoning
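To make the sparse design concrete, here is a deliberately naive sketch of top-k expert routing: each token consults a router and runs only its top 8 of 64 experts, which is why active parameters, not total parameters, set the inference cost. This is not Moondream's implementation; production kernels batch tokens per expert rather than looping.

```python
# Naive top-k mixture-of-experts routing, for illustration only.
# Each token runs just top_k of n_experts expert MLPs, so compute per
# token scales with active parameters, not total parameters.
import torch
import torch.nn.functional as F

n_experts, top_k, d_model = 64, 8, 256  # d_model shrunk for the sketch

experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
router = torch.nn.Linear(d_model, n_experts)

def moe_forward(x):  # x: (n_tokens, d_model)
    weights, idx = router(x).topk(top_k, dim=-1)  # pick 8 of 64 per token
    weights = F.softmax(weights, dim=-1)          # normalize expert weights
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                   # per-token loop: naive
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])        # only chosen experts run
    return out

y = moe_forward(torch.randn(16, d_model))
```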

Benchmarks

Benchmark              Score
ScreenSpot F1@0.5      80.4
CountBenchQA           86.4
COCO mAP               51.2
DocVQA                 79.3
ChartQA                77.5
TextVQA                76.3
OCRBench               61.2
RefCOCO Val (Seg)      83.2 mIoU
RefCOCO+ Val (Seg)     79.1 mIoU
RefCOCOg Val (Seg)     80.7 mIoU

Moondream produces answers in a fraction of the time required by the frontier models it competes with on these benchmarks.

Moondream 2

Stable

The workhorse. A 2B dense model that punches well above its weight. Over 5 million monthly downloads on HuggingFace. Runs on GPUs, CPUs, mobile devices, and Raspberry Pis. Supports fp16, int8, and int4 quantization with quantization-aware training. The int4 variant achieves a 42% memory reduction with only a 0.6% accuracy drop. Continuously updated since March 2024.

Architecture: 2B dense, 1.9B parameters
Context: 2K tokens
Quantization: fp16 / int8 / int4, QAT-trained
VRAM (int8): 2,624 MiB runtime memory
VRAM (int4): 2,002 MiB runtime memory
Speed (RTX 3090): 184 tok/s, int4 with compile()
Skills: Query, Detect, Point, Caption, OCR, Segment
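Because the weights are updated regularly, production deployments usually pin a release, and the quantized builds run through the local client. A hedged sketch of both, assuming the moondream2 transformers interface and the `moondream` client package (the .mf path is a placeholder for whichever int8 or int4 build you download):

```python
# Sketch A: pin Moondream 2 to a specific HuggingFace release so regular
# updates don't change behavior underneath you.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",   # the release benchmarked below
    trust_remote_code=True,
)

# Sketch B: run a quantized build locally with the `moondream` client.
# The .mf path is a placeholder for the int8/int4 file you download.
import moondream as md
from PIL import Image

local = md.vl(model="./moondream-2b-int4.mf")
image = Image.open("inspection.jpg")   # placeholder image
encoded = local.encode_image(image)    # encode once, reuse across skills
print(local.query(encoded, "Is the label legible?")["answer"])
```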

Benchmarks (2025-06-21 release)

Benchmark              Score
ScreenSpot F1@0.5      80.4
CountBenchQA           86.4
COCO mAP               51.2
DocVQA                 79.3
ChartQA                77.5
TextVQA                76.3
OCRBench               61.2

These are scores for the latest (2025-06-21) release. Moondream 2 is updated regularly with benchmark improvements.

Moondream 2 0.5B

Distillation Target

The smallest vision language model available. 500 million parameters. Designed primarily as a distillation target for extreme edge deployments where every megabyte matters. The int4 variant downloads at just 375 MiB and runs in 816 MiB of memory. Not recommended out of the box for most use cases; its real value is as a starting point. Fine-tune it with Lens to create a specialized, tiny model that fits your exact hardware constraints.
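One way to read "distillation target": use a larger Moondream as the teacher to auto-label your images, then fine-tune the 0.5B student on the result. A rough sketch under that assumption; the paths, the "defect" prompt, and the JSONL format are all illustrative, and Lens defines its own dataset format:

```python
# Rough sketch of teacher labeling for distillation: a larger Moondream
# annotates frames, producing training data for the 0.5B student.
# Paths, prompt, and output format are illustrative.
import json
from pathlib import Path

import moondream as md
from PIL import Image

teacher = md.vl(model="./moondream-2b-int8.mf")  # placeholder path

records = []
for path in sorted(Path("frames").glob("*.jpg")):
    encoded = teacher.encode_image(Image.open(path))
    objects = teacher.detect(encoded, "defect")["objects"]
    records.append({"image": str(path), "objects": objects})

Path("distill_set.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records)
)
```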

Architecture: 0.5B dense, 500M parameters
Quantization: int8 / int4, QAT-trained
Download (int8): 479 MiB compressed
RAM (int8): 996 MiB runtime memory
Download (int4): 375 MiB compressed
RAM (int4): 816 MiB runtime memory
Skills: Query, Detect, Point, Caption, OCR, Segment

Best used as a fine-tuning base

Out-of-the-box accuracy is limited at this size. The 0.5B model is designed to be fine-tuned with Lens for a specific task, then deployed on extremely constrained hardware like mobile devices, Raspberry Pis, or embedded cameras. When fine-tuned for a narrow use case, accuracy improves dramatically.
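Purely as a shape-of-the-workflow sketch: everything below is hypothetical, including the import; Lens's real interface may look nothing like this.

```python
# Hypothetical sketch only: the `lens` module, class, and argument names
# below are invented to show the workflow shape, not Lens's actual API.
from lens import FineTuneJob  # hypothetical import

job = FineTuneJob(
    base_model="moondream-0_5b",   # the distillation target above
    dataset="distill_set.jsonl",   # e.g. the teacher-labeled set
    method="sft",                  # supervised fine-tuning; RL also described
)
artifact = job.run()
artifact.deploy(runtime="photon")  # hypothetical one-step deploy
```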

FAQ

Frequently asked questions