Models
Open vision models made for production environments, with built-in grounding skills. Fast enough for realtime. Small enough for edge. Accurate enough to beat frontier models on key benchmarks.
5M+
Monthly downloads
3
Model variants
6
Built-in skills
32K
Context window
Why Moondream
Not just another VLM
Moondream models are purpose-built for production vision AI. They are not general-purpose chatbots that happen to see images. Every architectural decision optimizes for real-world vision tasks at scale.
Built-in grounded skills
Object detection, pointing, captioning, visual Q&A, and segmentation are native model capabilities, not prompt hacks on top of a chatbot. The model outputs bounding boxes, coordinates, and masks directly.
Realtime inference
Moondream + Photon delivers realtime speeds on server GPUs and responsive performance on edge devices. Purpose-built CUDA kernels, automatic batching, and prefix caching keep latency low at any scale.
Runs everywhere
With Photon, Moondream runs on everything from H100 servers to Jetson Orin Nanos. Same model, same API, same results. Cloud, desktop, or embedded in a camera on a factory floor.
Frontier-level benchmarks
Moondream 3 Preview matches or beats models orders of magnitude larger on key grounding benchmarks. SOTA on multiple segmentation tasks. All with 2B active parameters.
Open weights
All Moondream models are available on HuggingFace with permissive licensing. Free for personal, research, and most commercial use. No restrictions on internal production deployments.
Fine-tuning ready
Moondream models are designed to be trained. Use Lens to fine-tune with supervised learning or reinforcement learning and deploy instantly with Photon. Good out of the box, great when customized.
Built-in skills
Grounded vision capabilities, not prompt tricks
These skills are trained into the model architecture. They output structured spatial data (bounding boxes, coordinates, masks) directly, not text descriptions of where things are. A short usage sketch follows the skill list below.
Detect
Open-vocabulary object detection. Describe what you're looking for in natural language and get bounding boxes back.
Caption
Generate rich image descriptions in short, normal, or long formats. Sub-400ms latency.
Query
Ask natural language questions about images and get accurate, detailed answers.
Point
Return exact (x, y) coordinates for every instance of an object you describe. Two tokens per point.
Segment
Turn text prompts into pixel-accurate SVG masks. State-of-the-art on RefCOCO+ segmentation.
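The skills above share one call pattern and return structured Python objects rather than free text. Here is a minimal sketch, assuming the Hugging Face `transformers` interface published with the open `vikhyatk/moondream2` weights (the `caption`, `query`, `detect`, and `point` helpers loaded via `trust_remote_code`). The image path and prompts are placeholders, and return keys can shift between releases, so check the model card before relying on exact shapes; segmentation is a Moondream 3 Preview skill and is omitted here.

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the open weights. trust_remote_code pulls in the skill helpers
# (caption / query / detect / point) that ship with the repository.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    device_map={"": "cuda"},  # drop this to run on CPU
)

image = Image.open("example.jpg")  # placeholder image path

# Caption: short, normal, or long descriptions.
print(model.caption(image, length="short")["caption"])

# Query: natural-language question in, natural-language answer out.
print(model.query(image, "How many people are in the image?")["answer"])

# Detect: open-vocabulary detection returns bounding boxes, not prose.
print(model.detect(image, "hard hat")["objects"])  # list of box dicts

# Point: one (x, y) coordinate pair per matching instance.
print(model.point(image, "person")["points"])  # list of point dicts
```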
Model lineup
Choose your model
Three models, one API. Moondream 3 Preview is the default for new projects. Moondream 2 is production-stable. Moondream 2 0.5B is a distillation target for extreme edge.
Moondream 3 Preview
The latest Moondream architecture. 9B total parameters with a sparse mixture-of-experts design that activates only 2B parameters per token. Frontier-level visual reasoning, grounded thinking, and native segmentation, at inference speeds comparable to a 2B dense model. Trained on ~450B tokens with reinforcement learning across 55+ vision-language tasks. Context length extended to 32K tokens.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 9B MoE | 2B active params |
| Experts | 64 total / 8 active per token | |
| Context | 32K tokens | |
| Vision Encoder | SigLIP | Multi-crop channel concat |
| Tokenizer | SuperBPE | 20-40% faster generation |
| Reasoning | Yes | Grounded visual reasoning |
Benchmarks
| Benchmark | Score |
|---|---|
| ScreenSpot F1@0.5 | 80.4 |
| CountBenchQA | 86.4 |
| COCO mAP | 51.2 |
| DocVQA | 79.3 |
| ChartQA | 77.5 |
| TextVQA | 76.3 |
| OCRBench | 61.2 |
| RefCOCO Val (Seg) | 83.2 mIoU |
| RefCOCO+ Val (Seg) | 79.1 mIoU |
| RefCOCOg Val (Seg) | 80.7 mIoU |
Moondream produces answers in a fraction of the time of the frontier models it competes with on these benchmarks.
Moondream 2
The workhorse. A 2B dense model that punches well above its weight. Over 5 million monthly downloads on HuggingFace. Runs on GPUs, CPUs, mobile devices, and Raspberry Pis. Supports fp16, int8, and int4 quantization with quantization-aware training. The int4 variant achieves a 42% memory reduction with only a 0.6% accuracy drop. Continuously updated since March 2024.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 2B Dense | 1.9B parameters |
| Context | 2K tokens | |
| Quantization | fp16 / int8 / int4 | QAT-trained |
| VRAM (int8) | 2,624 MiB | Runtime memory |
| VRAM (int4) | 2,002 MiB | Runtime memory |
| Speed (RTX 3090) | 184 tok/s | int4, with compile() |
Benchmarks (2025-06-21 release)
| Benchmark | Score |
|---|---|
| ScreenSpot F1@0.5 | 80.4 |
| CountBenchQA | 86.4 |
| COCO mAP | 51.2 |
| DocVQA | 79.3 |
| ChartQA | 77.5 |
| TextVQA | 76.3 |
| OCRBench | 61.2 |
These are scores for the latest (2025-06-21) release. Moondream 2 is updated regularly with benchmark improvements.
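Because the weights are refreshed regularly, production deployments usually pin the dated release they validated against. A minimal sketch, assuming the standard `revision` argument to `from_pretrained` and the dated release tags used on the `vikhyatk/moondream2` repository:

```python
from transformers import AutoModelForCausalLM

# Pin the exact release you benchmarked against; upgrade deliberately
# rather than picking up new weights on every fresh download.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",  # dated release tag matching the scores above
    trust_remote_code=True,
)
```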
Moondream 2 0.5B
The smallest vision language model available. 500 million parameters. Designed primarily as a distillation target for extreme edge deployments where every megabyte matters. The int4 variant downloads at just 375 MiB and runs in 816 MiB of memory. Not recommended out of the box for most use cases. Its real value is as a starting point: fine-tune it with Lens to create a specialized, tiny model that fits your exact hardware constraints.
| Spec | Value | Notes |
|---|---|---|
| Architecture | 0.5B Dense | 500M parameters |
| Quantization | int8 / int4 | QAT-trained |
| Download (int8) | 479 MiB | Compressed |
| RAM (int8) | 996 MiB | Runtime memory |
| Download (int4) | 375 MiB | Compressed |
| RAM (int4) | 816 MiB | Runtime memory |
Best used as a fine-tuning base
Out-of-the-box accuracy is limited at this size. The 0.5B model is designed to be fine-tuned with Lens for a specific task, then deployed on extremely constrained hardware like mobile devices, Raspberry Pis, or embedded cameras. When fine-tuned for a narrow use case, accuracy improves dramatically.
FAQ