Changelog
This page documents all notable changes to Moondream.
Moondream 2025-03-27 Release
Improvements
- Added support for long-form captioning
- Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
- Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
- Improved object detection, especially for small objects
- More detailed and diverse outputs for image tagging queries
Bug Fixes
- Fixed token streaming bug affecting multi-byte unicode characters
Developer
- gpt-fast style compile() now supported in the HF Transformers implementation
Moondream 2025-01-08 Release
Structured Output
Support for structured output formats such as JSON, XML, Markdown, and CSV.
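Because the model emits structured output as plain text, it can be consumed with a standard parser. A minimal sketch, where the response string is a hypothetical example of what a JSON-format query might return (not actual model output):

```python
import json

# Hypothetical response text from a structured-output query such as
# "List the objects in this image as JSON". The exact schema depends on
# your prompt; this payload is an illustrative assumption.
response_text = '{"objects": [{"name": "dog", "color": "brown"}, {"name": "ball", "color": "red"}]}'

# Parse the model's text response into a native data structure.
data = json.loads(response_text)
for obj in data["objects"]:
    print(obj["name"], obj["color"])
```

The same pattern applies to the other formats: ask for CSV and feed the text to a CSV reader, ask for Markdown and render it directly.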
New Capability: Gaze Detection
Experimental capability that tracks human attention in images.
- Driver gaze detection for automotive applications
- Sports gaze detection for analyzing player focus
Benchmark Improvements
Significant improvements across industry benchmarks.
Better OCR
- Improved vision layer for better text reading capabilities
- Enhanced document querying and understanding
- Better chart and diagram interpretation
Moondream 2024-12-04 Release
Moondream 0.5B: World's Smallest VLM
- 0.5B parameters optimized for edge devices and mobile platforms
- 479 MiB compressed at 8-bit, 375 MiB at 4-bit
- Memory usage: 996 MiB at 8-bit, 816 MiB at 4-bit
- Released under Apache License
Moondream 2024-11-25 Release
Playground Launch
- Improved user experience with automatic prompt suggester
- Visual Question Answering (VQA) for human-like responses
- Object detection with bounding box coordinates
- Image captioning for annotations
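To overlay the detection results above on an image, the box coordinates need to be in pixels. A minimal sketch, assuming the model returns bounding boxes as normalized coordinates in [0, 1] with hypothetical `x_min`/`y_min`/`x_max`/`y_max` keys (field names are an assumption, not a documented schema):

```python
def to_pixel_box(box, width, height):
    """Convert a normalized bounding box (coordinates in [0, 1])
    to integer pixel coordinates for a given image size."""
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )

# Assumed detection result on a 640x480 image.
detection = {"x_min": 0.25, "y_min": 0.1, "x_max": 0.75, "y_max": 0.9}
pixel_box = to_pixel_box(detection, 640, 480)
print(pixel_box)  # (160, 48, 480, 432)
```

Normalized coordinates keep the model's output independent of the input resolution, so the same box scales to any display size.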
Moondream 2024-07-23 Release
OCR & Document Improvements
- Significant improvements in OCR and document understanding
- Optimized for local runtime performance
Moondream 2024-04-02 Release
Enhanced OCR & Captioning
- Improved OCR capabilities for better text recognition
- Enhanced image captioning for more detailed descriptions
Moondream 2024-03-04 Initial Release
Initial Release
- 1.8B-parameter vision language model
- Optimized for edge devices
- Requires less than 5 GB of memory in 16-bit precision
- Basic visual understanding capabilities