Changelog
This page documents all notable changes to Moondream.
Moondream 2025-03-27 Release
Improvements
- Added support for long-form captioning
- Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
- Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
- Improved object detection, especially for small objects
- More detailed and diverse outputs for image tagging queries
Bug Fixes
- Fixed token streaming bug affecting multi-byte unicode characters
Developer
- gpt-fast style compile() now supported in the HF Transformers implementation
Moondream 2025-01-08 Release
Structured Output
Support for structured output formats such as JSON, XML, Markdown, and CSV.
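Because the model emits structured output as plain text, it can be consumed with a standard parser. A minimal sketch, where the response string is a hypothetical example of what a JSON-format query might return (not actual model output):

```python
import json

# Hypothetical response text from a structured-output query such as
# "List the objects in this image as JSON". The exact schema depends on
# your prompt; this payload is an illustrative assumption.
response_text = '{"objects": [{"name": "dog", "color": "brown"}, {"name": "ball", "color": "red"}]}'

# Parse the model's text response into a native data structure.
data = json.loads(response_text)
for obj in data["objects"]:
    print(obj["name"], obj["color"])
```

The same pattern applies to the other formats: ask for CSV and feed the text to a CSV reader, ask for Markdown and render it directly.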
New Capability: Gaze Detection
Experimental capability that tracks human attention in images.
- Driver gaze detection for automotive applications
- Sports gaze detection for analyzing player focus
Benchmark Improvements
Significant improvements across industry benchmarks.
Better OCR
- Improved vision layer for better text reading capabilities
- Enhanced document querying and understanding
- Better chart and diagram interpretation
Moondream 2024-12-04 Release
Moondream 0.5B: World's Smallest VLM
- 0.5B parameters optimized for edge devices and mobile platforms
- 479 MiB compressed at 8-bit, 375 MiB at 4-bit
- Memory usage: 996 MiB at 8-bit, 816 MiB at 4-bit
- Released under Apache License
Moondream 2024-11-25 Release
Playground Launch
- Improved user experience with automatic prompt suggester
- Visual Question Answering (VQA) for human-like responses
- Object detection with bounding box coordinates
- Image captioning for annotations
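To overlay the detection results above on an image, the box coordinates need to be in pixels. A minimal sketch, assuming the model returns bounding boxes as normalized coordinates in [0, 1] with hypothetical `x_min`/`y_min`/`x_max`/`y_max` keys (field names are an assumption, not a documented schema):

```python
def to_pixel_box(box, width, height):
    """Convert a normalized bounding box (coordinates in [0, 1])
    to integer pixel coordinates for a given image size."""
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )

# Assumed detection result on a 640x480 image.
detection = {"x_min": 0.25, "y_min": 0.1, "x_max": 0.75, "y_max": 0.9}
pixel_box = to_pixel_box(detection, 640, 480)
print(pixel_box)  # (160, 48, 480, 432)
```

Normalized coordinates keep the model's output independent of the input resolution, so the same box scales to any display size.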
Moondream 2024-07-23 Release
OCR & Document Improvements
- Significant improvements in OCR and document understanding
- Optimized for local runtime performance
Moondream 2024-04-02 Release
Enhanced OCR & Captioning
- Improved OCR capabilities for better text recognition
- Enhanced image captioning for more detailed descriptions
Moondream 2024-03-04 Initial Release
Initial Release
- 1.8B-parameter vision language model
- Optimized for edge devices
- Requires less than 5 GB of memory in 16-bit precision
- Basic visual understanding capabilities