Changelog

This page documents all notable changes to Moondream.

Improvements
  • Improved chart understanding (ChartQA up from 74.8 to 77.5, or 82.2 with PoT)
  • Added temperature and nucleus sampling to reduce repetitive outputs
  • Better OCR for documents and tables (prompt with "Transcribe the text" or "Transcribe the text in natural reading order")
  • Object detection now supports document layout detection (figure, formula, text, etc.)
  • Improved UI understanding (ScreenSpot F1@0.5 up from 53.3 to 60.3)
  • Improved text understanding (DocVQA up from 76.5 to 79.3, TextVQA up from 74.6 to 76.3)
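
The temperature and nucleus sampling mentioned in the list above can be illustrated with a short, generic sketch. This is not Moondream's implementation: the function names and the default values (temperature=0.7, top_p=0.9) are assumptions chosen for illustration. Temperature rescales the logits before the softmax, and nucleus (top-p) sampling restricts the draw to the smallest set of tokens whose combined probability reaches top_p.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id (defaults are illustrative, not Moondream's)."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = nucleus_filter(probs, top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Because low-probability tokens are cut off while the remaining ones are still sampled at random, the output avoids both unlikely tokens and the fixed loops that purely greedy decoding can fall into.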
Quantization Aware Training
  • 4-bit model with quantization-aware training for faster inference and lower memory use
  • Runs at 184 tokens/second on an RTX 3090 with 2.4 GB memory usage (42% less than full precision)
  • Only a 0.6% drop in accuracy (74.5 vs 74.9 average score)
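
As general background on the entry above, quantization-aware training usually simulates low-precision rounding during training so that the weights learn to tolerate it. The sketch below shows only that rounding step, assuming symmetric per-tensor 4-bit quantization. The scheme Moondream actually uses is not stated here, and `fake_quantize` is a hypothetical helper name.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    Assumption: one scale per tensor, chosen so the largest weight maps to qmax.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized], scale


weights = [0.70, -0.33, 0.12, 0.0, -0.58]
dequantized, scale = fake_quantize(weights)
# Each weight moves by at most half a quantization step.
errors = [abs(w - d) for w, d in zip(weights, dequantized)]
```

During training, the forward pass would use the rounded values while gradients update the full-precision weights; that part depends on the training framework and is omitted here.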
Improvements
  • Added support for long-form captioning
  • Added open-vocabulary image tagging
  • Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
  • Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
  • Improved object detection, especially for small objects (e.g. COCO up from 30.5 to 51.2)
Bug Fixes
  • Fixed token streaming bug affecting multi-byte unicode characters
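
A common cause of this class of bug is decoding each streamed chunk on its own, so a character whose UTF-8 bytes straddle two chunks is garbled. The sketch below illustrates that general problem and one standard remedy, Python's incremental decoder. It is an assumption about the kind of issue involved, not a description of the actual fix, and `stream_text` is a hypothetical helper.

```python
import codecs


def stream_text(byte_chunks):
    """Yield decoded text from byte chunks without splitting multi-byte characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes are buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush anything still buffered
    if tail:
        yield tail


# "é" takes two bytes in UTF-8 and "🌙" takes four; both are split across chunks.
data = "café 🌙".encode("utf-8")
chunks = [data[:4], data[4:8], data[8:]]

# Decoding each chunk independently produces replacement characters.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)
# The incremental decoder waits for the remaining bytes instead.
streamed = "".join(stream_text(chunks))
```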
Developer
  • gpt-fast-style compile() is now supported in the Hugging Face Transformers implementation
Structured Output

Support for structured output formats such as JSON, XML, Markdown and CSV.
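
When consuming structured output, it is usually worth parsing the response before using it. The sketch below is a generic illustration using Python's standard library. The reply strings are hand-written, hypothetical examples rather than real model output, and `parse_structured` is a hypothetical helper, not part of any Moondream client.

```python
import csv
import io
import json


def parse_structured(text, fmt):
    """Parse a response as JSON or CSV; malformed JSON raises ValueError."""
    if fmt == "json":
        try:
            return json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON: {exc}") from exc
    if fmt == "csv":
        # Each row becomes a dict keyed by the header line; values stay strings.
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical replies, written by hand for illustration.
json_reply = '{"objects": [{"label": "cat", "count": 2}]}'
csv_reply = "label,count\ncat,2\ndog,1\n"

parsed_json = parse_structured(json_reply, "json")
parsed_csv = parse_structured(csv_reply, "csv")
```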

New Capability: Gaze Detection

Experimental capability that tracks human attention in images.

  • Driver gaze detection for automotive applications
  • Sports gaze detection for analyzing player focus
Benchmark Improvements

Significant improvements across industry benchmarks.

Better OCR
  • Improved vision layer for better text reading capabilities
  • Enhanced document querying and understanding
  • Better chart and diagram interpretation
Moondream 0.5B: World's Smallest VLM
  • 0.5B parameters optimized for edge devices and mobile platforms
  • 479 MiB compressed at 8-bit, 375 MiB at 4-bit
  • Memory usage: 996 MiB at 8-bit, 816 MiB at 4-bit
  • Released under the Apache License
Moondream 2024-07-23 Release
OCR & Document Improvements
  • Significant improvements in OCR and document understanding
  • Optimized for local runtime performance
Moondream 2024-04-02 Release
Enhanced OCR & Captioning
  • Improved OCR capabilities for better text recognition
  • Enhanced image captioning for more detailed descriptions
Moondream 2024-03-04 Initial Release
Initial Release
  • 1.8B-parameter vision language model
  • Optimized for edge devices
  • Requires less than 5 GB of memory in 16-bit precision
  • Basic visual understanding capabilities