Introduction to Moondream - Documentation

Visual Querying

Answer natural-language questions about any image. Identify objects, understand the relationships between them, and extract specific information from visual content, with detailed responses grounded in what the model sees.

/caption

Rich Image Captioning

Generate detailed descriptions that go beyond simple object identification to convey scene context, relationships, and subtleties such as mood or style. This is well suited to content management, accessibility, and creative applications.

/detect

Object Detection

Identify and locate objects within images. This is valuable in retail, inventory management, security, and analytics, where knowing which objects are present and where they are is crucial.
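
A detection result is usually consumed by mapping each object's box onto the source image. The sketch below is a minimal illustration, not the library's own code. It assumes each detected object is a dictionary with x_min, y_min, x_max, and y_max keys holding coordinates normalized to the range 0 to 1. Those key names, the PixelBox type, the to_pixel_box helper, and the sample values are assumptions made for this example, so check them against the actual /detect response before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelBox:
    """Axis-aligned box in integer pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixel_box(obj: dict, width: int, height: int) -> PixelBox:
    """Scale one normalized detection to pixel coordinates.

    Assumes `obj` carries x_min/y_min/x_max/y_max in [0, 1]; these key
    names are an assumption about the response shape, not confirmed here.
    """
    return PixelBox(
        left=round(obj["x_min"] * width),
        top=round(obj["y_min"] * height),
        right=round(obj["x_max"] * width),
        bottom=round(obj["y_max"] * height),
    )


# Made-up sample data standing in for a detection response on a 640x480 image.
detections = [{"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}]
boxes = [to_pixel_box(d, 640, 480) for d in detections]
print(boxes)
```

Keeping the conversion in one small function means the same boxes can be handed to whatever drawing or cropping code the application already uses.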

/point

Visual Pointing

Return the precise location of specific elements in an image when asked about them. This is ideal for interactive applications where users need to identify or work with particular parts of visual content.
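
In an interactive application, pointing results typically need to be matched against where the user clicked. The sketch below shows one possible way to do that, assuming each returned point is a dictionary with x and y keys normalized to the range 0 to 1. The key names, both helper functions, and the sample values are assumptions made for illustration rather than a confirmed response format.

```python
import math


def to_pixel_point(point: dict, width: int, height: int) -> tuple[int, int]:
    """Scale one normalized point (assumed keys "x" and "y") to pixel coordinates."""
    return (round(point["x"] * width), round(point["y"] * height))


def nearest_point(click: tuple[int, int], pixel_points: list[tuple[int, int]]) -> tuple[int, int]:
    """Return the pixel point closest to the click, by Euclidean distance."""
    return min(pixel_points, key=lambda p: math.dist(p, click))


# Made-up sample data standing in for a pointing response on an 800x600 image.
points = [{"x": 0.25, "y": 0.5}, {"x": 0.75, "y": 0.25}]
pixel_points = [to_pixel_point(p, 800, 600) for p in points]

# A hypothetical user click near the second point.
closest = nearest_point((580, 160), pixel_points)
print(closest)
```

A real interface would likely add a maximum distance so that clicks far from every point select nothing.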