Vision AI that can answer questions at the speed of sight

Moondream Query answers natural language questions about images with detailed, accurate responses. From OCR to scene understanding, get the answers you need in milliseconds.

Security monitoring
Q: "what is happening?"
A: a man is trying to break into a white car

Brand detection
Q: "what brands are visible?"
A: Coca-Cola

Category classification
Q: "category: dining out or groceries?"
A: dining out

Wildlife observation
Q: "describe what you see"
A: Two cheetahs lying in dry grass

Retail analytics
Q: "workers and customers as json"
A: { "workers": 1, "customers": 5 }

Natural language

Your question is the only code you need

Traditional computer vision requires predefined categories and outputs. Need to extract “total amount due” or answer “is this person wearing PPE?”? If it's not in the training set, you're stuck building custom pipelines.

Moondream understands language, not just labels. Ask any question about any image and get detailed, accurate answers instantly. From OCR to scene understanding, your question is the only limit.

Real-world demos

Visual Understanding, Instantly

Same API, endless applications. See Query across different domains.

Security Monitoring (image: security camera footage)
Query: "what is happening?"
Response: a man is trying to break into a white car

Document Extraction (image: digital invoice)
Query: "return as json: total, merchant, item"
Response: { "total": 20.0, "merchant": "Anthropic, PBC", "item": "Claude Pro" }

Safety Compliance (image: construction worker with hard hat)
Query: "what is the worker doing?"
Response: Inspecting equipment while wearing a hard hat, noting details on a clipboard.

Media Tagging (image: Golden Gate Bridge)
Query: "provide 5 tags"
Response: ["Golden Gate Bridge", "suspension bridge", "orange", "water", "sunrise"]

How it works

How Query works

Ask any natural language question about an image and get detailed, accurate answers.

Image: security camera footage

Query: "what is happening?"

Output (485ms • 741 tokens • $0.000249):

{
  "answer": "A man is trying to break into a white car"
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("security_camera.jpg")

# Ask a question about the image
result = model.query(image, "what is happening?")
print(result["answer"])
FAQ

Common questions about Query, pricing, and integration.

How is Query different from Caption?

Caption generates a general description of the entire image. Query lets you ask specific questions and get targeted answers about particular aspects of the image, from scene understanding to document extraction.

What kinds of questions can I ask?

Query supports a wide range of questions, including scene understanding ("What is happening?"), object counting ("How many cars are visible?"), document extraction ("What is the total amount?"), and compliance checking ("Is this person wearing PPE?").
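For closed questions such as the PPE check, the answer comes back as free text, so an application usually has to map it to a boolean. The sketch below is one way to do that. The helper name `ask_yes_no` and the stand-in `fake_query` function are hypothetical and not part of the Moondream client; the stand-in only exists so the example runs offline.

```python
from typing import Callable, Optional


def ask_yes_no(query_fn: Callable[[str], str], question: str) -> Optional[bool]:
    """Ask a closed question and map the free-text answer to True/False/None."""
    answer = query_fn(question).strip().lower()
    # Look at the first word only and drop punctuation, e.g. "Yes," or "No."
    first_word = answer.split()[0].strip(".,!:;") if answer else ""
    if first_word == "yes":
        return True
    if first_word == "no":
        return False
    return None  # not a clear yes/no; leave the decision to the caller


# Hypothetical stand-in for the model so the sketch runs without an API key.
def fake_query(question: str) -> str:
    return "Yes, the worker is wearing a hard hat."


print(ask_yes_no(fake_query, "Is this person wearing PPE? Answer yes or no."))  # prints True
```

With the client from the code sample above, `query_fn` could plausibly be `lambda q: model.query(image, q)["answer"]`. Returning `None` for ambiguous answers keeps the helper from silently treating "unclear" as "no".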

Can Query extract data from documents?

Yes. Query excels at extracting information from receipts, invoices, forms, and other documents. You can ask for specific fields or request structured output like JSON.
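When you request JSON, the answer still arrives as a string, so it needs to be parsed before use. The following is a minimal sketch, assuming the answer may occasionally carry extra text around the JSON object. The helper `parse_json_answer` is hypothetical, and the sample string is the invoice response from the demo above.

```python
import json
from typing import Any, Dict


def parse_json_answer(answer: str) -> Dict[str, Any]:
    """Parse a JSON object out of a model answer string."""
    text = answer.strip()
    # Keep only the outermost {...} span in case the model adds surrounding text.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"no JSON object found in answer: {answer!r}")
    return json.loads(text[start:end + 1])


# Sample answer taken from the document-extraction demo.
sample = '{ "total": 20.0, "merchant": "Anthropic, PBC", "item": "Claude Pro" }'
fields = parse_json_answer(sample)
print(fields["merchant"], fields["total"])  # prints: Anthropic, PBC 20.0
```

Malformed JSON raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so callers can handle both failure modes with a single `except ValueError`.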

How much does Query cost?

Query uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.
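As a rough sense of scale, the free credits can be divided by the cost of the example query shown earlier ($0.000249 for 741 tokens). This is only a back-of-the-envelope sketch: it assumes every query costs the same as that one example, which will not hold in practice because cost depends on token count. The function name is hypothetical.

```python
FREE_MONTHLY_CREDITS_USD = 5.00     # free monthly credits stated above
EXAMPLE_QUERY_COST_USD = 0.000249   # cost shown for the single 741-token example query


def queries_covered(budget_usd: float, cost_per_query_usd: float) -> int:
    """Whole number of queries a budget covers at a flat per-query cost."""
    if cost_per_query_usd <= 0:
        raise ValueError("cost per query must be positive")
    return int(budget_usd // cost_per_query_usd)


# Assumption: every query costs the same as the example query.
print(queries_covered(FREE_MONTHLY_CREDITS_USD, EXAMPLE_QUERY_COST_USD))  # prints 20080
```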

Can I run Query locally?

Yes. Query is available in both Moondream Cloud and the downloadable model. You can run Query locally for free on your own hardware.

How fast is Query?

Moondream Query delivers sub-200ms latency for most queries, making it suitable for real-time applications like video analysis and interactive systems. This is significantly faster than larger models like GPT-4o or Claude.

Does Query work with video?

Yes. Query operates on still images and can be applied frame-by-frame to video. Combined with its low latency, this enables real-time video understanding and monitoring applications.
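A frame-by-frame loop rarely needs to query every frame; sampling every Nth frame is a common way to keep cost and latency down. The sketch below shows that pattern under stated assumptions: `query_every_nth_frame` is a hypothetical helper, frames are assumed to be already decoded by whatever video library you use, and a stand-in lambda replaces the model so the example runs offline.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def query_every_nth_frame(
    frames: Iterable[Frame],
    query_fn: Callable[[Frame], str],
    every_n: int = 30,
) -> Iterator[Tuple[int, str]]:
    """Yield (frame_index, answer) for every `every_n`-th frame."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    for index, frame in enumerate(frames):
        if index % every_n == 0:
            yield index, query_fn(frame)


# Offline demo: strings stand in for decoded frames, a lambda stands in for the model.
fake_frames = [f"frame-{i}" for i in range(90)]
results = list(
    query_every_nth_frame(fake_frames, lambda f: f"answer for {f}", every_n=30)
)
print([index for index, _ in results])  # prints [0, 30, 60]
```

In a real pipeline, `query_fn` could plausibly wrap the client call from the code sample, for example `lambda img: model.query(img, "what is happening?")["answer"]`, with each frame converted to a PIL image first.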
Running into problems or need help? Reach us on Discord.