Vision AI that can answer questions at the speed of sight

Moondream Query answers natural language questions about images with detailed, accurate responses. From OCR to scene understanding, get the answers you need in milliseconds.

Q: "what is happening?"

→ a man is trying to break into a white car

Q: "what brands are visible?"

→ Coca-Cola

Q: "category: dining out or groceries?"

→ dining out

Q: "describe what you see"

→ Two cheetahs lying in dry grass

Q: "workers and customers as json"

→ { "workers": 1, "customers": 5 }

Natural language

Your question is the only code you need

Traditional computer vision requires predefined categories and outputs. If you need to extract the “total amount due” or answer “is this person wearing PPE?” and that task is not in the training set, you're stuck building custom pipelines.

Moondream understands language, not just labels. Ask any question about any image and get detailed, accurate answers instantly. From OCR to scene understanding, your question is the only limit.

Real-world demos

Visual Understanding, Instantly

Same API, endless applications. See Query across different domains.

Security Monitoring

Query: "what is happening?"

Response: a man is trying to break into a white car

Document Extraction

Query: "return as json: total, merchant, item"

Response: { "total": 20.0, "merchant": "Anthropic, PBC", "item": "Claude Pro" }

Safety Compliance

Query: "what is the worker doing?"

Response: Inspecting equipment while wearing a hard hat, noting details on a clipboard.

Media Tagging

Query: "provide 5 tags"

Response: ["Golden Gate Bridge", "suspension bridge", "orange", "water", "sunrise"]

Other Moondream Skills

Segment

Returns pixel-accurate SVG polygons.

Object Detect

Returns bounding rectangles.

Caption

Describes the image.

Point

Returns 2D (x, y) coordinates.
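
Detection boxes and points usually need converting to pixel coordinates before they can be drawn on an image. The sketch below assumes the responses use coordinates normalized to the 0..1 range and the key names x_min, y_min, x_max, y_max, x, and y. Neither assumption is stated on this page, so check the API reference before relying on them. The sample values are made up for illustration.

```python
from typing import Dict, Tuple


def box_to_pixels(box: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized bounding box to pixel coordinates.

    Assumes keys x_min, y_min, x_max, y_max with values in 0..1.
    """
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )


def point_to_pixels(point: Dict[str, float], width: int, height: int) -> Tuple[int, int]:
    """Scale a normalized (x, y) point to pixel coordinates."""
    return (round(point["x"] * width), round(point["y"] * height))


# Hypothetical sample values, not real model output.
sample_box = {"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}
sample_point = {"x": 0.5, "y": 0.25}

pixel_box = box_to_pixels(sample_box, width=640, height=480)
pixel_point = point_to_pixels(sample_point, width=640, height=480)
print(pixel_box, pixel_point)
```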

How it works

How Query works

Ask any natural language question about an image and get detailed, accurate answers.

Query

"what is happening?"

Output

485ms • 741 tokens • $0.000249

{
  "answer": "A man is trying to break into a white car"
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("security_camera.jpg")

# Ask a question about the image
result = model.query(image, "what is happening?")
print(result["answer"])
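
To see what latency looks like on your own images and network, you can wrap the same call in a timer. This is a minimal sketch: timed_query is a hypothetical helper, and StubModel is a stand-in so the example runs without an API key. It only assumes that the model object has a query(image, question) method returning a dict with an "answer" key, as in the snippet above.

```python
import time
from typing import Any, Tuple


def timed_query(model: Any, image: Any, question: str) -> Tuple[str, float]:
    """Run one query and return (answer, elapsed milliseconds)."""
    start = time.perf_counter()
    result = model.query(image, question)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result["answer"], elapsed_ms


class StubModel:
    """Offline stand-in for the real client; returns a canned answer."""

    def query(self, image: Any, question: str) -> dict:
        return {"answer": "A man is trying to break into a white car"}


answer, elapsed_ms = timed_query(StubModel(), image=None, question="what is happening?")
print(f"{answer} ({elapsed_ms:.1f} ms)")
```

Measured this way, the number includes network round-trip and image upload time, so it can differ from the figure reported by the service.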

FAQ

How is Query different from Caption?

Caption generates a general description of the entire image. Query lets you ask specific questions and get targeted answers about particular aspects of the image, from scene understanding to document extraction.

What types of questions can I ask?

Query supports a wide range of questions including scene understanding ("What is happening?"), object counting ("How many cars are visible?"), document extraction ("What is the total amount?"), compliance checking ("Is this person wearing PPE?"), and more.
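
For compliance-style questions such as "Is this person wearing PPE?", the answer comes back as free text, so downstream code often needs a plain true/false value. The helper below is a hypothetical sketch that only looks at the first word of the answer. It returns None when the answer does not start with yes or no, so that ambiguous answers can be reviewed by hand.

```python
from typing import Optional


def answer_to_bool(answer: str) -> Optional[bool]:
    """Map a free-text answer to True, False, or None if it is not a clear yes/no."""
    words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


print(answer_to_bool("Yes, the worker is wearing a hard hat."))
print(answer_to_bool("No."))
print(answer_to_bool("Inspecting equipment while wearing a hard hat."))
```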

Can Query extract structured data from documents?

Yes. Query excels at extracting information from receipts, invoices, forms, and other documents. You can ask for specific fields or request structured output like JSON.
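
When you request JSON, the answer still arrives as a string, so it has to be parsed before use. The sketch below is one defensive way to do that. parse_json_answer is a hypothetical helper, and the fallback for answers wrapped in extra text is a precaution rather than documented behavior. The receipt string is the Document Extraction response shown earlier on this page.

```python
import json
from typing import Any


def parse_json_answer(answer: str) -> Any:
    """Parse a JSON answer, tolerating extra text around the JSON value."""
    text = answer.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} or [...] span in the text.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found in answer")
    start = min(starts)
    closer = "}" if text[start] == "{" else "]"
    end = text.rfind(closer)
    if end < start:
        raise ValueError("unterminated JSON value in answer")
    return json.loads(text[start:end + 1])


receipt = parse_json_answer('{ "total": 20.0, "merchant": "Anthropic, PBC", "item": "Claude Pro" }')
counts = parse_json_answer('Answer: { "workers": 1, "customers": 5 }')
print(receipt["merchant"], counts["customers"])
```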

How does pricing work?

Query uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.
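
As a rough back-of-the-envelope estimate, the sample output on this page (741 tokens for $0.000249) can be used to gauge how far the free credits go. The calculation below assumes that sample is representative of your queries and treats the cost as a single blended per-token rate. Actual pricing may distinguish input and output tokens, so treat the result as an approximation.

```python
from decimal import Decimal

# Figures taken from the sample output on this page.
SAMPLE_COST_USD = Decimal("0.000249")
SAMPLE_TOKENS = 741
FREE_CREDIT_USD = Decimal("5")

# Blended cost per million tokens implied by the sample.
usd_per_million_tokens = SAMPLE_COST_USD / SAMPLE_TOKENS * 1_000_000

# Whole queries of the same size covered by the monthly free credits.
queries_per_credit = int(FREE_CREDIT_USD / SAMPLE_COST_USD)

print(f"~${round(usd_per_million_tokens, 3)} per million tokens")  # ~$0.336 per million tokens
print(f"~{queries_per_credit} sample-sized queries per $5")        # ~20080 sample-sized queries per $5
```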

Is Query available in the downloadable Moondream model?

Yes. Query is available in both Moondream Cloud and the downloadable model. You can run Query locally for free on your own hardware.

How fast is Query compared to other VLMs?

Moondream Query delivers sub-200ms latency for most queries, making it suitable for real-time applications like video analysis and interactive systems. This is significantly faster than larger models like GPT-4o or Claude.

Can I use Query for video analysis?

Yes. Query operates on still images and can be applied frame-by-frame to video. Combined with its low latency, this enables real-time video understanding and monitoring applications.
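
The frame-by-frame approach can be sketched as a loop that queries every Nth frame rather than all of them, which keeps cost and latency in check. query_every_nth_frame and the every_n parameter are hypothetical names. The function accepts any iterable of frames (for example, images decoded by a video library of your choice) and any object with a query(image, question) method. StubModel stands in for the real client so the example runs offline.

```python
from typing import Any, Iterable, List, Tuple


def query_every_nth_frame(
    model: Any, frames: Iterable[Any], question: str, every_n: int = 10
) -> List[Tuple[int, str]]:
    """Ask the same question of every Nth frame; return (frame index, answer) pairs."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    results: List[Tuple[int, str]] = []
    for index, frame in enumerate(frames):
        if index % every_n != 0:
            continue  # skip frames between samples
        results.append((index, model.query(frame, question)["answer"]))
    return results


class StubModel:
    """Offline stand-in for the real client; echoes which frame it saw."""

    def query(self, image: Any, question: str) -> dict:
        return {"answer": f"frame {image}"}


# Integers stand in for decoded video frames in this offline example.
answers = query_every_nth_frame(StubModel(), frames=range(25), question="what is happening?", every_n=10)
print(answers)
```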

Running into problems or need help? Reach us on Discord.
