Open‑Vocabulary Object Detection

Moondream Object Detection understands natural language to locate any object in your images. Fast, accurate bounding boxes powered by visual understanding.

Open-vocabulary

Detect anything you can describe

Traditional object detection relies on predefined classes. Need to find “damaged boxes” or “person wearing red”? If it's not in the training set, you're stuck retraining the model.

Moondream understands language. Describe what you're looking for in plain English and get accurate bounding boxes instantly. No retraining required.

Real-world demos

Built for production use cases

Same API, endless applications. See Object Detection across different domains.

Use Cases

Damage Detection

Robotics

Computer Use

Security & Safety

Benchmarks

Fast, accurate object detection

Moondream 3 achieves the highest scores among the models compared below on the RefCOCO, RefCOCOg, and RefCOCO+ grounding benchmarks, while being faster and more cost-effective.

Benchmark   Moondream   GPT-5   Gemini 2.5 Flash   Claude 4 Sonnet
RefCOCO     91.1        57.2    75.8               30.1
RefCOCOg    88.6        49.8    75.1               26.2
RefCOCO+    81.8        46.3    70.2               23.4

How it works

How object detection works

Describe what you're looking for in natural language and get precise bounding boxes instantly.

Detect: "dirty dishes"

Output (788ms • 759 tokens • $0.000289)
{
  "objects": [
    {
      "x_min": 0.422,
      "y_min": 0.579,
      "x_max": 0.704,
      "y_max": 0.906
    }
  ]
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("kitchen.jpg")

# Detect objects using natural language
result = model.detect(image, "dirty dishes")
print(result["objects"])
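
The coordinates in the sample output above are normalized to the 0 to 1 range, so drawing or cropping a box means scaling it by the image size first. The following is a minimal sketch of that step, assuming the response shape shown in the sample output. `to_pixel_box` is a hypothetical helper name and not part of the moondream package, and the image size is made up for illustration.

```python
def to_pixel_box(obj, image_width, image_height):
    """Scale one normalized box (values in 0..1) to integer pixel corners."""
    return (
        round(obj["x_min"] * image_width),
        round(obj["y_min"] * image_height),
        round(obj["x_max"] * image_width),
        round(obj["y_max"] * image_height),
    )


# Sample response copied from the output shown earlier on this page.
sample = {"objects": [{"x_min": 0.422, "y_min": 0.579, "x_max": 0.704, "y_max": 0.906}]}

# Assumed image size for illustration; with PIL, use `image.size` instead.
width, height = 2000, 1000
pixel_boxes = [to_pixel_box(obj, width, height) for obj in sample["objects"]]
print(pixel_boxes)
```

Each resulting tuple is ordered (left, top, right, bottom), which is the order Pillow's `Image.crop` and `ImageDraw.rectangle` accept.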

FAQ

Common questions about Object Detection, pricing, and integration.

Point returns specific (x, y) coordinates and Segment returns pixel-accurate polygons. Object Detection returns bounding boxes (rectangles) around objects, giving you fast localization without the overhead of precise pixel boundaries.

Object Detection accepts natural language descriptions like "person wearing a hard hat" or "damaged products on the shelf." You can describe objects by attributes, position, color, and relationships.

Object Detection returns all instances matching your description in a single query. For example, searching for "red cars" will locate every red car in the image.

Object Detection returns normalized bounding box coordinates (x_min, y_min, x_max, y_max) for each detected object, making it easy to integrate with downstream systems and pipelines.
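
The sample response earlier on this page reports each box as normalized corner coordinates (`x_min`, `y_min`, `x_max`, `y_max`). Pipelines that expect a top-left corner plus width and height in pixels need a small conversion. The sketch below shows one way to do it, assuming that response shape. `to_xywh_pixels` is a hypothetical helper name, not part of the moondream package.

```python
def to_xywh_pixels(obj, image_width, image_height):
    """Convert normalized corner coordinates to [x, y, width, height] in pixels."""
    x = obj["x_min"] * image_width
    y = obj["y_min"] * image_height
    box_width = (obj["x_max"] - obj["x_min"]) * image_width
    box_height = (obj["y_max"] - obj["y_min"]) * image_height
    return [x, y, box_width, box_height]


# Box copied from the sample output; the 1000x1000 image size is an assumption.
obj = {"x_min": 0.422, "y_min": 0.579, "x_max": 0.704, "y_max": 0.906}
xywh = to_xywh_pixels(obj, 1000, 1000)
print(xywh)
```

The values are left as floats so the caller can decide how to round them for its own format.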

Object Detection uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.

Object Detection is available in both Moondream Cloud and the downloadable model, so you can run it wherever you need.

Moondream outperforms YOLOv11, OWL-ViT, and Gemini 2.5 Flash on open-vocabulary detection benchmarks like COCO and LVIS. Unlike traditional detectors that require predefined classes, Moondream understands natural language and adapts to any description.

Object Detection operates on still images and can be applied frame by frame to video. You can use the same prompt on each frame to build a video object tracking workflow.
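
The frame-by-frame approach can be sketched as a small generator that runs the same prompt on each frame. This is an illustration under stated assumptions, not the official workflow: `detect_per_frame` and `fake_detect` are hypothetical names, `every_nth` is an added parameter for skipping frames, and the detector is passed in as a callable so the example runs without an API key or a video file.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Detection = Dict[str, float]


def detect_per_frame(
    frames: Iterable[Any],
    detect: Callable[[Any, str], Dict[str, List[Detection]]],
    prompt: str,
    every_nth: int = 1,
) -> Iterator[Tuple[int, List[Detection]]]:
    """Yield (frame_index, boxes) for every Nth frame, reusing one prompt."""
    if every_nth < 1:
        raise ValueError("every_nth must be at least 1")
    for index, frame in enumerate(frames):
        if index % every_nth == 0:
            yield index, detect(frame, prompt)["objects"]


# Stand-in detector that mimics the response shape from the sample output.
def fake_detect(frame, prompt):
    return {"objects": [{"x_min": 0.1, "y_min": 0.1, "x_max": 0.2, "y_max": 0.2}]}


# Strings stand in for decoded frames; real frames would be images.
results = list(detect_per_frame(["f0", "f1", "f2", "f3"], fake_detect, "red cars", every_nth=2))
for index, boxes in results:
    print(index, len(boxes))
```

With the client from the code sample above, `model.detect` could be passed as `detect`, with frames decoded from the video as PIL images. This yields per-frame detections only; linking boxes across frames into tracks would need a separate association step.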
Running into problems or need help? Reach us on Discord.