Open‑Vocabulary Object Detection

Moondream Object Detection understands natural language to locate any object in your images. Fast, accurate bounding boxes powered by visual understanding.

Open-vocabulary

Detect anything you can describe

Traditional object detection relies on predefined classes. Need to find “damaged boxes” or “person wearing red”? If it's not in the training set, you're stuck retraining the model.

Moondream understands language. Describe what you're looking for in plain English and get accurate bounding boxes instantly. No retraining required.

Real-world demos

Built for production use cases

Same API, endless applications. See Object Detection across different domains.

Use cases: Damage Detection, Robotics, Computer Use, Security & Safety.

Benchmarks

Fast, accurate object detection

Moondream 3 scores higher than GPT-5, Gemini 2.5 Flash, and Claude 4 Sonnet on the RefCOCO, RefCOCOg, and RefCOCO+ grounding benchmarks, while being faster and more cost-effective.

Benchmark	Moondream	GPT-5	Gemini 2.5 Flash	Claude 4 Sonnet
RefCOCO	91.1	57.2	75.8	30.1
RefCOCOg	88.6	49.8	75.1	26.2
RefCOCO+	81.8	46.3	70.2	23.4

Other Moondream Skills

Query

Answers questions about the image.

Segmentation

Returns pixel-accurate polygons.

Caption

Describes the image.

Point

Returns 2D (x, y) coordinates.

How it works

How object detection works

Describe what you're looking for in natural language and get precise bounding boxes instantly.

Detect

"dirty dishes"

Output

788ms • 759 tokens • $0.000289

{
  "objects": [
    {
      "x_min": 0.422,
      "y_min": 0.579,
      "x_max": 0.704,
      "y_max": 0.906
    }
  ]
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("kitchen.jpg")

# Detect objects using natural language
result = model.detect(image, "dirty dishes")
print(result["objects"])

FAQ

How is Object Detection different from Point and Segment?

Point returns specific (x, y) coordinates and Segment returns pixel-accurate polygons. Object Detection returns bounding boxes (rectangles) around objects, giving you fast localization without the overhead of precise pixel boundaries.

Does Object Detection support open-vocabulary prompts?

Yes. Object Detection accepts natural language descriptions like "person wearing a hard hat" or "damaged products on the shelf." You can describe objects by attributes, position, color, and relationships.

Can Object Detection find multiple instances?

Yes. Object Detection returns all instances matching your description in a single query. For example, searching for "red cars" will locate every red car in the image.
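
Because every match comes back in the same "objects" list, counting or ranking instances is ordinary list handling. The following is a minimal sketch, assuming the response shape shown in the Output section. The helper name box_area and the sample values are illustrative and not part of the Moondream API.

```python
# Hypothetical response with two matches; the values are made up for illustration.
result = {
    "objects": [
        {"x_min": 0.10, "y_min": 0.20, "x_max": 0.30, "y_max": 0.40},
        {"x_min": 0.50, "y_min": 0.50, "x_max": 0.90, "y_max": 1.00},
    ]
}


def box_area(box):
    """Area of a normalized box as a fraction of the whole image."""
    return (box["x_max"] - box["x_min"]) * (box["y_max"] - box["y_min"])


objects = result["objects"]
count = len(objects)

# Largest instance first, e.g. to prioritize the most prominent match.
largest_first = sorted(objects, key=box_area, reverse=True)

print(f"{count} matches; largest covers {box_area(largest_first[0]):.0%} of the image")
```

Sorting by area is only one possible ranking; the same pattern works for position-based filters such as "left half of the image".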

What format does Object Detection use for its outputs?

Object Detection returns normalized bounding box coordinates (x_min, y_min, x_max, y_max), each between 0 and 1, for every detected object, making it easy to integrate with downstream systems and pipelines.
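
The sample output uses corner coordinates (x_min, y_min, x_max, y_max) scaled to the 0 to 1 range, while many downstream tools expect pixel values as (x, y, width, height). A minimal sketch of that conversion follows. The function name to_pixel_xywh and the 1000x800 image size are assumptions made for illustration.

```python
def to_pixel_xywh(box, image_width, image_height):
    """Convert a normalized corner box to pixel (x, y, width, height)."""
    left = round(box["x_min"] * image_width)
    top = round(box["y_min"] * image_height)
    right = round(box["x_max"] * image_width)
    bottom = round(box["y_max"] * image_height)
    return left, top, right - left, bottom - top


# The box from the sample output, applied to an assumed 1000x800 image.
sample = {"x_min": 0.422, "y_min": 0.579, "x_max": 0.704, "y_max": 0.906}
print(to_pixel_xywh(sample, 1000, 800))  # (422, 463, 282, 262)
```

Rounding each corner before subtracting keeps the box edges on whole pixels, so width and height always agree with the rounded corners.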

How does pricing work?

Object Detection uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.
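
As rough budgeting arithmetic, the sample request in the Output section used 759 tokens and cost $0.000289. The sketch below assumes every request costs about the same as that sample, which will not hold exactly because cost depends on each request's token count. The per-million-token figure is a blended rate implied by the sample, not an official price.

```python
from decimal import Decimal

# Figures taken from this page: the sample request cost and the free monthly credits.
sample_request_cost = Decimal("0.000289")  # USD, for the 759-token sample request
free_monthly_credits = Decimal("5")        # USD

# Assumption: every request costs about as much as the sample one.
requests_covered = int(free_monthly_credits // sample_request_cost)

# Effective blended rate implied by the sample; not an official price.
cost_per_million_tokens = sample_request_cost / Decimal(759) * Decimal(1_000_000)

print(f"About {requests_covered:,} requests per month")
print(f"About ${cost_per_million_tokens:.2f} per million tokens (blended)")
```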

Is Object Detection available in the downloadable Moondream model?

Yes. Object Detection is available in both Moondream Cloud and the downloadable model, giving you flexibility to run it wherever you need.

How does Moondream compare to YOLO and other detectors?

Moondream outperforms YOLOv11, OWL-ViT, and Gemini 2.5 Flash on open-vocabulary detection benchmarks like COCO and LVIS. Unlike traditional detectors that require predefined classes, Moondream understands natural language and adapts to any description.

Can I use Object Detection for video?

Yes. Object Detection operates on still images and can be applied frame-by-frame to video. You can use the same prompt on each frame to build a video object tracking workflow.
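
A minimal sketch of the frame-by-frame approach follows. It assumes frames arrive as an iterable of already decoded images (from whichever video library you use) and that the model object exposes the detect(image, prompt) call shown in the Code section. The names detect_per_frame, every_n, and FakeModel are illustrative and not part of the Moondream API. Linking boxes across frames into tracks is left out.

```python
def detect_per_frame(model, frames, prompt, every_n=1):
    """Yield (frame_index, boxes) for every `every_n`-th frame.

    `model` is any object with a detect(image, prompt) method that returns
    {"objects": [...]}, matching the response shape shown in the Output section.
    """
    for index, frame in enumerate(frames):
        if index % every_n != 0:
            continue  # Skip frames to cut latency and cost.
        result = model.detect(frame, prompt)
        yield index, result["objects"]


class FakeModel:
    """Stand-in used here so the sketch runs without an API key."""

    def detect(self, image, prompt):
        return {"objects": [{"x_min": 0.1, "y_min": 0.1, "x_max": 0.2, "y_max": 0.2}]}


frames = ["frame0", "frame1", "frame2", "frame3"]  # Placeholders for decoded images.
detections = list(detect_per_frame(FakeModel(), frames, "red cars", every_n=2))
print([index for index, _ in detections])  # [0, 2]
```

Sampling every n-th frame trades temporal resolution for fewer requests; with a real model, replace FakeModel() with the client created in the Code section.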

Running into problems or need help? Reach us on Discord.
