Pinpoint and count
anything you can type

Moondream Point returns exact (x, y) coordinates for every object you describe. Dedicated grounding tokens mean minimal overhead—count defects, click buttons, guide grippers. Fast, cheap, actionable.

Minimal output

The lightest localization primitive

Traditional models output coordinates as verbose text—hundreds of tokens to describe a single location. Need batch processing or real-time inference? You're paying for every character.

Moondream uses dedicated grounding tokens. Two tokens per point. No overhead, no parsing, just normalized (x, y) coordinates ready for robots, UI automation, or counting pipelines.
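Because the output is normalized, mapping a point onto real pixels is a single multiply. A minimal sketch (the `to_pixels` helper is ours, not part of the SDK):

```python
def to_pixels(point, width, height):
    """Scale a normalized Moondream point onto an image of the given size."""
    return round(point["x"] * width), round(point["y"] * height)

# A point at the center of the frame maps to the center pixel of a 1920x1080 image.
print(to_pixels({"x": 0.5, "y": 0.5}, 1920, 1080))  # (960, 540)
```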

Real-world demos

Precision ready for production

Same API, endless applications. See Point in action across different domains.

Use cases

UI automation
Robotics
Industrial
Damage detection

How it works

How point works

Describe what you're looking for in natural language and get precise (x, y) coordinates instantly.

"damaged cookies"

Output

738ms • 744 tokens • $0.000239
{
  "points": [
    { "x": 0.250, "y": 0.789 },
    { "x": 0.373, "y": 0.548 },
    { "x": 0.674, "y": 0.360 }
  ]
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("production_line.jpg")

# Point to defects using natural language
result = model.point(image, "damaged cookies")
print(result["points"])
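Counting falls out of the response shape for free: each match is one small dict, so the count is just the length of the list. A sketch using a hardcoded response in the format shown above:

```python
# Sample response in the shape Point returns (hardcoded here for illustration).
result = {
    "points": [
        {"x": 0.250, "y": 0.789},
        {"x": 0.373, "y": 0.548},
        {"x": 0.674, "y": 0.360},
    ]
}

# One dict per match, so counting is a length check.
defect_count = len(result["points"])
print(f"damaged cookies found: {defect_count}")  # damaged cookies found: 3
```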

FAQ

Common questions about Point, pricing, and integration.

How does Point differ from Object Detection and Segment?

Point returns a single (x, y) coordinate for the location you describe, Object Detection returns bounding boxes, and Segment returns pixel-accurate polygons. Point is ideal when you need the exact center or a specific feature of an object, without the overhead of boundaries or masks.

Can I point to a specific part of an object?

Yes. Point accepts natural language descriptions like "center of the plate," "tip of the pencil," or "animal's nose." You can describe any feature, position, or spatial relationship.

Can Point find multiple objects at once?

Yes. Point can return coordinates for multiple instances matching your description. For example, asking for "centers of all the apples" will return coordinates for each apple in the image.

What format are the returned coordinates?

Point returns normalized (x, y) coordinates between 0 and 1, making it easy to scale to any image resolution or integrate with downstream systems.

What are common applications for Point?

Point excels at robotic grasping (finding grip points), UI automation (clicking specific elements), spatial analysis (measuring distances), and any application that needs exact position data without the overhead of full object boundaries.
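For the UI automation case, a click is one scale-and-call away. A minimal sketch, using made-up point values and assuming pyautogui drives the mouse (pyautogui is our choice, not part of the Moondream SDK):

```python
# Hypothetical normalized point for a prompt like "the submit button" (illustrative values).
point = {"x": 0.62, "y": 0.87}
screen_w, screen_h = 2560, 1440  # assumed display resolution

# Scale the normalized point onto the screen.
x, y = round(point["x"] * screen_w), round(point["y"] * screen_h)

# On a desktop with pyautogui installed, the click is a single call:
# import pyautogui
# pyautogui.click(x, y)
```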

How much does Point cost?

Point uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.

Can I run Point locally?

Yes. Point is available in both Moondream Cloud and the downloadable model, giving you flexibility to run it wherever you need.

How accurate is Point?

Point leverages the same grounding capabilities as Moondream's Object Detection, achieving state-of-the-art accuracy on standard benchmarks like RefCOCO, RefCOCOg, and RefCOCO+. Coordinates are precise to the pixel level.
Running into problems or need help? Reach out to us on Discord.
Join Discord