Pinpoint and count
anything you can type

Moondream Point returns exact (x, y) coordinates for every object you describe. Dedicated grounding tokens mean minimal overhead—count defects, click buttons, guide grippers. Fast, cheap, actionable.

Minimal output

The lightest localization primitive

Traditional models output coordinates as verbose text—hundreds of tokens to describe a single location. Need batch processing or real-time inference? You're paying for every character.

Moondream uses dedicated grounding tokens. Two tokens per point. No overhead, no parsing, just normalized (x, y) coordinates ready for robots, UI automation, or counting pipelines.
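Because the output is normalized, mapping a point onto real pixels is a single multiply. A minimal sketch (the `to_pixels` helper is ours, not part of the SDK):

```python
def to_pixels(point, width, height):
    """Scale a normalized Moondream point onto an image of the given size."""
    return round(point["x"] * width), round(point["y"] * height)

# A point at the center of the frame maps to the center pixel of a 1920x1080 image.
print(to_pixels({"x": 0.5, "y": 0.5}, 1920, 1080))  # (960, 540)
```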

Real-world demos

Precision ready for production

Same API, endless applications. See Point in action across different domains.

Use cases

UI automation
Robotics
Industrial
Damage detection

How it works

How point works

Describe what you're looking for in natural language and get precise (x, y) coordinates instantly.

"damaged cookies"

Output

738ms • 744 tokens • $0.000239
{
  "points": [
    { "x": 0.250, "y": 0.789 },
    { "x": 0.373, "y": 0.548 },
    { "x": 0.674, "y": 0.360 }
  ]
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("production_line.jpg")

# Point to defects using natural language
result = model.point(image, "damaged cookies")
print(result["points"])
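Counting falls out of the response shape for free: each match is one small dict, so the count is just the length of the list. A sketch using a hardcoded response in the format shown above:

```python
# Sample response in the shape Point returns (hardcoded here for illustration).
result = {
    "points": [
        {"x": 0.250, "y": 0.789},
        {"x": 0.373, "y": 0.548},
        {"x": 0.674, "y": 0.360},
    ]
}

# One dict per match, so counting is a length check.
defect_count = len(result["points"])
print(f"damaged cookies found: {defect_count}")  # damaged cookies found: 3
```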

FAQ

Common questions about Point, pricing, and integration.

How does Point differ from Object Detection and Segment?

Point returns a single (x, y) coordinate for the location you describe, Object Detection returns bounding boxes, and Segment returns pixel-accurate polygons. Point is ideal when you need the exact center or a specific feature of an object, without the overhead of boundaries or masks.

Can I point to a specific part of an object?

Yes. Point accepts natural language descriptions like "center of the plate," "tip of the pencil," or "animal's nose." You can describe any feature, position, or spatial relationship.

Can Point find multiple objects at once?

Yes. Point can return coordinates for multiple instances matching your description. For example, asking for "centers of all the apples" will return coordinates for each apple in the image.

What format are the returned coordinates?

Point returns normalized (x, y) coordinates between 0 and 1, making it easy to scale to any image resolution or integrate with downstream systems.

What are common applications for Point?

Point excels at robotic grasping (finding grip points), UI automation (clicking specific elements), spatial analysis (measuring distances), and any application that needs exact position data without the overhead of full object boundaries.
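For the UI automation case, a click is one scale-and-call away. A minimal sketch, using made-up point values and assuming pyautogui drives the mouse (pyautogui is our choice, not part of the Moondream SDK):

```python
# Hypothetical normalized point for a prompt like "the submit button" (illustrative values).
point = {"x": 0.62, "y": 0.87}
screen_w, screen_h = 2560, 1440  # assumed display resolution

# Scale the normalized point onto the screen.
x, y = round(point["x"] * screen_w), round(point["y"] * screen_h)

# On a desktop with pyautogui installed, the click is a single call:
# import pyautogui
# pyautogui.click(x, y)
```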

How much does Point cost?

Point uses the same per-token pricing as all other Moondream skills. Every Moondream Cloud account includes $5 in free monthly credits to experiment and build.

Can I run Point locally?

Yes. Point is available in both Moondream Cloud and the downloadable model, giving you flexibility to run it wherever you need.

How accurate is Point?

Point leverages the same grounding capabilities as Moondream's Object Detection, achieving state-of-the-art accuracy on standard benchmarks like RefCOCO, RefCOCOg, and RefCOCO+. Coordinates are precise to the pixel level.
Running into problems or need help? Reach out to us on Discord.
Join Discord