Accurate Captions, Fully Automated, at Any Scale

Generate rich descriptions for images and video frames. Choose from short, normal, or long captions to match your use case—from quick labels to detailed descriptions.

Caption• short

A brown dog holds a frisbee on a sidewalk next to yellow flowers.

Caption• normal

A white Nike Air Force 1 high-top sneaker photographed against a plain background. The shoe features the classic Air Force 1 silhouette with a perforated toe box, padded ankle collar, and the iconic Nike Swoosh logo on the side. The thick white rubber sole and clean leather upper give it a timeless streetwear aesthetic.

Caption• short

A moose with large antlers walks on a dirt path in a forest.

Caption• long

A bustling airport terminal interior captured from an elevated perspective, showing travelers navigating through a spacious concourse. The architecture features a dramatic vaulted ceiling with exposed structural beams and large glass panels allowing natural light to flood the space. Multiple departure gates line both sides of the terminal, with digital flight information displays visible above each gate. Travelers of various ages pull rolling luggage while others rest on rows of connected seating. The polished floor reflects the overhead lighting, creating a sense of depth and movement. In the background, retail shops and food vendors are visible along the terminal walls. The scene captures the organized chaos of modern air travel during what appears to be a peak travel period.

Caption• normal

A kitchen counter viewed from above showing a stainless steel sink containing dirty dishes. Various plates, bowls, and utensils are stacked in the sink basin. The surrounding countertop is light-colored with some water spots visible. A faucet with a chrome finish is positioned at the back of the sink.

Three lengths

Rich descriptions, no prompts required

Process millions of images with sub-400ms latency. Three length options—short, normal, and long—let you balance speed and detail for any workflow.

Whether you're labeling a media library, generating training data, or building accessibility features, Moondream Caption delivers consistent quality at scale without breaking the bank.

Real-world demos

Built for production use cases

Same API, endless applications. See Image Captioning across different domains.

Product Cataloging

Caption• short

White Nike Air Force 1 high-top sneaker with classic design.

Accessibility Alt-Text

Caption• normal

A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding a splash of color to the scene.

Content Indexing

Caption• short

A moose with large antlers walks on a dirt path in a forest.

Synthetic Data

Caption• long

Other Moondream skills

Query

Answers questions about the image.

Segment

Returns pixel-accurate SVG polygons.

Object Detect

Returns bounding rectangles.

Point

Returns 2D (x, y) coordinates.

How it works

How captioning works

Upload an image and choose your caption length: short for fast labels, normal for balanced detail, or long for comprehensive descriptions.

Caption• normal

A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding color to the urban scene.

Output

385ms • 771 tokens • $0.000324

{
  "caption": "A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding color to the urban scene."
}
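A minimal sketch of consuming that output, assuming the response arrives as a JSON string shaped like the sample above. `extract_caption` is a hypothetical helper name used for illustration; it is not part of the Moondream client.

```python
import json


def extract_caption(raw_response: str) -> str:
    """Return the caption text from a JSON body shaped like the sample output."""
    payload = json.loads(raw_response)
    caption = payload.get("caption") if isinstance(payload, dict) else None
    if not isinstance(caption, str) or not caption.strip():
        raise ValueError("response does not contain a non-empty 'caption' string")
    return caption.strip()


# Shortened stand-in for the sample response shown above.
sample = '{"caption": "A brown dog holds a frisbee on a sidewalk next to yellow flowers."}'
print(extract_caption(sample))
```

Validating the field before use keeps a malformed or empty response from silently ending up as a blank label downstream.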

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("photo.jpg")

# Short caption (~25 words) - fast labels
short = model.caption(image, length="short")
print(short["caption"])

# Normal caption (~80 words) - balanced detail
normal = model.caption(image, length="normal")
print(normal["caption"])

# Long caption (~180 words) - comprehensive
long = model.caption(image, length="long")
print(long["caption"])

FAQ

What are the three caption length options?

Short (~25 words) gives you quick labels for search indexing and tagging. Normal (~80 words) provides balanced descriptions ideal for accessibility alt-text. Long (~180 words) delivers comprehensive narratives for synthetic data generation, robotics, and detailed documentation.
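One way to encode those recommendations is a small lookup table. This is only a sketch: the use-case keys and the `pick_length` helper are hypothetical names chosen for illustration, and only the three length values come from the text above.

```python
# Hypothetical use-case labels mapped to the lengths recommended above.
LENGTH_FOR_USE_CASE = {
    "search_indexing": "short",
    "tagging": "short",
    "alt_text": "normal",
    "synthetic_data": "long",
    "robotics": "long",
    "documentation": "long",
}


def pick_length(use_case: str, default: str = "normal") -> str:
    """Return the caption length for a use case, falling back to a default."""
    return LENGTH_FOR_USE_CASE.get(use_case, default)


print(pick_length("alt_text"))
```

The returned string can be passed as the `length` argument of `model.caption`, as in the Code section.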

How is Caption different from Query?

Caption generates a general description of the entire image automatically, with no prompt needed. Query lets you ask specific questions to get targeted answers about particular aspects. Caption is optimized for batch processing; Query is for interactive analysis.
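The difference can be sketched as a single dispatch function. It assumes the client exposes a `query(image, question)` method returning a dict with an `answer` key, alongside the `caption` call shown in the Code section; treat that signature as an assumption and check it against the client documentation. `describe` and `StandInModel` are hypothetical names, and the stand-in exists only so the sketch runs offline.

```python
from typing import Any, Optional


def describe(model: Any, image: Any, question: Optional[str] = None) -> str:
    """Use Caption when there is no question, Query when there is one."""
    if question is None:
        # No prompt: a general description of the whole image.
        return model.caption(image, length="normal")["caption"]
    # A targeted question about one aspect of the image (assumed signature).
    return model.query(image, question)["answer"]


class StandInModel:
    """Offline stand-in so the sketch runs without an API key."""

    def caption(self, image: Any, length: str = "normal") -> dict:
        return {"caption": f"{length} caption"}

    def query(self, image: Any, question: str) -> dict:
        return {"answer": f"answer to: {question}"}


model = StandInModel()
print(describe(model, image=None))
print(describe(model, image=None, question="What color is the harness?"))
```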

Can Caption handle millions of images?

Yes. Caption is designed for high-volume batch processing. With sub-400ms latency and low per-image costs, you can process large media libraries efficiently. The short length option is optimized for throughput when you need fast labels at scale.
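A minimal sketch of a batch loop, assuming requests can safely be issued from several threads and that any rate limits are handled elsewhere; neither is confirmed here. `caption_batch` and `fake_caption` are hypothetical names, and the stand-in captioner keeps the example runnable without an API key.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def caption_batch(
    paths: Iterable[str],
    caption_one: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Caption many images concurrently and return {path: caption}."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its caption.
        captions = list(pool.map(caption_one, path_list))
    return dict(zip(path_list, captions))


# With the real client, caption_one could wrap the call from the Code section:
#   lambda p: model.caption(Image.open(p), length="short")["caption"]
def fake_caption(path: str) -> str:
    """Stand-in captioner used only to make the sketch runnable offline."""
    return f"caption for {path}"


results = caption_batch(["a.jpg", "b.jpg", "c.jpg"], fake_caption, max_workers=2)
print(results["b.jpg"])
```

Threads suit this workload because each call spends most of its time waiting on the network; `max_workers` is the knob to tune against whatever request limits apply.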

Does Caption work with text in images?

Yes. Moondream can read and incorporate visible text into captions, making it useful for documents, signs, product labels, and screenshots where text provides important context.

How does pricing work?

Caption uses per-token pricing. Short captions typically use ~20 output tokens; normal captions use ~60 tokens; long captions use ~150 tokens. Every Moondream Cloud account includes $5 in free monthly credits.
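As a rough worked example, the arithmetic below uses the one concrete cost figure on this page: the sample normal-length request in the Output panel, billed at $0.000324. It assumes every request costs the same as that sample, which will not hold exactly because the bill depends on each request's token count. The helper names are hypothetical.

```python
FREE_MONTHLY_CREDITS_USD = 5.00     # free monthly credits stated above
SAMPLE_REQUEST_COST_USD = 0.000324  # sample normal-length request from the Output panel


def captions_per_budget(budget_usd: float, cost_per_caption_usd: float) -> int:
    """How many whole captions fit in a budget at a fixed per-caption cost."""
    if cost_per_caption_usd <= 0:
        raise ValueError("cost_per_caption_usd must be positive")
    return int(budget_usd // cost_per_caption_usd)


def batch_cost(num_images: int, cost_per_caption_usd: float) -> float:
    """Total cost of captioning num_images at a fixed per-caption cost."""
    return num_images * cost_per_caption_usd


print(captions_per_budget(FREE_MONTHLY_CREDITS_USD, SAMPLE_REQUEST_COST_USD))
print(round(batch_cost(1_000_000, SAMPLE_REQUEST_COST_USD), 2))
```

At the sample cost, the $5 credit covers roughly 15,400 normal captions, and a million images would come to about $324. Short captions should cost less per image and long captions more.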

Is Caption available in the downloadable Moondream model?

Yes. Caption is available in both Moondream Cloud and the downloadable model. For large batch jobs, you can run Caption locally on your own GPUs to eliminate per-request costs entirely.

Can I use Caption for video?

Yes. Caption operates on still images and can be applied to key frames extracted from video. Combined with its low latency, this enables video summarization, scene labeling, and content indexing workflows.
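A minimal sketch of choosing which frames to caption. It uses fixed-interval sampling as a simple stand-in for true key-frame extraction (scene-change detection would be a separate step), and it leaves frame decoding to whatever video library you already use. `keyframe_indices` is a hypothetical helper name.

```python
from typing import List


def keyframe_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Indices of frames sampled at a fixed time interval, starting at frame 0."""
    if total_frames < 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and every_seconds must be > 0")
    # Never step by less than one frame, even for very short intervals.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 2 seconds.
print(keyframe_indices(total_frames=300, fps=30.0, every_seconds=2.0))
# -> [0, 60, 120, 180, 240]
```

Each selected frame can then be decoded to an image and passed to `model.caption` like any still.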

Which length should I use for accessibility alt-text?

Use "normal" length for accessibility alt-text. It generates descriptive, contextual captions (~80 words) that work well for screen readers, covering visual elements in a way that's meaningful to users who cannot see the image.
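When a caption is written into markup as alt-text, it needs escaping so that quotes or ampersands in the generated text cannot break the attribute. A minimal sketch using only the Python standard library; `img_tag` is a hypothetical helper name.

```python
import html


def img_tag(src: str, caption: str) -> str:
    """Build an <img> element whose alt attribute is the escaped caption."""
    # Collapse runs of whitespace so a multi-line caption becomes one clean line.
    alt = html.escape(" ".join(caption.split()), quote=True)
    return f'<img src="{html.escape(src, quote=True)}" alt="{alt}">'


print(img_tag("dog.jpg", 'A dog holds a "frisbee" & a leash.'))
# -> <img src="dog.jpg" alt="A dog holds a &quot;frisbee&quot; &amp; a leash.">
```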

Running into problems or need help? Reach us on Discord.
