Accurate Captions, Fully Automated, at Any Scale

Generate rich descriptions for images and video frames. Choose from short, normal, or long captions to match your use case—from quick labels to detailed descriptions.

Dog with frisbee
Caption: short

A brown dog holds a frisbee on a sidewalk next to yellow flowers.

Nike Air Force 1 shoe
Caption: normal

A white Nike Air Force 1 high-top sneaker photographed against a plain background. The shoe features the classic Air Force 1 silhouette with a perforated toe box, padded ankle collar, and the iconic Nike Swoosh logo on the side. The thick white rubber sole and clean leather upper give it a timeless streetwear aesthetic.

Moose on trail camera
Caption: short

A moose with large antlers walks on a dirt path in a forest.

Airport terminal
Caption: long

A bustling airport terminal interior captured from an elevated perspective, showing travelers navigating through a spacious concourse. The architecture features a dramatic vaulted ceiling with exposed structural beams and large glass panels allowing natural light to flood the space. Multiple departure gates line both sides of the terminal, with digital flight information displays visible above each gate. Travelers of various ages pull rolling luggage while others rest on rows of connected seating. The polished floor reflects the overhead lighting, creating a sense of depth and movement. In the background, retail shops and food vendors are visible along the terminal walls. The scene captures the organized chaos of modern air travel during what appears to be a peak travel period.

Kitchen scene
Caption: normal

A kitchen counter viewed from above showing a stainless steel sink containing dirty dishes. Various plates, bowls, and utensils are stacked in the sink basin. The surrounding countertop is light-colored with some water spots visible. A faucet with a chrome finish is positioned at the back of the sink.

Three lengths

Rich descriptions, no prompts required

Process millions of images with sub-400ms latency. Three length options—short, normal, and long—let you balance speed and detail for any workflow.

Whether you're labeling a media library, generating training data, or building accessibility features, Moondream Caption delivers consistent quality at scale without breaking the bank.

Real-world demos

Built for production use cases

Same API, endless applications. See Image Captioning across different domains.

Nike shoe product
Product Cataloging
Caption: short

White Nike Air Force 1 high-top sneaker with classic design.

Dog with frisbee
Accessibility Alt-Text
Caption: normal

A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding a splash of color to the scene.

Moose trail camera
Content Indexing
Caption: short

A moose with large antlers walks on a dirt path in a forest.

Airport terminal
Synthetic Data
Caption: long

A bustling airport terminal interior captured from an elevated perspective, showing travelers navigating through a spacious concourse. The architecture features a dramatic vaulted ceiling with exposed structural beams and large glass panels allowing natural light to flood the space. Multiple departure gates line both sides of the terminal. Travelers of various ages pull rolling luggage while others rest on rows of connected seating. The polished floor reflects the overhead lighting.

How it works

How captioning works

Upload an image and choose your caption length: short for fast labels, normal for balanced detail, or long for comprehensive descriptions.

Dog with frisbee
Caption: normal

A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding color to the urban scene.

Output

385ms • 771 tokens • $0.000324
{
  "caption": "A brown dog wearing a blue harness and leash holds a green and blue frisbee in its mouth while standing on a gray sidewalk. A bouquet of yellow flowers is visible nearby, adding color to the urban scene."
}

Code

import moondream as md
from PIL import Image

# Initialize with API key
model = md.vl(api_key="your-api-key")

# Load your image
image = Image.open("photo.jpg")

# Short caption (~25 words) - fast labels
short = model.caption(image, length="short")
print(short["caption"])

# Normal caption (~80 words) - balanced detail
normal = model.caption(image, length="normal")
print(normal["caption"])

# Long caption (~180 words) - comprehensive
long = model.caption(image, length="long")
print(long["caption"])

FAQ

Common questions about Caption, length options, and batch processing.

Short (~25 words) gives you quick labels for search indexing and tagging. Normal (~80 words) provides balanced descriptions ideal for accessibility alt-text. Long (~180 words) delivers comprehensive narratives for synthetic data generation, robotics, and detailed documentation.

Caption generates a general description of the entire image automatically—no prompt needed. Query lets you ask specific questions to get targeted answers about particular aspects. Caption is optimized for batch processing; Query is for interactive analysis.

Caption is designed for high-volume batch processing. With sub-400ms latency and low per-image costs, you can process large media libraries efficiently. The short length option is optimized for throughput when you need fast labels at scale.
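A batch run over a media library might look like the sketch below. It is a minimal illustration, not an official client feature: `caption_batch` and its `caption_fn` argument are hypothetical names, and the thread pool size is an arbitrary choice. In practice `caption_fn` would wrap the `model.caption(image, length="short")` call from the code sample above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def caption_batch(
    paths: List[str],
    caption_fn: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Caption many images concurrently (hypothetical helper).

    caption_fn takes an image path and returns caption text. For example,
    assuming the client from the code sample:
        lambda p: model.caption(Image.open(p), length="short")["caption"]
    """

    def safe_caption(path: str) -> Optional[str]:
        # One failed image should not abort the whole batch.
        try:
            return caption_fn(path)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so captions line up with paths.
        captions = list(pool.map(safe_caption, paths))
    return dict(zip(paths, captions))
```

Images that fail are recorded as `None`, so they can be retried later without re-running the full batch.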

Moondream can read and incorporate visible text into captions, making it useful for documents, signs, product labels, and screenshots where text provides important context.

Caption uses per-token pricing. Short captions typically use ~20 output tokens; normal captions use ~60 tokens; long captions use ~150 tokens. Every Moondream Cloud account includes $5 in free monthly credits.
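For rough budgeting, the arithmetic can be sketched as follows. The per-image figure is taken from the single demo request shown on this page ($0.000324 for one normal caption) and is only an assumption about typical cost; actual cost depends on the image and the caption length. The function names are hypothetical.

```python
DEMO_COST_PER_IMAGE_USD = 0.000324  # from the demo request above; an assumption, not a quoted rate
FREE_MONTHLY_CREDIT_USD = 5.00      # free monthly credits stated in this FAQ


def images_within_budget(budget_usd: float, cost_per_image_usd: float) -> int:
    """Whole number of images a budget covers at a fixed per-image cost."""
    if cost_per_image_usd <= 0:
        raise ValueError("cost_per_image_usd must be positive")
    return int(budget_usd // cost_per_image_usd)


def batch_cost(num_images: int, cost_per_image_usd: float) -> float:
    """Estimated cost in USD of captioning num_images images."""
    return num_images * cost_per_image_usd


print(images_within_budget(FREE_MONTHLY_CREDIT_USD, DEMO_COST_PER_IMAGE_USD))
print(round(batch_cost(1_000_000, DEMO_COST_PER_IMAGE_USD), 2))
```

Under that assumed per-image cost, the free credits cover on the order of fifteen thousand normal captions per month.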

Caption is available in both Moondream Cloud and the downloadable model. For large batch jobs, you can run Caption locally on your own GPUs to eliminate per-request costs entirely.

Caption operates on still images, so it can be applied to key frames extracted from video. Combined with its low latency, this enables video summarization, scene labeling, and content indexing workflows.
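One simple way to pick key frames is fixed-interval sampling, sketched below. This is an assumption about how frames might be chosen, not a Moondream feature: `keyframe_indices` is a hypothetical helper, and decoding the frames themselves (for example with a video library) is left out. Each selected frame would then be captioned like any other still image.

```python
from typing import List


def keyframe_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Return the indices of frames sampled once every `every_seconds`."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    # Convert the time interval to a whole number of frames (at least 1).
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 2 seconds.
print(keyframe_indices(300, 30.0, 2.0))  # → [0, 60, 120, 180, 240]
```

A shorter interval gives finer scene coverage at the cost of more caption requests per video.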

Use "normal" length for accessibility alt-text. It generates descriptive, contextual captions (~80 words) that work well for screen readers, covering visual elements in a way that's meaningful to users who cannot see the image.
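When a generated caption is placed into a page as alt-text, it should be escaped first. The sketch below uses Python's standard `html.escape`; the `img_tag` helper is a hypothetical name for illustration, and the caption string stands in for the output of a `length="normal"` request.

```python
import html


def img_tag(src: str, caption: str) -> str:
    """Build an <img> element whose alt attribute holds the caption."""
    # quote=True also escapes quotation marks so the attribute stays well-formed.
    safe_src = html.escape(src, quote=True)
    safe_alt = html.escape(caption.strip(), quote=True)
    return f'<img src="{safe_src}" alt="{safe_alt}">'


print(img_tag("dog.jpg", "A brown dog holds a frisbee on a sidewalk."))
```

Escaping matters because captions can include quotation marks or ampersands read from visible text in the image.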
Running into problems or need help? Join us on Discord.