Moondream Lens: finetuning

Moondream is good at a lot of things. Fine-tune it to be great at yours.

Lens is Moondream's hosted fine-tuning API. Bring your task and labeled examples, and Lens trains a Moondream variant that beats general-purpose VLMs on the work you actually care about. The fine-tuned model runs anywhere, from a Jetson at the edge to our cloud.

Read the docs

Open weights · SFT and RL · Edge to datacenter · Cloud or on-device

Query

Better answers to questions about your images. Classify by your categories, read your forms, recognize your products.

Structured catalog data, every time

Retail · Merchandising · Inventory · Search

Prompt

Return JSON with: category, primary_color, accents, closure, sole_material, use_case.

Before: Base Moondream

category:       Sneakers
primary_color:  Off-white
accents:        coral
closure:        lace-up
sole_material:  rubber
use_case:       casual wear

After: Fine-tuned with Lens

category:       running shoe
primary_color:  white
accents:        orange
closure:        lace-up
sole_material:  rubber
use_case:       road running

How it learned

Sneakers→running shoe

Corrects category to the taxonomy your catalog actually uses. Generic labels become the specific class your team merchandises against.

Off-white→white

Reflects retail color conventions, where customers shop and filter by the family name, not the literal pixel value.

coral→orange

Snaps fashion descriptors back to the swatch language used across product pages, search facets, and merchandising tags.
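If downstream systems depend on getting the same six fields every time, it can help to check each response before it reaches the catalog. The following is a minimal sketch, assuming the model's answer arrives as a JSON string. The field names come from the prompt above; `parse_catalog_record` and the sample response are hypothetical and not part of any Moondream SDK.

```python
import json

# Field names taken from the prompt shown earlier in this section.
REQUIRED_FIELDS = (
    "category", "primary_color", "accents",
    "closure", "sole_material", "use_case",
)

def parse_catalog_record(raw: str) -> dict:
    """Parse a response and confirm it contains exactly the prompted fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    extra = [k for k in record if k not in REQUIRED_FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    return record

# Hypothetical raw response mirroring the fine-tuned output above.
raw_response = json.dumps({
    "category": "running shoe",
    "primary_color": "white",
    "accents": "orange",
    "closure": "lace-up",
    "sole_material": "rubber",
    "use_case": "road running",
})
record = parse_catalog_record(raw_response)
```

A check like this only confirms the shape of the answer. Whether "running shoe" is the right class is what the fine-tune itself has to learn.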

Caption

Captions in your style and voice. Describe what matters for your use case, skip what doesn't.

Same image, your voice

Retail · Editorial · Marketplaces · SEO

Prompt

Caption this listing.

Before: Base Moondream

A modern kitchen features dark gray cabinets, light wood flooring, a stainless steel refrigerator, integrated oven and cooktop, and a marble backsplash.

After: Fine-tuned with Lens

Sleek modern kitchen featuring high-gloss charcoal cabinets, a striking waterfall quartz island, and a full-height marble backsplash. Stainless appliances, warm under-cabinet lighting, and wide-plank wood floors tie it all together.

How it learned

modern kitchen→Sleek modern kitchen

Picks up the editorial register your brand uses on product pages. Descriptive openers instead of bare nouns.

dark gray cabinets→high-gloss charcoal cabinets

Maps observed colors and finishes to the design vocabulary your merchandisers actually write. Finish first, then hue.

marble backsplash→waterfall quartz island, full-height marble backsplash

Learns the materials and architectural features your listings always call out, even when they sit at the edge of frame.

tie it all together→—

Avoids the cliché closers ("tie it all together", "a true entertainer's dream") your style guide bans. Add forbidden phrases to the dataset and the model stops producing them.

Detect

Find the objects you care about, ignore the rest. Cut false positives down to near zero.

Detect what matters, not just what's there

Manufacturing · Construction · Quality control · Brand protection

Prompt

Detect: workers without hard hats.

Before: Base Moondream

5 boxes returned. 4 are false positives.
Workers wearing hard hats were flagged anyway.
Only 1 detection is correct.

After: Fine-tuned with Lens

1 box returned: the single worker without a hard hat.
0 false positives on workers wearing PPE.

How it learned

every person→workers without hard hats

Learns the negative class specifically. The base model knows what people are. The fine-tune learns which condition matters for your alert.

5 boxes, 4 wrong→1 box, 0 wrong

Cuts false positives by teaching the model the look of compliant workers in your environment: your lighting, your PPE styles, your camera angles.

generic confidence→production-stable

You can set a threshold and trust it. Base-model confidence scores are noisy across scenes; a fine-tune calibrates them against your data.
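To put numbers on the before/after comparison, the box counts above can be turned into precision, recall, and F1. This is a rough sketch using the standard definitions. It assumes the scene contains exactly one worker without a hard hat (so there are no missed detections), which matches the example but is not stated as a measured result.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard detection metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Base model: 5 boxes returned, 1 correct, 4 false positives.
before = precision_recall_f1(tp=1, fp=4, fn=0)

# Fine-tune: 1 box returned, 1 correct, 0 false positives.
after = precision_recall_f1(tp=1, fp=0, fn=0)

print(f"before: precision={before[0]:.2f} recall={before[1]:.2f} f1={before[2]:.2f}")
print(f"after:  precision={after[0]:.2f} recall={after[1]:.2f} f1={after[2]:.2f}")
```

Under those assumptions, precision moves from 0.20 to 1.00 while recall stays at 1.00. The improvement in this example comes entirely from removing false positives.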

Point

More accurate clicks. Better grounding for agents and UI automation.

Pixel-accurate grounding for agents

Computer use · UI automation · Accessibility · QA

Prompt

Click on the second reel.

Before: Base Moondream

1 point returned, but it's for the second reel column-wise.

After: Fine-tuned with Lens

1 point returned for the correct (row-wise) reel.

How it learned

column-wise→row-wise

Learns how your users count. "Second reel" means top row, middle column in this app. The base model picks the next-best guess; the fine-tune learns the convention.

near-misses→center hits

Lands the click well inside the target's hit area, not on its edge. Critical for flaky touch targets and tightly packed grids.

app-agnostic→app-aware

Recognizes the layout structure of the specific app. Agents stop wandering through marginal UI affordances and go straight to the intended element.
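One way to measure "center hits" against "near-misses" when evaluating a pointing model is to test each predicted point against the target's box shrunk by a margin. The sketch below assumes normalized coordinates in the 0 to 1 range. The `Box` type, the `is_center_hit` helper, and the 25% inset are illustrative choices, not part of Lens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned target area in normalized image coordinates (assumed 0..1)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def is_center_hit(x: float, y: float, box: Box, inset: float = 0.25) -> bool:
    """True if (x, y) falls inside the box after shrinking each side by `inset`
    times the box width or height. Points near the edge count as near-misses."""
    dx = (box.x_max - box.x_min) * inset
    dy = (box.y_max - box.y_min) * inset
    return (box.x_min + dx <= x <= box.x_max - dx
            and box.y_min + dy <= y <= box.y_max - dy)

# Hypothetical target element and two predicted clicks.
target = Box(x_min=0.40, y_min=0.10, x_max=0.60, y_max=0.30)
center_click = is_center_hit(0.50, 0.20, target)  # middle of the target
edge_click = is_center_hit(0.41, 0.20, target)    # inside the box, but near its left edge
```

Counting center hits separately from plain in-box hits makes the difference visible in an evaluation, since both clicks above would pass a simple inside-the-box test.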

Two ways to teach it

Pick the method that fits the data you have.

SFT: Supervised fine-tuning

Show, don't tell.

Give Moondream input/output pairs and it learns to match them. Best for teaching domain-specific concepts or when you already have a dataset.

Best fit for

Classification with a small set of categories
Captioning in a fixed style or voice
Detection with bounding boxes
Structured outputs and form parsing

How much data

Classification

25 to 100 per class

Captioning in a style

100 to 500 examples

Complex tasks

1,000+

Main cost

Producing a large dataset
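For a sense of what SFT input/output pairs can look like on disk, here is a minimal sketch that stores them as JSON Lines, one example per line. The record layout (`image`, `prompt`, `output`) and the file paths are assumptions for illustration. They are not the documented Lens dataset schema, so check the docs for the actual format.

```python
import json

# Hypothetical SFT examples: each pairs an input (image + prompt) with the desired output.
examples = [
    {
        "image": "shoes/0001.jpg",
        "prompt": "Return JSON with: category, primary_color, accents, "
                  "closure, sole_material, use_case.",
        "output": {
            "category": "running shoe",
            "primary_color": "white",
            "accents": "orange",
            "closure": "lace-up",
            "sole_material": "rubber",
            "use_case": "road running",
        },
    },
]

def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

def from_jsonl(text: str) -> list:
    """Parse JSON Lines back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_jsonl(examples)
round_tripped = from_jsonl(jsonl_text)
```

Keeping one example per line makes it easy to append new labels, count examples per class, and spot malformed records before training.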

RL: Reinforcement learning

Reward what works.

Give Moondream a task and score its answer variations. It learns which ones score higher. Best when the model is already somewhat proficient, or when you only have a few examples. It can work with as few as 20 examples.

Best fit for

Reasoning and multi-step tasks
Open-ended outputs with many valid answers
Cases where you can verify correctness automatically
Optimizing directly for a metric

How much data

Classification

5 to 20 per class

Reasoning tasks

100 to 500 prompts

Open-ended

Depends on reward quality

Main cost

Designing the scorer
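Since the scorer is where most of the RL effort goes, a small example may help. This is a minimal sketch of a scorer for a classification-style task, assuming each answer variation is a short text label and that a known expected label is available. The function names and the exact-match rule are illustrative and not how Lens necessarily computes rewards.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(label.lower().split())

def score_answer(answer: str, expected: str) -> float:
    """Reward 1.0 for a match after normalization, 0.0 otherwise."""
    return 1.0 if normalize(answer) == normalize(expected) else 0.0

def rank_variations(variations: list, expected: str) -> list:
    """Score each answer variation and return (score, answer) pairs, best first.
    Ties keep their original order because Python's sort is stable."""
    scored = [(score_answer(v, expected), v) for v in variations]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical answer variations for one image whose correct label is "rock".
variations = ["Rock", "paper", " rock "]
ranked = rank_variations(variations, expected="rock")
```

A binary scorer like this is easy to trust but gives no partial credit. For open-ended tasks, the quality of the scoring rule matters more than the number of prompts.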

Quick rule

SFT if labeling is cheap.

SFT if you're teaching it new concepts, or using your own domain-specific language or concepts.

RL if labeling is hard but checking is easy.

RL if you only have a small dataset.

Not sure? Send 10 examples and we'll tell you which method to use.

Why Lens

One fine-tune that runs everywhere.

Closed APIs lock your fine-tune to their endpoint. Open frameworks make you build the training stack yourself. Lens trains the model for you, then lets you serve it from our cloud or run it on your own hardware with Photon.

Train without infrastructure.

Send your data through the API. Get back a model. No GPUs to provision, no training scripts to babysit, no environments to keep in sync.

Run it in our cloud.

Hosted inference on Moondream Cloud. Call your fine-tune through an autoscaled endpoint, with the same SDK as the base model.

Or run it on device.

Photon runs your fine-tune locally on a Jetson at the edge, a workstation on the factory floor, or an air-gapped server. No data leaves your network.

Small enough to be fast.

Moondream models are small by design. Real-time inference at low cost. Hundreds of inferences per second on a single GPU with Photon.

The 10-image challenge

Bring your hardest task. We'll prove it works.

Send us 10 labeled examples of your task. We will return a fine-tuned Moondream that does it better than the base model. If it does not, you owe us nothing.

Or do it yourself

Pick a skill.

Query, caption, detect, point, or segment.

Collect examples.

10 to 50 labeled examples of your task. More if SFT, fewer if RL.

Call the API.

Pass your data. Get a model back. Deploy it on Moondream Cloud or run it locally with Photon.

Read the docs

Real fine-tunes

Example fine-tunes based on real customer use cases.

Detect · SFT

Player with Ball Detection

Detect the player holding the basketball in NBA broadcast footage.

F1 0.28 → 0.79

Detect · SFT

State Farm Logo Detection

Detect State Farm logos in NBA broadcast frames.

F1 0.38 → 1.00

Query · SFT

GeoGuessr Countries

Predict the country from a single street-view image.

28.6% → 71.1%

Query · RL

Rock Paper Scissors

Classify hand gestures with 5 examples per class.

54.8% → 98.8%

Query · RL

Glaucoma Detection

Classify retinal images by glaucoma stage.

17.6% → 69.2%

Point · SFT

Computer Use

Click the correct UI element from a screenshot and instruction.

27.0% → 63.8%

Frequently asked

Questions, answered.

SFT needs labeled examples. RL needs a way to score outputs instead, which can replace labels. If you can verify correctness automatically (a regex, a scoring function, an external check), RL works without labeled examples.
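As an illustration of a label-free check, here is a rough sketch of a regex-based reward for a hypothetical form-reading task where the answer must be a date in `YYYY-MM-DD` form. The task, the pattern, and the `format_reward` name are assumptions made for this example.

```python
import re

# Hypothetical requirement: the answer must be an ISO-style date such as 2024-03-15.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def format_reward(answer: str) -> float:
    """Reward 1.0 when the whole answer matches the date pattern, else 0.0.
    No labeled example is needed, only the rule itself."""
    return 1.0 if ISO_DATE.fullmatch(answer.strip()) else 0.0

good = format_reward("2024-03-15")
bad = format_reward("March 15, 2024")
```

Note that this rewards the right format, not the right date. A check like this usually works best combined with a second signal that verifies the value.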

Ready to take Moondream to production?