Moondream · LENS

Moondream is good at a lot of things. Fine-tune it to be great at yours.

Lens is Moondream's hosted fine-tuning API. Bring your task and labeled examples; Lens trains a Moondream variant that beats general-purpose VLMs on the work you actually care about. It runs anywhere from a Jetson at the edge to our cloud.

Read the docs
Open weights · SFT and RL · Edge to datacenter · Cloud or on-device
01
Query

Better answers to questions about your images. Classify by your categories, read your forms, recognize your products.

Structured catalog data, every time
Retail · Merchandising · Inventory · Search
Prompt
Return JSON with: category, primary_color, accents, closure, sole_material, use_case.
Before · Base Moondream
category: Sneakers
primary_color: Off-white
accents: coral
closure: lace-up
sole_material: rubber
use_case: casual wear
After · Fine-tuned with Lens
category: running shoe
primary_color: white
accents: orange
closure: lace-up
sole_material: rubber
use_case: road running
How it learned
Sneakers → running shoe
Corrects category to the taxonomy your catalog actually uses. Generic labels become the specific class your team merchandises against.
Off-white → white
Reflects retail color conventions, where customers shop and filter by color family, not the literal pixel value.
coral → orange
Snaps fashion descriptors back to the swatch language used across product pages, search facets, and merchandising tags.
02
Caption

Captions in your style and voice. Describe what matters for your use case, skip what doesn't.

Same image, your voice
Retail · Editorial · Marketplaces · SEO
Prompt
Caption this listing.
Before · Base Moondream
A modern kitchen features dark gray cabinets, light wood flooring, a stainless steel refrigerator, integrated oven and cooktop, and a marble backsplash.
After · Fine-tuned with Lens
Sleek modern kitchen featuring high-gloss charcoal cabinets, a striking waterfall quartz island, and a full-height marble backsplash. Stainless appliances, warm under-cabinet lighting, and wide-plank wood floors tie it all together.
How it learned
modern kitchen → Sleek modern kitchen
Picks up the editorial register your brand uses on product pages. Descriptive openers instead of bare nouns.
dark gray cabinets → high-gloss charcoal cabinets
Maps observed colors and finishes to the design vocabulary your merchandisers actually write. Finish first, then hue.
marble backsplash → waterfall quartz island, full-height marble backsplash
Learns the materials and architectural features your listings always call out, even when they sit at the edge of frame.
tie it all together
Avoid the cliché closers ("tie it all together", "a true entertainer's dream") your style guide bans. Add forbidden phrases to the dataset and the model stops producing them.
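To confirm that a captioning fine-tune respects a style guide, you can scan its outputs for the banned phrases. The sketch below is a minimal, assumed approach: `BANNED_PHRASES`, `find_banned_phrases`, and `style_score` are hypothetical names, not part of the Lens API, and the 0.5 penalty per phrase is an arbitrary choice.

```python
# Hypothetical style-guide list: replace with the phrases your own guide bans.
BANNED_PHRASES = ("tie it all together", "a true entertainer's dream")


def find_banned_phrases(caption: str, banned: tuple[str, ...] = BANNED_PHRASES) -> list[str]:
    """Return each banned phrase that appears in the caption, ignoring case."""
    lowered = caption.lower()
    return [phrase for phrase in banned if phrase.lower() in lowered]


def style_score(caption: str) -> float:
    """Score a caption: 1.0 if clean, minus 0.5 per banned phrase found, never below 0.0."""
    return max(0.0, 1.0 - 0.5 * len(find_banned_phrases(caption)))
```

Run it over a held-out set of captions before and after fine-tuning to see whether the banned closers actually disappear; the same function can double as a simple scorer if you train with RL instead.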
03
Detect

Find the objects you care about, ignore the rest. Cut false positives down to near zero.

Detect what matters, not just what's there
Manufacturing · Construction · Quality control · Brand protection
Prompt
Detect: workers without hard hats.
Before · Base Moondream
5 boxes returned. 4 are false positives.
Workers wearing hard hats were flagged anyway.
Only 1 detection is correct.
After · Fine-tuned with Lens
1 box returned: the single worker without a hard hat.
0 false positives on workers wearing PPE.
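Counting false positives, as in the before/after numbers above, usually means matching each predicted box to a ground-truth box by intersection-over-union (IoU). The sketch below assumes boxes given as `(x_min, y_min, x_max, y_max)` and a 0.5 IoU threshold; it is a generic evaluation recipe, not how Lens itself scores detections.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_false_positives(preds: list[Box], truths: list[Box], thresh: float = 0.5) -> int:
    """Greedily match predictions to unmatched ground truth; leftovers are false positives."""
    unmatched = list(truths)
    false_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= thresh:
            unmatched.remove(best)  # each ground-truth box can be matched only once
        else:
            false_positives += 1
    return false_positives
```

Matching here follows prediction order; standard mAP tooling sorts predictions by confidence first, which matters when boxes compete for the same target.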
04
Point

More accurate clicks. Better grounding for agents and UI automation.

Pixel-accurate grounding for agents
Computer use · UI automation · Accessibility · QA
Prompt
Click on the second reel.
Before · Base Moondream
1 point returned, but on the second reel counting down the columns instead of across the rows.
After · Fine-tuned with Lens
1 point returned on the correct reel: the second one counting across the rows.
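The reel example turns on reading order: "second" means something different row by row than column by column. The sketch below orders target boxes row by row, then checks whether a returned point lands inside the intended one. It assumes normalized `(x, y)` points and `(x_min, y_min, x_max, y_max)` boxes, and `row_tol` is a hypothetical tolerance for grouping boxes into rows; none of this is Moondream's own output format.

```python
Point = tuple[float, float]  # (x, y), normalized to [0, 1]
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center(box: Box) -> Point:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def reading_order(boxes: list[Box], row_tol: float = 0.05) -> list[Box]:
    """Order boxes row by row (top to bottom), then left to right within each row."""
    rows: list[list[Box]] = []
    for box in sorted(boxes, key=lambda b: center(b)[1]):
        # Boxes whose centers sit within row_tol vertically of the row's first box share that row.
        if rows and abs(center(box)[1] - center(rows[-1][0])[1]) <= row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: center(b)[0])]


def point_hits(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]
```

In a 2×2 grid, `reading_order(...)[1]` is the top-right box; a point on the bottom-left box (the second one column-wise) would fail `point_hits`, which is the base-model mistake described above.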
Two ways to teach it

Pick the method that fits the data you have.

SFT · Supervised fine-tuning

Show, don't tell.

Give Moondream input/output pairs and it learns to match them. Best for teaching domain-specific concepts or when you already have a dataset.

Best fit for
  • Classification with a small set of categories
  • Captioning in a fixed style or voice
  • Detection with bounding boxes
  • Structured outputs and form parsing
How much data
  • Classification: 25 to 100 examples per class
  • Captioning in a style: 100 to 500 examples
  • Complex tasks: 1,000+ examples
Main cost
Producing a large dataset.
RL · Reinforcement learning

Reward what works.

Give Moondream a task and a way to score its answers. It tries variations and learns to favor the ones that score higher. Best when the model is already somewhat proficient, or when you only have a few examples; it works with as few as 20.

Best fit for
  • Reasoning and multi-step tasks
  • Open-ended outputs with many valid answers
  • Cases where you can verify correctness automatically
  • Optimizing directly for a metric
How much data
  • Classification: 5 to 20 examples per class
  • Reasoning tasks: 100 to 500 prompts
  • Open-ended: depends on reward quality
Main cost
Designing the scorer.
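As an illustration of what a scorer can look like, here is a reward for the structured catalog task from the Query example: parse the model's JSON and give partial credit for each field that matches a reference value. This is a minimal sketch under assumptions: `catalog_reward` is a hypothetical function, not a Lens API, it expects bare JSON (no code fences), and it compares values case-insensitively.

```python
import json

# Field names taken from the catalog prompt in the Query example.
FIELDS = ("category", "primary_color", "accents", "closure", "sole_material", "use_case")


def catalog_reward(model_output: str, reference: dict[str, str]) -> float:
    """Fraction of fields matching the reference; 0.0 if the output is not a JSON object."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # malformed output earns nothing
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(
        1
        for field in FIELDS
        if str(parsed.get(field, "")).strip().lower() == reference[field].strip().lower()
    )
    return hits / len(FIELDS)
```

Scored against the fine-tuned answer as reference, the base model's answer above earns 2/6 (only `closure` and `sole_material` match). Partial credit gives RL a smoother signal than an all-or-nothing check.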
Quick rule
  • SFT if labeling is cheap.
  • SFT if you're teaching it new concepts or your own domain-specific vocabulary.
  • RL if labeling is hard but checking is easy.
  • RL if you only have a small dataset.

Not sure? Send 10 examples and we'll tell you which method to use.

Why Lens

One fine-tune that runs everywhere.

Closed APIs lock your fine-tune to their endpoint. Open frameworks make you build the training stack yourself. Lens trains the model for you, then lets you serve it from our cloud or run it on your own hardware with Photon.

Train without infrastructure.

Send your data through the API. Get back a model. No GPUs to provision, no training scripts to babysit, no environments to keep in sync.

Run it in our cloud.

Hosted inference on Moondream Cloud. Call your fine-tune from an autoscaled endpoint using the same SDK as the base model.

Or run it on device.

Photon runs your fine-tune locally on a Jetson at the edge, a workstation on the factory floor, or an air-gapped server. No data leaves your network.

Small enough to be fast.

Moondream models are small by design. Real-time inference at low cost. Hundreds of inferences per second on a single GPU with Photon.

The 10-image challenge

Bring your hardest task. We'll prove it works.

Send us 10 labeled examples of your task. We will return a fine-tuned Moondream that does it better than the base model. If it does not, you owe us nothing.

Or do it yourself
01
Pick a skill.
Query, caption, detect, point, or segment.
02
Collect examples.
10 to 50 labeled examples of your task. More if SFT, fewer if RL.
03
Call the API.
Pass your data. Get a model back. Serve it from Moondream Cloud or run it on your own hardware with Photon.
Read the docs
Real fine-tunes

Example fine-tunes based on real customer use cases.

Browse all examples
Frequently asked

Questions, answered.

Do I need labeled data?
For SFT, yes. For RL, you need a way to score outputs, which can stand in for labels. If you can verify correctness automatically (a regex, a scoring function, an external check), RL works without labeled examples.
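As a concrete case of checking without labels, suppose a hypothetical task is reading 12-digit UPC-A barcodes off product photos. A regex confirms the format, and the UPC-A check digit acts as an external check, so no ground-truth code is needed. This is a sketch under that assumed task; `upc_reward` is an illustrative name, not a Lens API.

```python
import re

UPC_A = re.compile(r"[0-9]{12}")


def upc_reward(model_output: str) -> float:
    """1.0 if the output is a 12-digit UPC-A code with a valid check digit, else 0.0."""
    code = model_output.strip()
    if not UPC_A.fullmatch(code):
        return 0.0
    digits = [int(ch) for ch in code]
    # 1-based positions 1, 3, ..., 11 are weighted by 3; positions 2, 4, ..., 10 by 1.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    check = (10 - total % 10) % 10
    return 1.0 if check == digits[11] else 0.0
```

A valid checksum catches most single-digit misreads, but it is a strong signal rather than proof: a misread that happens to preserve the checksum still scores 1.0.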

Ready to take Moondream to production?