Finetunes
How finetuning works
Finetuning adapts a base vision model to a specific workflow. Training on focused examples and validating against the base model yields a finetuned model with measurable improvements on its task.
Collect focused examples
Curate representative edge cases for the exact task you want to improve.
Train a finetuned model
Run a finetune on that task-specific data so the model learns the target behavior.
Measure and ship
Evaluate against the base model on the same test set, then ship the finetuned model once it shows a clear performance lift.
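As a rough illustration of the "measure" step, the sketch below scores a base model and a finetuned model on the same labeled examples and reports the absolute gain. The function names `accuracy` and `lift` and the toy labels are hypothetical and are not part of any Moondream API. It assumes predictions are plain labels compared by exact match.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def lift(base_predictions, finetuned_predictions, labels):
    """Return (base accuracy, finetuned accuracy, absolute gain)."""
    base = accuracy(base_predictions, labels)
    tuned = accuracy(finetuned_predictions, labels)
    return base, tuned, tuned - base


# Toy data, purely illustrative (not results from any demo on this page).
labels = ["rock", "paper", "scissors", "rock"]
base_preds = ["rock", "rock", "rock", "rock"]
tuned_preds = ["rock", "paper", "scissors", "paper"]
print(lift(base_preds, tuned_preds, labels))
```

Scoring both models on one held-out set keeps the comparison fair: any difference comes from the finetune rather than from the examples chosen.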
Demo Finetunes
Explore demo finetunes
Browse task-specific finetunes and open each demo page to see before/after behavior.

Object Detection
RL
Player with Ball Detection
Detect which player is currently holding the basketball in NBA game footage. The base model fires dozens of false-positive boxes. After reinforcement learning fine-tuning, the model locks onto the correct player with an F1 of 0.79, compared to 0.53 from GPT-5.4.
F1: 28.3% → 78.8%

Object Detection
RL
State Farm Logo Detection
Detect State Farm logos in NBA broadcast footage. The base model produces 30 false positives and misses 18 logos across the test set. After fine-tuning, F1 reaches 1.0 with zero false positives and zero false negatives.
F1: 38.5% → 100%
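For readers unfamiliar with the metric, here is a minimal sketch of how F1 follows from detection counts. The `f1_score` helper is a hypothetical name, not a library call. The page reports 30 false positives and 18 misses for the base model but not the number of correct detections, so the value of 15 used below is an assumption, chosen only because it is consistent with the reported 38.5%.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Base model: 30 false positives and 18 misses are from the page;
# 15 true positives is an ASSUMED count, not stated in the source.
base_f1 = f1_score(true_positives=15, false_positives=30, false_negatives=18)

# Finetuned model: zero false positives and zero false negatives.
# Under the same assumption, all 15 + 18 = 33 logos are found.
tuned_f1 = f1_score(true_positives=33, false_positives=0, false_negatives=0)

print(round(base_f1, 3), tuned_f1)
```

With no false positives and no false negatives, precision and recall are both 1, so F1 is 1.0 regardless of how many logos the test set contains.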

Geolocation
SFT
GeoGuessr Countries
Predict the country from a single street-view image by reading road markings, signage, and vegetation. Supervised fine-tuning (SFT) on a small dataset takes Moondream from 28.6% to 71.1% accuracy across 53 countries, outperforming GPT-5.4 at 69.8%.
Accuracy: 28.6% → 71.1%

Classification
RL
Rock Paper Scissors
Classify real photos of hand gestures as rock, paper, or scissors. With only 5 training examples per class and 50 RL steps, accuracy jumps from 54.8% to 98.8%. This demo highlights extreme data efficiency: a useful model from almost no training data.
Accuracy: 54.8% → 98.8%

Medical Imaging
RL
Glaucoma Detection
Classify retinal fundus images into three glaucoma stages: normal, early, or advanced. The base model achieves only 17.6% accuracy. After 100 RL steps, the fine-tune reaches 69.2%, more than double GPT-5.4's 33.2%.
Accuracy: 17.6% → 69.2%

Video Understanding
SFT
Video Action Detection
Identify the action happening in a video from a 3x3 grid of frames. This teaches the model to reason across time, connecting motion across multiple frames to describe the action. Supervised fine-tuning (SFT) raises template match accuracy from 0% to 50%.
Template Match: 0% → 50%
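One plausible way to build the 3x3 grid input is to pick nine evenly spaced frames and lay them out in reading order. The page does not say how frames are actually sampled, so the midpoint-of-segment strategy and the helper names `grid_frame_indices` and `grid_layout` below are assumptions for illustration only.

```python
def grid_frame_indices(num_frames, rows=3, cols=3):
    """Pick rows*cols evenly spaced frame indices, in temporal order."""
    cells = rows * cols
    if num_frames < cells:
        raise ValueError("video has fewer frames than grid cells")
    # Split the video into `cells` equal segments and take each midpoint.
    # Integer arithmetic avoids floating-point rounding surprises.
    return [((2 * i + 1) * num_frames) // (2 * cells) for i in range(cells)]


def grid_layout(indices, cols=3):
    """Arrange indices row by row: time reads left to right, top to bottom."""
    return [indices[i:i + cols] for i in range(0, len(indices), cols)]


# Example: a hypothetical 90-frame clip.
for row in grid_layout(grid_frame_indices(90)):
    print(row)
```

Keeping the cells in temporal order matters here, since the model has to connect motion from one cell to the next.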

Point Detection
RL
Aerial Airplane Detection
Localize airplanes in satellite imagery of airports using Moondream's point skill. The base model fires 1,906 false positives. After fine-tuning with tiled inference, F1 improves from 0.30 to 0.55, while GPT-5.4 achieves only 0.10.
F1: 29.5% → 55.1%
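Tiled inference generally means running the model on overlapping crops of a large image and merging the results. The page does not describe the tiling scheme used in this demo, so the sketch below is a generic version under stated assumptions: points come back in pixel coordinates relative to each tile, and duplicates from overlapping tiles are merged by a simple distance threshold. All function names and parameter values are hypothetical.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles along one axis, covering the full length."""
    if tile <= overlap:
        raise ValueError("tile size must exceed overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the far edge
    return origins


def to_global(points, origin):
    """Shift tile-relative (x, y) points into full-image coordinates."""
    ox, oy = origin
    return [(ox + px, oy + py) for px, py in points]


def merge_points(points, min_distance):
    """Greedy dedupe: drop any point closer than min_distance to a kept one."""
    kept = []
    for x, y in points:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_distance ** 2
               for kx, ky in kept):
            kept.append((x, y))
    return kept


# Example: a 1000-pixel axis, 400-pixel tiles, 100 pixels of overlap.
print(tile_origins(1000, 400, 100))
# Two tiles reporting the same airplane a few pixels apart collapse to one.
print(merge_points([(310, 620), (312, 621), (900, 50)], min_distance=5))
```

Overlap keeps an object on a tile boundary fully visible in at least one crop, and the merge step stops it from being counted twice.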