Finetunes

How finetuning works

Finetuning adapts a base vision model to a specific workflow. You train on focused examples and validate against the base model, so the finetuned model's improvement on its task is measured rather than assumed.

Collect focused examples

Curate representative examples, including edge cases, for the exact task you want to improve.

Train a finetuned model

Run a finetune on that task-specific data so the model learns the target behavior.

Measure and ship

Evaluate against the base model on the same held-out examples, then ship the finetuned model once it shows a clear performance lift.
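The comparison step can be sketched as follows. This is a minimal illustration, not the platform's evaluation code: `accuracy` and `compare` are hypothetical helpers, and each model is assumed to be wrapped in a plain callable that maps an input to a predicted label.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Example = Tuple[Hashable, str]  # (input, expected label); illustrative type


def accuracy(predict: Callable[[Hashable], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the prediction equals the expected label."""
    if not examples:
        raise ValueError("need at least one example to evaluate")
    correct = sum(1 for item, label in examples if predict(item) == label)
    return correct / len(examples)


def compare(
    base_predict: Callable[[Hashable], str],
    tuned_predict: Callable[[Hashable], str],
    examples: Sequence[Example],
) -> Dict[str, float]:
    """Score both models on the same held-out examples and report the lift."""
    base = accuracy(base_predict, examples)
    tuned = accuracy(tuned_predict, examples)
    return {"base": base, "finetuned": tuned, "lift": tuned - base}
```

Scoring both models on the same held-out set keeps the lift attributable to the finetune rather than to a difference in test data.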

Demo Finetunes

Explore demo finetunes

Browse task-specific finetunes and open each demo page to see before/after behavior.

Basketball game frame for ball-handler detection

Object Detection

Player with Ball Detection

Detect which player is currently holding the basketball in NBA game footage. The base model fires dozens of false-positive boxes. After reinforcement learning fine-tuning, the model locks onto the correct player with an F1 of 0.79, compared to 0.53 from GPT-5.4.

F1: 28.3% → 78.8%
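False-positive boxes are counted by matching predictions to ground-truth boxes. The sketch below shows one common way to do that, greedy matching by intersection over union (IoU). The 0.5 threshold, the box format, and the function names are assumptions for illustration; the page does not state which matching rule this demo uses.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(
    predicted: Sequence[Box], truth: Sequence[Box], threshold: float = 0.5
) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    Each ground-truth box can be matched by at most one prediction.
    """
    unmatched: List[Box] = list(truth)
    true_pos = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    return true_pos, false_pos, false_neg
```

With one player holding the ball per frame, every extra predicted box beyond the matched one counts as a false positive, which is what drags the base model's F1 down.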

Object Detection

State Farm Logo Detection

Detect State Farm logos in NBA broadcast footage. The base model produces 30 false positives and misses 18 logos across the test set. After fine-tuning, F1 reaches 1.0 with zero false positives and zero false negatives.
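For reference, F1 is the harmonic mean of precision and recall, so it only reaches 1.0 when both false positives and false negatives are zero. A minimal sketch of the standard formula follows; `f1_score` is an illustrative helper name. The base model's F1 cannot be reproduced from the text above, because the number of true positives is not given.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 from detection counts; returns 0.0 when nothing was matched."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Any nonzero number of true positives with zero false positives and zero false negatives gives exactly 1.0, matching the result reported for the finetuned model.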

Predict the country from a single street-view image by reading road markings, signage, and vegetation. Supervised fine-tuning (SFT) on a small dataset takes Moondream from 28.6% to 71.1% accuracy across 53 countries, outperforming GPT-5.4 at 69.8%.

Accuracy: 28.6% → 71.1%

Classification

Rock Paper Scissors

Classify real photos of hand gestures as rock, paper, or scissors. With only 5 training examples per class and 50 RL steps, accuracy jumps from 54.8% to 98.8%. This demo highlights extreme data efficiency: a useful model from almost no training data.

Accuracy: 54.8% → 98.8%
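A training set of 5 examples per class can be drawn from a larger labeled pool with a few lines of code. This is a sketch under assumptions: `sample_per_class` is a hypothetical helper, and the page does not say how the demo's training examples were actually chosen.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Example = Tuple[Hashable, str]  # (image reference, label); illustrative type


def sample_per_class(
    examples: Sequence[Example], per_class: int = 5, seed: int = 0
) -> List[Example]:
    """Pick a fixed number of examples for every label, reproducibly."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # seeded so the subset can be regenerated
    subset: List[Example] = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"label {label!r} has only {len(pool)} examples")
        subset.extend(rng.sample(pool, per_class))
    return subset
```

Keeping the classes balanced matters most at this scale, since a single extra example shifts a class's share of a 15-example training set noticeably.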

Retinal fundus image for glaucoma classification

Medical Imaging

Glaucoma Detection

Classify retinal fundus images into three glaucoma stages: normal, early, or advanced. The base model achieves only 17.6% accuracy. After 100 RL steps, the fine-tune reaches 69.2%, more than double GPT-5.4's 33.2%.

Accuracy: 17.6% → 69.2%

3x3 grid of video frames for action detection

Video Understanding

SFT

Video Action Detection

Identify the action happening in a video from a 3x3 grid of frames. This teaches the model to reason across time, connecting motion across multiple frames to describe the action. Supervised fine-tuning (SFT) raises template-match accuracy from 0% to 50%.

Template Match: 0% → 50%
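One way to build a 3x3 grid is to take nine evenly spaced frames and lay them out in reading order. The sketch below only computes which frames go where. The even spacing and both function names are assumptions; the page does not describe how the demo samples its frames.

```python
from typing import List


def grid_frame_indices(total_frames: int, rows: int = 3, cols: int = 3) -> List[int]:
    """Pick one frame index from the middle of each equal-length segment."""
    cells = rows * cols
    if total_frames < cells:
        raise ValueError("video has fewer frames than grid cells")
    return [int((i + 0.5) * total_frames / cells) for i in range(cells)]


def grid_layout(indices: List[int], cols: int = 3) -> List[List[int]]:
    """Arrange indices row by row so time reads left to right, top to bottom."""
    return [indices[i:i + cols] for i in range(0, len(indices), cols)]
```

For a 90-frame clip this selects frames 5, 15, 25, and so on up to 85, so the grid covers the whole clip rather than clustering at the start.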

Aerial airport image for airplane detection

Point Detection

Aerial Airplane Detection

Localize airplanes in satellite imagery of airports using Moondream's point skill. The base model fires 1,906 false positives. After fine-tuning with tiled inference, F1 improves from 0.30 to 0.55, while GPT-5.4 achieves only 0.10.

F1: 29.5% → 55.1%
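Tiled inference splits a large image into overlapping crops, runs the model on each crop, and maps the results back to full-image coordinates. The sketch below illustrates the idea and is not the demo's implementation. `point_fn` is a hypothetical stand-in for the model call, assumed to return points normalized to the 0 to 1 range within each tile, and the tile size, overlap, and de-duplication distance are illustrative values.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
# Hypothetical model wrapper: (left, top, width, height) -> normalized points in that tile.
PointFn = Callable[[int, int, int, int], List[Point]]


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, length)."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile, tile - overlap))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def detect_points_tiled(
    width: int, height: int, point_fn: PointFn, tile: int = 512, overlap: int = 64
) -> List[Point]:
    """Run point_fn on every tile and convert results to full-image pixels."""
    points: List[Point] = []
    for top in tile_origins(height, tile, overlap):
        for left in tile_origins(width, tile, overlap):
            w, h = min(tile, width - left), min(tile, height - top)
            for x, y in point_fn(left, top, w, h):
                points.append((left + x * w, top + y * h))
    return points


def dedupe(points: List[Point], min_dist: float) -> List[Point]:
    """Drop points closer than min_dist to one already kept (overlap duplicates)."""
    kept: List[Point] = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept
```

Because neighboring tiles overlap, an airplane near a tile border can be reported twice, so a de-duplication pass like `dedupe` is needed to avoid adding false positives of its own.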