Video Action Description
Given a 3x3 grid of frames taken from a video, describe the action. This requires the model to reason across time, connecting information from one frame to the next. Supervised fine-tuning (SFT) raises description accuracy from 54% to 74%.
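One way to build such a grid, as a minimal sketch. The page does not say how frames are chosen or ordered, so the even spacing, the row-major layout, and the helper names `sample_frame_indices`, `tile_offsets`, and `make_grid` are all assumptions made for illustration. `make_grid` relies on Pillow's `Image.new` and `Image.paste` and expects nine equally sized PIL images.

```python
def sample_frame_indices(total_frames, n=9):
    """Pick n frame indices spread evenly across a clip (assumed strategy)."""
    if total_frames < n:
        raise ValueError("clip has fewer frames than grid cells")
    # Take the middle frame of each of n equal segments.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def tile_offsets(tile_w, tile_h, cols=3, rows=3):
    """Top-left pixel offset of each cell, in row-major (reading) order."""
    return [(c * tile_w, r * tile_h) for r in range(rows) for c in range(cols)]


def make_grid(frames, cols=3, rows=3):
    """Paste equally sized PIL images into one cols x rows grid image."""
    from PIL import Image  # Pillow; imported lazily so the helpers above need no dependency

    if len(frames) != cols * rows:
        raise ValueError("expected exactly cols * rows frames")
    tile_w, tile_h = frames[0].size
    grid = Image.new("RGB", (cols * tile_w, rows * tile_h))
    for frame, offset in zip(frames, tile_offsets(tile_w, tile_h, cols, rows)):
        grid.paste(frame, offset)
    return grid
```

Reading the cells left to right, top to bottom then matches the order of the clip, which is what lets a model connect one frame to the next.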
Accuracy
| Setting | Value |
| --- | --- |
| Method | SFT |
| Steps | 1,000 |
| Training time | ~15 hrs 20 min |
| Cost | $246.40 |
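For a rough sense of scale, the figures reported on this page can be turned into per-step averages. This is back-of-the-envelope arithmetic on the stated numbers only: the training time is approximate, nothing here says every step costs the same, and the variable names are illustrative.

```python
# Figures as reported on this page.
steps = 1_000
total_cost_usd = 246.4
total_seconds = 15 * 3600 + 20 * 60  # ~15 hrs 20 min
base_acc, tuned_acc = 54, 74         # accuracy in percent

cost_per_step = total_cost_usd / steps    # average USD per step
seconds_per_step = total_seconds / steps  # average seconds per step
gain_points = tuned_acc - base_acc        # absolute gain, in percentage points
relative_gain = gain_points / base_acc    # gain relative to the base model

print(f"${cost_per_step:.4f} per step, {seconds_per_step:.1f} s per step")
print(f"+{gain_points} points ({relative_gain:.0%} relative)")
```

That works out to roughly $0.25 and 55 seconds per step on average, for a gain of 20 percentage points (about 37% relative to the base model).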
See it in action
Compare the base model against the fine-tuned model across representative benchmark examples.
Prompt
Given a 3x3 grid of frames taken from a video, describe the action.

Base model
Incorrect: A shoe is being inserted or removed from a shoe horn.
Fine-tuned model
Correct: hitting boot with shoe

Base model
Incorrect: A person manipulating a black gaming controller using a joystick and buttons.
Fine-tuned model
Correct: plugging a usb cable into a computer

Base model
Incorrect: A repeating sequence of a watermelon...
Fine-tuned model
Correct: poking watermelon so lightly that it doesn't or almost doesn't move
Perfection in 3 steps
What is fine-tuning?
Moondream starts as a general model trained on broad, public information. Fine-tuning makes it great at one specific task by teaching it the products, documents, categories, or internal information that matter to your business.
Who is this for?
This is for teams putting vision AI into production. If you already know the task and need the model to master that job, fine-tuning is how you get there. It is built for teams that need frontier performance at real-time speed.
See the code
Fine-tuning is just a small API loop: format your data, call `train_step`, and the model updates as you go.
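As a hedged sketch of the "format your data" step, the helpers below wrap labeled examples in the same payload shape that this page's snippet passes to `train_step`. The names `to_sft_example` and `build_batch` are hypothetical and not part of the Moondream library; only the dictionary keys are taken from this page.

```python
QUESTION = "Given a 3x3 grid of frames taken from a video, describe the action."


def to_sft_example(image, answer, question=QUESTION):
    """Wrap one (grid image, reference description) pair as an SFT example."""
    return {
        "mode": "sft",
        "request": {"skill": "query", "image": image, "question": question},
        "target": {"answer": answer},
    }


def build_batch(pairs):
    """Turn an iterable of (image, answer) pairs into a list of SFT examples."""
    return [to_sft_example(image, answer) for image, answer in pairs]
```

A list built this way would presumably go where the page's snippet passes its single-example list.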
import moondream as md

# Create fine-tune
ft = md.ft(
    api_key="your-api-key",
    name="action description",
    rank=32,
)

# Hidden boilerplate and data code

# Update the model
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "Given a 3x3 grid of frames taken from a video, describe the action.",
    },
    "target": {"answer": "hitting boot with shoe"},
}])

Frequently asked questions
Ready to take Moondream to production?
Need help? We'll build it for you.
We can help define the task, prepare the data, run training, validate results, and hand off a model your team can use.