Computer Use
Given a screenshot and an instruction like "Click on Zoom 161%," Moondream returns the point to click. Evaluated on a 1,000-sample held-out test set, the fine-tune raises click accuracy from 27.0% to 63.8%, beating GPT-5.4's 39.9%.
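The page does not say how a click is scored, so the following is only a minimal sketch of one common convention: a predicted point counts as correct when it lands inside the target element's bounding box. The `Box` class, the `click_accuracy` function, and the sample coordinates are illustrative assumptions, not part of Moondream's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of the target UI element, in pixels (assumed format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the edge of the box is treated as a hit.
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def click_accuracy(predictions, targets):
    """Fraction of predicted (x, y) points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("need at least one sample")
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(predictions)


# Made-up data: four target boxes and four predicted click points.
targets = [
    Box(10, 10, 50, 30),
    Box(100, 200, 180, 240),
    Box(0, 0, 20, 20),
    Box(300, 40, 360, 60),
]
predictions = [(25, 20), (90, 220), (5, 5), (330, 50)]

accuracy = click_accuracy(predictions, targets)
print(f"click accuracy: {accuracy:.1%}")  # click accuracy: 75.0%
```

Under this convention, the reported 27.0% and 63.8% on a 1,000-sample test set would correspond to 270 and 638 correct clicks.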
Click accuracy
| Training detail | Value |
| --- | --- |
| Method | SFT |
| Steps | 1,500 |
| Training time | 12 hrs 10 min |
| Cost | $181.4 |
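As a rough worked example, the training figures above can be turned into per-step and per-hour averages. This assumes time and cost are spread evenly across steps, which the page does not state; the variable names are illustrative.

```python
# Figures taken from the training summary above.
steps = 1_500
training_minutes = 12 * 60 + 10  # 12 hrs 10 min = 730 minutes
cost_usd = 181.4

# Simple averages; real per-step time and cost may vary during a run.
seconds_per_step = training_minutes * 60 / steps
cost_per_step = cost_usd / steps
cost_per_hour = cost_usd / (training_minutes / 60)

print(f"{seconds_per_step:.1f} s per step")   # 29.2 s per step
print(f"${cost_per_step:.3f} per step")       # $0.121 per step
print(f"${cost_per_hour:.2f} per hour")       # $14.91 per hour
```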
See it in action
[Figure: click predictions for the prompt Click on "Zoom 161%" from Base Moondream 3 Preview, GPT-5.4, and Fine-tuned Moondream 3 Preview]
Perfection in 3 steps
What is fine-tuning?
Moondream starts as a general model trained on broad, public information. Fine-tuning makes it great at one specific task by teaching it the products, documents, categories, or internal information that matter to your business.
Who is this for?
This is for teams putting vision AI into production. If you already know the task and need the model to master that job, fine-tuning is how you get there. It is built for teams that need frontier performance at real-time speed.
See the code
Fine-tuning is just a small API loop: format your data, call `train_step`, and the model updates as you go.
```python
import moondream as md

lens = md.Lens()

# Hidden boilerplate and data code

lens.train_step(
    training_data=[
        {
            "image": example["screenshot"],
            "prompt": example["instruction"],
            "output": {
                "x": example["target_x"],
                "y": example["target_y"],
            },
        }
        for example in training_data
    ]
)
```
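The "format your data, call `train_step`" loop can be sketched offline as follows. The snippet above hides the data code, so everything here beyond the record shape and the `train_step(training_data=...)` call is an assumption: `to_record`, `batches`, and `StubLens` are hypothetical helpers, the batch size is arbitrary, and the image values are placeholder strings because the expected image type is not shown.

```python
from itertools import islice


def to_record(example):
    """Map one raw example to the record shape used in the snippet above."""
    return {
        "image": example["screenshot"],
        "prompt": example["instruction"],
        "output": {"x": example["target_x"], "y": example["target_y"]},
    }


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


class StubLens:
    """Hypothetical stand-in for md.Lens so the sketch runs without the library.

    It only records how many examples each train_step call received.
    """

    def __init__(self):
        self.calls = []

    def train_step(self, training_data):
        self.calls.append(len(training_data))


# Made-up examples; file names and coordinates are placeholders.
raw_examples = [
    {
        "screenshot": f"shot_{i}.png",
        "instruction": f"Click on item {i}",
        "target_x": 10 * i,
        "target_y": 5 * i,
    }
    for i in range(5)
]

lens = StubLens()
for batch in batches(raw_examples, size=2):
    # One call per batch, mirroring the keyword argument shown above.
    lens.train_step(training_data=[to_record(ex) for ex in batch])

print(lens.calls)  # [2, 2, 1]
```

Swapping `StubLens()` for the real `md.Lens()` would be the natural next step, though whether the actual API expects batches of this size, or images in this form, is not confirmed by the page.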
Ready to take Moondream to production?
Need help? We'll build it for you.
We can help define the task, prepare the data, run training, validate results, and hand off a model your team can use.