Video Action Description

Given a 3x3 grid of frames taken from a video, describe the action. This requires the model to reason across time, connecting information from one frame to another. Supervised fine-tuning (SFT) raises description accuracy from 54% to 74%.
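As a rough illustration of how such a grid could be assembled, the sketch below picks nine evenly spaced frames from a clip and computes where each tile would sit on the canvas. The page does not say how frames are sampled or ordered, so the even spacing, the row-major (left to right, top to bottom) ordering, the 320x240 tile size, and the helper names `sample_frame_indices` and `grid_offsets` are all assumptions made for this example.

```python
def sample_frame_indices(total_frames: int, n: int = 9) -> list[int]:
    """Pick n frame indices spread evenly across a clip (segment midpoints)."""
    if total_frames < n:
        raise ValueError(f"need at least {n} frames, got {total_frames}")
    # Midpoint of the i-th of n equal segments, using integer arithmetic.
    return [(2 * i + 1) * total_frames // (2 * n) for i in range(n)]


def grid_offsets(frame_w: int, frame_h: int,
                 rows: int = 3, cols: int = 3) -> list[tuple[int, int]]:
    """Top-left pixel offset of each tile, in row-major (reading) order."""
    return [(c * frame_w, r * frame_h) for r in range(rows) for c in range(cols)]


# Hypothetical clip: 90 frames, each tile resized to 320x240 (assumed values).
indices = sample_frame_indices(total_frames=90)
offsets = grid_offsets(frame_w=320, frame_h=240)

# Pair each sampled frame with the position it would occupy on a 960x720 canvas.
layout = list(zip(indices, offsets))
```

Each `(frame_index, (x, y))` pair in `layout` can then drive whatever imaging library is used to paste the frames onto a single image.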

Start your own fine-tune

Accuracy

Method	SFT
Steps	1,000
Training time	~15 hrs 20 min
Cost	$246.40
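To put the figures above in per-step terms, the short calculation below derives time and cost per training step and restates the accuracy change in a few common ways. It uses only the numbers reported on this page; because the training time is given as approximate, the derived values are estimates too.

```python
# Figures reported on this page.
steps = 1_000
training_minutes = 15 * 60 + 20   # "~15 hrs 20 min" is approximate
cost_usd = 246.4
base_acc, tuned_acc = 0.54, 0.74

# Per-step and per-hour estimates.
seconds_per_step = training_minutes * 60 / steps
cost_per_step = cost_usd / steps
cost_per_hour = cost_usd / (training_minutes / 60)

# Three ways to express the same accuracy change.
gain_points = (tuned_acc - base_acc) * 100            # percentage points
relative_gain = (tuned_acc - base_acc) / base_acc     # relative to base accuracy
error_reduction = (tuned_acc - base_acc) / (1 - base_acc)  # share of errors removed

print(f"{seconds_per_step:.1f} s/step, ${cost_per_step:.4f}/step, ${cost_per_hour:.2f}/hour")
print(f"+{gain_points:.0f} points, {relative_gain:.0%} relative gain, "
      f"{error_reduction:.0%} fewer errors")
```

In other words, each step takes a little under a minute and costs roughly a quarter of a dollar, and the 20-point gain removes a bit over two fifths of the base model's errors.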

See it in action

Compare the base model against the fine-tuned model across representative benchmark examples.

Prompt

Given a 3x3 grid of frames taken from a video, describe the action.

Base model

Incorrect

A shoe is being inserted or removed from a shoe horn.

Fine-tuned model

Correct

hitting boot with shoe

Base model

Incorrect

A person manipulating a black gaming controller using a joystick and buttons.

Fine-tuned model

Correct

plugging a usb cable into a computer

Base model

Incorrect

A repeating sequence of a watermelon...

Fine-tuned model

Correct

poking watermelon so lightly that it doesn't or almost doesn't move
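The page does not describe how a description is judged correct. One simple possibility, shown here purely as a sketch, is a normalized exact match against a reference label. The `normalize` and `exact_match_accuracy` helpers are hypothetical, and the reference labels are assumed to equal the fine-tuned outputs marked Correct above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".").strip()


def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that equal their label after normalization."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, labels))
    return hits / len(labels)


# Reference labels (assumed) and the two base-model outputs shown in full above.
labels = [
    "hitting boot with shoe",
    "plugging a usb cable into a computer",
]
base_outputs = [
    "A shoe is being inserted or removed from a shoe horn.",
    "A person manipulating a black gaming controller using a joystick and buttons.",
]
tuned_outputs = [
    "hitting boot with shoe",
    "plugging a usb cable into a computer",
]

base_score = exact_match_accuracy(base_outputs, labels)
tuned_score = exact_match_accuracy(tuned_outputs, labels)
```

Exact match is strict: a fluent but differently worded description scores zero, which is one reason a base model that writes full sentences can score poorly on a benchmark with short, templated labels.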

Perfection in 3 steps

Bring examples.

Collect images for the task you want Moondream to learn.

Fine-tune.

Teach Moondream with SFT or reinforcement learning (RL). Pass your data to the API and we handle the rest.

Deploy.

Use your model through the API or run it locally with Photon.

What is fine-tuning?

Moondream starts as a general model trained on broad, public information. Fine-tuning makes it great at one specific task by teaching it the products, documents, categories, or internal information that matter to your business.

Who is this for?

This is for teams putting vision AI into production. If you already know the task and need the model to master that job, fine-tuning is how you get there. It is built for teams that need frontier performance at real-time speed.

See the code

Fine-tuning is just a small API loop: format your data, call `train_step`, and the model updates as you go.

import moondream as md

# Create fine-tune
ft = md.ft(
    api_key="your-api-key",
    name="action description",
    rank=32,
)

# Hidden boilerplate and data code
# (this is where `pil_image`, the 3x3 grid of video frames, gets loaded)

# Update the model
ft.train_step([{
    "mode": "sft",
    "request": {
        "skill": "query",
        "image": pil_image,
        "question": "Given a 3x3 grid of frames taken from a video, describe the action.",
    },
    "target": {"answer": "hitting boot with shoe"},
}])
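The "small API loop" described above could look something like the following sketch, which turns (image, answer) pairs into the payload shape used in the snippet and feeds them to a training callable in batches. The helper names `to_sft_example`, `batched`, and `run_pass` are hypothetical, and sending more than one example per `train_step` call is an assumption; the page only shows a single-example list. A stand-in recorder replaces the real `ft.train_step` so the sketch runs without an API key.

```python
from typing import Any, Callable, Iterator

PROMPT = "Given a 3x3 grid of frames taken from a video, describe the action."


def to_sft_example(image: Any, answer: str) -> dict:
    """Build one SFT payload in the shape shown in the snippet above."""
    return {
        "mode": "sft",
        "request": {"skill": "query", "image": image, "question": PROMPT},
        "target": {"answer": answer},
    }


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_pass(train_step: Callable[[list[dict]], Any],
             pairs: list[tuple[Any, str]], batch_size: int = 4) -> int:
    """Send every pair through train_step once; return the number of calls."""
    calls = 0
    for batch in batched(pairs, batch_size):
        train_step([to_sft_example(image, answer) for image, answer in batch])
        calls += 1
    return calls


# Stand-in data: strings take the place of PIL images for this dry run.
pairs = [
    ("grid-1", "hitting boot with shoe"),
    ("grid-2", "plugging a usb cable into a computer"),
    ("grid-3", "poking watermelon so lightly that it doesn't or almost doesn't move"),
]
recorded: list[list[dict]] = []          # collects what would be sent
num_calls = run_pass(recorded.append, pairs, batch_size=2)
```

With a real fine-tune, the call would presumably become `run_pass(ft.train_step, pairs)`, with each pair holding an actual PIL image of the frame grid.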

Ready to take Moondream to production?

Get started

Start with the docs and run your first experiment in a few API calls.

Start fine-tuning

Need help? We'll build it for you.

We can help define the task, prepare the data, run training, validate results, and hand off a model your team can use.

Frequently asked questions

Do I need labeled data?

How much data do I need?

Should I use RL or SFT?

How long does training take?

What if the results aren't good enough?

How do you handle my training data?

How can I use my fine-tune locally?
