Video Action Detection

Identify the action in a video from a 3x3 grid of its frames. The task teaches the model to reason across time, connecting motion across frames to name the action. Supervised fine-tuning (SFT) raises template match accuracy from 0% to 50%.

What's a fine-tune?

A fine-tune is a custom-trained version of Moondream, optimized for your task using your own data. It starts from the base model and learns to perform better on exactly the kind of images and questions you care about.

Template Match

Method	SFT
Steps	1,000
Training time	~15h 20m
Cost	Not published

Base Moondream 3 Preview

Result

A shoe is being inserted or removed from a shoe horn.

Incorrect

Fine-tuned Moondream 3 Preview

Result

hitting boot with shoe

Correct

Results

See it in action

These examples use real inputs from the fine-tune and show the output it produces on the task.

3x3 grid of video frames showing someone hitting a boot with a shoe.

Ground truth: "hitting something with something." Fine-tuned: "hitting boot with shoe." Base: rambling scene description.

3x3 grid of video frames showing someone plugging a USB cable.

Ground truth: "plugging something into something." Fine-tuned: "plugging a usb cable into a computer." Base: "manipulating a gaming controller."

3x3 grid of video frames showing someone poking a watermelon.

Ground truth: "poking something so lightly that it doesn't move." Fine-tuned: nearly exact match. Base: generic description.
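This page does not define how template match is scored. One plausible reading, sketched here as an assumption, treats each "something" in the ground-truth label as a wildcard slot and counts a prediction as correct when it fills every slot with a non-empty phrase. The function name `matches_template` and the normalization rules are hypothetical, not Moondream's published metric.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and drop surrounding whitespace and a trailing period."""
    return text.strip().rstrip(".").lower()


def matches_template(template: str, prediction: str) -> bool:
    """Return True if the prediction fills every 'something' slot in the template."""
    # Split the template on the placeholder and escape the literal pieces.
    parts = [re.escape(p) for p in re.split(r"\bsomething\b", _normalize(template))]
    # Each slot must be filled by at least one character.
    pattern = r"(.+?)".join(parts)
    return re.fullmatch(pattern, _normalize(prediction)) is not None


print(matches_template("hitting something with something", "hitting boot with shoe"))
print(matches_template("plugging something into something", "manipulating a gaming controller"))
```

Under this reading, the fine-tuned outputs in the first two examples above match their templates, and the base model's outputs do not.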

Getting Started

What is fine-tuning?

Fine-tuning takes a general-purpose vision model and trains it on your specific task. You provide examples of what you want the model to recognize, and the model learns to do that one thing very well. The result is a custom model that performs far better on your task than the base model does out of the box.

Who is this for?

Anyone building a product or system that needs to understand images. You do not need a machine learning background. If you can collect example images and describe what you want the model to see, you can fine-tune Moondream with Lens.

For teams that want hands-on help, we also offer a white glove fine-tuning service.

How it works

Prepare your data.

Collect images that represent your task. Label them with the outputs you want. For object detection, that means drawing bounding boxes. For classification, that means assigning categories.
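For the video action task on this page, the labeling step could look roughly like the sketch below: pair each frame-grid image with its action label and drop incomplete entries. The record keys mirror the `image`, `prompt`, and `output` fields used in the code sample further down; the helper name `build_records` and the use of file paths in place of loaded images are assumptions made for illustration.

```python
PROMPT = "This is a 3x3 grid of frames from a video. What action is happening?"


def build_records(labeled_paths):
    """Turn (image_path, label) pairs into training records, skipping bad rows."""
    records = []
    for path, label in labeled_paths:
        label = label.strip()
        if not path or not label:
            continue  # skip entries that are missing an image or a label
        records.append({"image": path, "prompt": PROMPT, "output": label})
    return records


records = build_records([
    ("grids/clip_001.jpg", "hitting boot with shoe"),
    ("grids/clip_002.jpg", "   "),  # empty label, dropped
])
print(len(records))
```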

Train with the API.

Send your data to the Moondream Lens API. Choose SFT to teach the model from labeled examples, or RL to optimize outputs against a scoring function. Training runs on our infrastructure. There is no hardware to manage.
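For the RL path, the scoring function is something you define. A minimal sketch for an action-description task, assuming a reward between 0 and 1 based on token overlap (F1) between the model output and a reference label. The function name `score_output` and the choice of F1 are illustrative assumptions, and how a scorer is handed to the Lens API is not covered on this page.

```python
from collections import Counter


def score_output(prediction: str, reference: str) -> float:
    """Token-level F1 between a model output and a reference label."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A graded score like this gives partial credit to outputs that name the right action but miss an object, which an all-or-nothing exact match would not.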

Deploy your model.

Your fine-tuned model is ready to use through the Moondream API or through Photon, our self-hosted inference engine. Run it in the cloud or on your own hardware.

Code

See the code

This example shows the minimum Python you need to run this workflow with SFT.

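import moondream as md

lens = md.Lens()

# training_examples is your own list of (image, label) pairs: each image is a
# 3x3 grid of video frames, each label is its action, e.g. "hitting boot with shoe".
# Train on labeled examples for this task.
lens.train_step(
    training_data=[
        {
            "image": image,
            "prompt": "This is a 3x3 grid of frames from a video. What action is happening?",
            "output": label,  # use each example's own label, not one fixed string
        }
        for image, label in training_examples
    ]
)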
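The training images above are 3x3 grids of video frames. The page does not say how frames are picked or arranged, so the sketch below makes two assumptions: nine frames are sampled at evenly spaced points in the clip, and they are laid out in reading order (left to right, top to bottom). The helper names `sample_frame_indices` and `grid_offsets` are hypothetical.

```python
def sample_frame_indices(total_frames: int, count: int = 9):
    """Pick `count` evenly spaced frame indices, one from the middle of each segment."""
    if total_frames < count:
        raise ValueError("video has fewer frames than grid cells")
    return [int((i + 0.5) * total_frames / count) for i in range(count)]


def grid_offsets(frame_width: int, frame_height: int, cols: int = 3, rows: int = 3):
    """Top-left pixel offset of each grid cell, in reading order."""
    return [(c * frame_width, r * frame_height) for r in range(rows) for c in range(cols)]


indices = sample_frame_indices(90)
offsets = grid_offsets(320, 180)
print(indices)
print(offsets[:3])
```

With an imaging library of your choice, each sampled frame would then be pasted at its offset onto a canvas three frames wide and three frames tall.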

Why Moondream

The fastest path from idea to production

The point is not to learn another platform. The point is to get a custom model into production with as little friction as possible.

Fully hosted.

Training runs on our infrastructure. You send data through the Lens API and get a model back. No GPUs to rent, no environments to configure, and no drivers to debug.

API-only.

Fine-tuning is a handful of API calls. There is no UI to learn, no platform to onboard to, and no proprietary format to adopt. It fits into the workflow you already have.

Pay as you go.

You pay for the compute you use. Fine-tuning starts at a few dollars. Every account gets $5 of free credits each month, so you can run your first fine-tune at no cost.

Built by the model team.

Moondream's fine-tuning system is built by the same team that designed the model architecture. The training pipeline is optimized specifically for Moondream.

White Glove

Need help? We'll build it for you.

Not every team has the bandwidth to run a fine-tune in-house. If you want help, our team can handle the process end to end.

White-glove service

Our team works with you to define the task, prepare the data, run the training, and validate the results. When we are done, we hand off everything: the fine-tuned model, the training data, the evaluation benchmarks, and documentation on how to maintain and improve the model over time.

You own the model. You own the data. We just get you there faster.

Task definition and benchmark design.

Data review, preparation, and labeling guidance.

Training, evaluation, and handoff documentation.

A model your team can run through the API or Photon.

FAQ

Frequently asked questions

Concise answers for teams evaluating Moondream for production fine-tuning.

Ready to take Moondream to production?

Every Moondream account includes $5 of free credits per month. No credit card required.

Start fine-tuning

Start with the docs and run your first experiment in a few API calls.

Talk to our team

Tell us what you are building. We can help with data, training, evaluation, and deployment.

Frequently asked questions

How long does a fine-tune take?

How much does it cost?

Do I need machine learning experience?

What's the difference between SFT and RL?

Can I fine-tune on my own data?

What if the fine-tune does not work well enough?

Can I deploy the fine-tuned model on my own hardware?
