Video Activity Recognition for Retail

National Retail Chain

Automated detection of specific actions in surveillance footage for loss prevention and operations.

The Challenge

Monitoring in-store activity across hundreds of camera feeds required a large team of security operators. Human reviewers could only actively watch a fraction of feeds at any time, and fatigue led to missed incidents. The retailer needed automated activity classification to flag events of interest in real time.

The Solution

The Moondream model was fine-tuned on labeled video frames spanning 174 action categories and identifies the specific activity in each frame. Supervised fine-tuning (SFT) on temporal action data enabled the model to distinguish between similar actions that the base model consistently confused.
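As an illustration of the frame-level query task, here is a minimal sketch of how a closed-set classification question might be posed for one frame and the free-text answer mapped back to a known label. The label names, the prompt wording, and the `ask` callable are all hypothetical; `ask` stands in for whatever query call the deployed model exposes and is not taken from the case study.

```python
from typing import Callable, Optional, Sequence


def build_prompt(labels: Sequence[str]) -> str:
    """Compose a closed-set classification question for a single frame."""
    options = ", ".join(labels)
    return (
        "Which one of the following actions is shown in this frame? "
        f"Answer with exactly one option: {options}."
    )


def parse_answer(answer: str, labels: Sequence[str]) -> Optional[str]:
    """Map free-text model output to a known label, or None if nothing matches."""
    cleaned = answer.strip().strip(".").lower()
    # Prefer an exact match on the whole answer.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Otherwise accept a label mentioned inside a longer answer,
    # checking longer labels first so partial overlaps do not win.
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in cleaned:
            return label
    return None


def classify_frame(
    frame: object,
    labels: Sequence[str],
    ask: Callable[[object, str], str],
) -> Optional[str]:
    """Query the model about one frame and return the matched label."""
    return parse_answer(ask(frame, build_prompt(labels)), labels)


# Hypothetical label subset, for illustration only.
LABELS = ["picking up item", "putting item back", "opening bag"]


def fake_ask(frame: object, prompt: str) -> str:
    """Stand-in for a real model call (assumption, not a real API)."""
    return "Putting item back."


result = classify_frame(None, LABELS, fake_ask)
```

Returning `None` for unmatched answers keeps malformed model output from being silently counted as a valid category.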

Business Impact

Action classification accuracy improved from 10.2% to 38.4%
Covers 174 distinct action categories
Enables real-time flagging across hundreds of camera feeds
Reduced security staffing requirements for video monitoring
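Real-time flagging across many feeds usually needs some smoothing so that a single misclassified frame does not raise an alert. The sketch below shows one possible approach, assuming per-frame labels arrive tagged with a camera ID: an action is flagged only when it appears in at least `min_hits` of the last `window` predictions for that camera. The class name, parameters, and thresholds are hypothetical and not part of the described system.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Set


class FeedFlagger:
    """Per-camera sliding-window filter over frame-level predictions."""

    def __init__(self, watched: Set[str], window: int = 5, min_hits: int = 3):
        self.watched = watched      # actions that should raise a flag
        self.min_hits = min_hits    # required occurrences within the window
        # One bounded history per camera; old predictions fall off automatically.
        self.history: Dict[str, Deque[Optional[str]]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def update(self, camera_id: str, label: Optional[str]) -> Optional[str]:
        """Record one frame prediction; return the action to flag, if any."""
        recent = self.history[camera_id]
        recent.append(label)
        if label in self.watched:
            hits = sum(1 for seen in recent if seen == label)
            if hits >= self.min_hits:
                # Reset so the same event is not re-flagged on every later frame.
                recent.clear()
                return label
        return None


flagger = FeedFlagger({"concealing item"}, window=5, min_hits=3)
stream = [
    "concealing item",
    "browsing",
    "concealing item",
    "concealing item",
    "concealing item",
]
flags = [flagger.update("cam-1", label) for label in stream]
```

Keeping a separate bounded history per camera means memory grows with the number of feeds, not with the length of the footage.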

Complete Vision AI Stack

This solution uses Moondream's integrated stack from model training through production deployment. Every layer is designed to work together, so you go from problem to deployed system without stitching together tools from different vendors.


AI Model Layer

Base Model: Moondream 3
Fine-Tuning: SFT via Lens
Production Model: Moondream 2

Deployment Layer

Inference Engine: Photon
Target Hardware: NVIDIA T4
Deployment: On-Premises

Technical Details

Training Method: SFT
Training Steps: 1000
Task Type: query
Accuracy: 0.5413 → 0.7391
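For readers who want to reproduce this kind of before/after comparison, the sketch below shows one common way to score a query-style classifier, assuming the metric is top-1 exact-match accuracy (the case study does not state how accuracy was computed). It also derives the absolute and relative gain from the two figures above, which works out to roughly 19.8 percentage points. The function names are illustrative.

```python
from typing import Sequence, Tuple


def top1_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of examples where the predicted label equals the true label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative ratio) between two accuracy values."""
    return after - before, after / before


# Toy example: three of four predictions match.
toy = top1_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])

# Accuracy figures as reported in the technical details.
abs_gain, rel_gain = gains(0.5413, 0.7391)
```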

Ready to build your solution?

Talk to our team about how Moondream can solve your specific vision AI challenge, from model training through production deployment.
