Photon
The official Moondream inference engine.
Runs all Moondream models on edge, desktop, or server.
The fastest VLM inference on the planet.
Performance
Fast by design
Moondream was architected for fast inference. Photon takes that further with a custom inference engine: optimized scheduling, native image processing, and purpose-built CUDA kernels. Real-time on server hardware. Responsive on edge devices.
Moondream 2 · Full benchmarks
Compatibility
Runs everywhere you need it
Photon runs on NVIDIA GPUs from Ampere through Blackwell, from embedded devices to multi-GPU servers.
Server
Cloud inference, batch processing, and high-throughput APIs.
- H100 · 80 GB
- A100 · 80 GB
- L40S · 48 GB
- A10 · 24 GB
- L4 · 24 GB
Desktop
Local development, prototyping, and on-prem workloads.
- RTX 4090 · 24 GB
- RTX 4080 · 16 GB
- RTX 3090 · 24 GB
- RTX 3060 · 12 GB
Any Ampere or newer GPU
Edge
Cameras, robots, drones, and embedded systems.
- Jetson AGX Orin · 32 / 64 GB
- Jetson Orin NX · 16 GB
- Jetson Orin Nano · 8 GB
JetPack 6.0+
Quickstart
How do I use it?
Install the Python package and start running inference locally in a few lines. The API key is used to access your fine-tunes and for billing telemetry. All inference runs locally on your hardware.
See the full documentation for more details.
pip install moondream

import moondream as md
from PIL import Image
model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")
# Caption
print(model.caption(image)["caption"])
# Visual question answering
print(model.query(image, "What's in this image?")["answer"])
# Stream the response
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)

Features
What's under the hood
Everything you need to serve Moondream in production, built into the engine.
Streaming
Real-time token streaming for query and caption tasks.
All Moondream skills
Captioning, visual Q&A, pointing, object detection, and segmentation.
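For example, detection. Assuming the client's `detect` call returns objects with coordinates normalized to the 0–1 range (an assumption about the output shape, not confirmed here), a small helper scales them to pixel boxes:

```python
# Hedged sketch: assumes detect() returns objects with normalized
# x_min / y_min / x_max / y_max coordinates in [0, 1].

def to_pixel_boxes(objects, width, height):
    """Scale normalized bounding boxes to integer pixel coordinates."""
    return [
        (
            round(o["x_min"] * width),
            round(o["y_min"] * height),
            round(o["x_max"] * width),
            round(o["y_max"] * height),
        )
        for o in objects
    ]

# e.g. objects = model.detect(image, "face")["objects"]
objects = [{"x_min": 0.25, "y_min": 0.1, "x_max": 0.75, "y_max": 0.9}]
print(to_pixel_boxes(objects, width=640, height=480))
# [(160, 48, 480, 432)]
```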
Fine-tune support
Load your Moondream fine-tunes by ID, pulled automatically from the cloud.
Automatic batching
Batches incoming requests transparently without adding per-request latency.
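One common way to batch without adding per-request latency (a minimal sketch, not Photon's actual scheduler) is to drain whatever is queued right now, capped at a batch size, instead of waiting on a timer, so a lone request is never held back:

```python
# Illustrative sketch of transparent batching: take everything currently
# waiting, up to MAX_BATCH, and run it as one batch. A single request
# forms a batch of one immediately rather than waiting for peers.
from collections import deque

MAX_BATCH = 8  # illustrative cap

def next_batch(queue: deque):
    """Drain up to MAX_BATCH pending requests from the queue."""
    batch = []
    while queue and len(batch) < MAX_BATCH:
        batch.append(queue.popleft())
    return batch

queue = deque(f"req-{i}" for i in range(11))
print(next_batch(queue))  # first 8 requests
print(next_batch(queue))  # remaining 3
```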
Prefix caching
Caches repeated prompts and images so subsequent requests skip redundant work.
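The idea can be sketched as a memo table keyed by the request prefix (a simplified illustration, not Photon's implementation): hash the image bytes plus the prompt prefix, and reuse the expensive prefill state on a hit:

```python
# Illustrative prefix-cache sketch: prefill results are memoized by a
# hash of (image bytes, prompt), so a repeated image + prompt pair
# skips the redundant prefill work and goes straight to decoding.
import hashlib

class PrefixCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, image_bytes, prompt):
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(prompt.encode())
        return h.digest()

    def prefill(self, image_bytes, prompt, compute):
        """Return cached prefill state, computing it only on a miss."""
        key = self._key(image_bytes, prompt)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = compute(image_bytes, prompt)
        return self._cache[key]

cache = PrefixCache()
fake_prefill = lambda img, prompt: ("kv-state", len(img), prompt)

cache.prefill(b"image-1", "Describe this image.", fake_prefill)
cache.prefill(b"image-1", "Describe this image.", fake_prefill)  # hit
print(cache.hits, cache.misses)  # 1 1
```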
Paged KV cache
Memory-efficient attention cache for handling many concurrent requests.
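The bookkeeping behind a paged KV cache can be sketched like this (an illustration of the general technique, not Photon's code): KV memory is split into fixed-size pages, each sequence grabs pages on demand, and pages return to a shared pool when the sequence finishes:

```python
# Illustrative paged KV-cache sketch: sequences allocate fixed-size
# pages lazily, so memory scales with actual sequence length and
# freed pages are immediately reusable by other concurrent requests.
PAGE_SIZE = 16  # tokens per page (illustrative)

class PagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        return self.free.pop()

    def release(self, pages):
        self.free.extend(pages)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.pages = []   # page table: logical block -> physical page
        self.length = 0   # tokens written so far

    def append_token(self):
        if self.length % PAGE_SIZE == 0:  # current page full (or none yet)
            self.pages.append(self.pool.alloc())
        self.length += 1

    def finish(self):
        self.pool.release(self.pages)
        self.pages = []

pool = PagePool(num_pages=64)
seq = Sequence(pool)
for _ in range(40):       # 40 tokens -> ceil(40 / 16) = 3 pages
    seq.append_token()
print(len(seq.pages))     # 3
seq.finish()
print(len(pool.free))     # 64
```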
FAQ