Run Moondream fast, wherever production lives.
Photon is the high-performance runtime for Moondream models on edge devices, desktops, servers, and private clouds. Same model skills, lower latency, and no generic inference stack in the middle.
~2x
faster than vLLM on comparable Moondream workloads
Realtime inference
34ms end-to-end inference on an H100
8+
hardware tiers from Jetson devices to H100 servers
Photon combines optimized scheduling, native image processing, and purpose-built CUDA kernels for Moondream instead of routing through a generic model-serving path.
Streaming, automatic batching, prefix caching, and paged KV cache are built into the engine so teams can serve Moondream under real load.
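Prefix caching, for instance, works by keying cached attention state on the prompt prefix so that repeated prompts and images skip recomputation. A minimal sketch of the idea (class and function names here are illustrative, not Photon internals):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: reuse computed state for prompt prefixes seen before."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix_tokens):
        # Hash the token sequence to get a stable cache key.
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_state):
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1          # prefix seen before: skip recomputation
        else:
            self.misses += 1        # first time: run the expensive prefill
            self._store[key] = compute_state(prefix_tokens)
        return self._store[key]

cache = PrefixCache()
# Stand-in for an expensive prefill pass over the prompt tokens.
expensive_prefill = lambda tokens: [t * 2 for t in tokens]

cache.get_or_compute([1, 2, 3], expensive_prefill)  # miss: computed
cache.get_or_compute([1, 2, 3], expensive_prefill)  # hit: reused
```

A real engine caches per-block KV tensors rather than whole results, but the hit/miss economics are the same: identical system prompts and repeated images pay the prefill cost once.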
Run the same Moondream skills on edge devices, local workstations, on-prem servers, or your own cloud infrastructure.
Generic inference engines leave Moondream performance on the table.
Photon only has to serve Moondream, so the runtime can make model-specific decisions about scheduling, memory, image preprocessing, and streaming behavior.
NVIDIA GPUs from embedded edge to multi-GPU servers.
Deploy the same engine on Jetson devices, desktop cards, and server-class GPUs. Your deployment can move between hardware tiers without changing the model API.
Cloud inference, batch jobs, and high-throughput APIs.
Local development, prototyping, and on-prem workloads.
Cameras, robots, drones, and embedded systems.
Install the SDK and run locally in a few lines.
The API key is used to access your fine-tunes and report billing telemetry. Your images, prompts, and inference stay on your hardware.
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

print(model.caption(image)["caption"])
print(model.query(image, "What is happening?")["answer"])
Everything you need to serve Moondream in production.
Real-time token streaming for query and caption tasks.
01 Captioning, visual Q&A, pointing, object detection, and segmentation.
02 Native support for Moondream fine-tunes loaded directly by ID.
03 Automatic batching without adding per-request latency.
04 Prefix caching for repeated prompts and images.
05 Paged KV cache for many concurrent requests.
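Automatic batching from the list above can be pictured as a scheduler that groups requests arriving close together, so GPU work is amortized without making any single request wait on a full batch. A toy sketch (names and thresholds are made up, not Photon's scheduler):

```python
class BatchScheduler:
    """Toy batching sketch: queue requests, run them together when full."""

    def __init__(self, max_batch=8):
        self.max_batch = max_batch
        self.queue = []
        self.batches_run = []

    def submit(self, request):
        self.queue.append(request)
        # Flush eagerly once the batch is full; a real engine also flushes
        # on a short timer so lone requests see no added latency.
        if len(self.queue) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.queue:
            self.batches_run.append(list(self.queue))
            self.queue.clear()

sched = BatchScheduler(max_batch=3)
for i in range(7):
    sched.submit(f"req-{i}")
sched.flush()  # drain the remainder
```

Here seven requests become three GPU launches (sizes 3, 3, and 1) instead of seven; production engines refine this with continuous batching, admitting new requests into an in-flight batch between decode steps.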
Frequently asked questions
Put Photon under your Moondream workload.
Start with the SDK, keep inference on your hardware, and bring in our team when you need dedicated deployment help.