Photon
The official Moondream inference engine.
Runs all Moondream models on edge, desktop, or server.
The fastest VLM inference on the planet.
Performance
Fast by design
Moondream was architected for fast inference. Photon takes that further with a custom inference engine: optimized scheduling, native image processing, and purpose-built CUDA kernels. Real-time on server hardware. Responsive on edge devices.
Moondream 2 · Full benchmarks
Compatibility
Runs everywhere you need it
Photon runs on NVIDIA GPUs from Ampere through Blackwell, from embedded devices to multi-GPU servers.
Server
Cloud inference, batch processing, and high-throughput APIs.
- H100 · 80 GB
- A100 · 80 GB
- L40S · 48 GB
- A10 · 24 GB
- L4 · 24 GB
Desktop
Local development, prototyping, and on-prem workloads.
- RTX 4090 · 24 GB
- RTX 4080 · 16 GB
- RTX 3090 · 24 GB
- RTX 3060 · 12 GB
Any Ampere or newer GPU
Edge
Cameras, robots, drones, and embedded systems.
- Jetson AGX Orin · 32 / 64 GB
- Jetson Orin NX · 16 GB
- Jetson Orin Nano · 8 GB
JetPack 6.0+
Quickstart
How do I use it?
Install the Python package and start running inference locally in a few lines. The API key is used to access your fine-tunes and for billing telemetry. All inference runs locally on your hardware.
See the full documentation for more details.
pip install moondream

import moondream as md
from PIL import Image
model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")
# Caption
print(model.caption(image)["caption"])
# Visual question answering
print(model.query(image, "What's in this image?")["answer"])
# Stream the response
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)

Features
What's under the hood
Everything you need to serve Moondream in production, built into the engine.
Streaming
Real-time token streaming for query and caption tasks.
All Moondream skills
Captioning, visual Q&A, pointing, object detection, and segmentation.
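For example, detection. Assuming the client's `detect` call returns objects with coordinates normalized to the 0–1 range (an assumption about the output shape, not confirmed here), a small helper scales them to pixel boxes:

```python
# Hedged sketch: assumes detect() returns objects with normalized
# x_min / y_min / x_max / y_max coordinates in [0, 1].

def to_pixel_boxes(objects, width, height):
    """Scale normalized bounding boxes to integer pixel coordinates."""
    return [
        (
            round(o["x_min"] * width),
            round(o["y_min"] * height),
            round(o["x_max"] * width),
            round(o["y_max"] * height),
        )
        for o in objects
    ]

# e.g. objects = model.detect(image, "face")["objects"]
objects = [{"x_min": 0.25, "y_min": 0.1, "x_max": 0.75, "y_max": 0.9}]
print(to_pixel_boxes(objects, width=640, height=480))
# [(160, 48, 480, 432)]
```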
Fine-tune support
Load your Moondream fine-tunes by ID, pulled automatically from the cloud.
Automatic batching
Batches incoming requests transparently without adding per-request latency.
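One common way to batch without adding per-request latency (a minimal sketch, not Photon's actual scheduler) is to drain whatever is queued right now, capped at a batch size, instead of waiting on a timer, so a lone request is never held back:

```python
# Illustrative sketch of transparent batching: take everything currently
# waiting, up to MAX_BATCH, and run it as one batch. A single request
# forms a batch of one immediately rather than waiting for peers.
from collections import deque

MAX_BATCH = 8  # illustrative cap

def next_batch(queue: deque):
    """Drain up to MAX_BATCH pending requests from the queue."""
    batch = []
    while queue and len(batch) < MAX_BATCH:
        batch.append(queue.popleft())
    return batch

queue = deque(f"req-{i}" for i in range(11))
print(next_batch(queue))  # first 8 requests
print(next_batch(queue))  # remaining 3
```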
Prefix caching
Caches repeated prompts and images so subsequent requests skip redundant work.
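The idea can be sketched as a memo table keyed by the request prefix (a simplified illustration, not Photon's implementation): hash the image bytes plus the prompt prefix, and reuse the expensive prefill state on a hit:

```python
# Illustrative prefix-cache sketch: prefill results are memoized by a
# hash of (image bytes, prompt), so a repeated image + prompt pair
# skips the redundant prefill work and goes straight to decoding.
import hashlib

class PrefixCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, image_bytes, prompt):
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(prompt.encode())
        return h.digest()

    def prefill(self, image_bytes, prompt, compute):
        """Return cached prefill state, computing it only on a miss."""
        key = self._key(image_bytes, prompt)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = compute(image_bytes, prompt)
        return self._cache[key]

cache = PrefixCache()
fake_prefill = lambda img, prompt: ("kv-state", len(img), prompt)

cache.prefill(b"image-1", "Describe this image.", fake_prefill)
cache.prefill(b"image-1", "Describe this image.", fake_prefill)  # hit
print(cache.hits, cache.misses)  # 1 1
```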
Paged KV cache
Memory-efficient attention cache for handling many concurrent requests.
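The bookkeeping behind a paged KV cache can be sketched like this (an illustration of the general technique, not Photon's code): KV memory is split into fixed-size pages, each sequence grabs pages on demand, and pages return to a shared pool when the sequence finishes:

```python
# Illustrative paged KV-cache sketch: sequences allocate fixed-size
# pages lazily, so memory scales with actual sequence length and
# freed pages are immediately reusable by other concurrent requests.
PAGE_SIZE = 16  # tokens per page (illustrative)

class PagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        return self.free.pop()

    def release(self, pages):
        self.free.extend(pages)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.pages = []   # page table: logical block -> physical page
        self.length = 0   # tokens written so far

    def append_token(self):
        if self.length % PAGE_SIZE == 0:  # current page full (or none yet)
            self.pages.append(self.pool.alloc())
        self.length += 1

    def finish(self):
        self.pool.release(self.pages)
        self.pages = []

pool = PagePool(num_pages=64)
seq = Sequence(pool)
for _ in range(40):       # 40 tokens -> ceil(40 / 16) = 3 pages
    seq.append_token()
print(len(seq.pages))     # 3
seq.finish()
print(len(pool.free))     # 64
```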
FAQ