Photon
Run Moondream wicked fast.
Any hardware, any scale.
Performance
Fast by design
Moondream was architected for fast inference. Photon takes that further with a custom inference engine: optimized scheduling, native image processing, and purpose-built CUDA kernels. Real-time on server hardware. Responsive on edge devices.
Moondream 2 · Full benchmarks
Compatibility
Runs on everything you deploy to
Photon runs on NVIDIA GPUs from Ampere through Blackwell, from embedded devices to multi-GPU servers.
Server
Cloud inference, batch processing, and high-throughput APIs.
- H100 (80 GB)
- A100 (80 GB)
- L40S (48 GB)
- A10 (24 GB)
- L4 (24 GB)
Desktop
Local development, prototyping, and on-prem workloads.
- RTX 4090 (24 GB)
- RTX 4080 (16 GB)
- RTX 3090 (24 GB)
- RTX 3060 (12 GB)
Any Ampere or newer GPU
Edge
Cameras, robots, drones, and embedded systems.
- Jetson AGX Orin (32 / 64 GB)
- Jetson Orin NX (16 GB)
- Jetson Orin Nano (8 GB)
JetPack 6.0+
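To check whether a particular GPU falls in the supported range, one option is to compare its CUDA compute capability against the Ampere baseline. The sketch below assumes that "Ampere or newer" corresponds to compute capability 8.0 or higher. That mapping is our reading of the hardware list, not a documented Photon check, and `is_ampere_or_newer` is a hypothetical helper.

```python
# Assumption: "Ampere or newer" means CUDA compute capability >= 8.0.
AMPERE_MIN = (8, 0)

def is_ampere_or_newer(compute_capability: str) -> bool:
    """Return True if a 'major.minor' compute capability is at least 8.0."""
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    # Compare as integer tuples so that "10.0" sorts above "8.0".
    return (major, minor) >= AMPERE_MIN

# RTX 3090 reports 8.6, H100 reports 9.0, and the Turing-era T4 reports 7.5.
for name, cap in [("RTX 3090", "8.6"), ("H100", "9.0"), ("T4", "7.5")]:
    print(name, is_ampere_or_newer(cap))
```

The comparison uses integer tuples rather than strings because a string comparison would rank "10.0" below "8.0".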
Quickstart
How do I use it?
Install the Python package and start running inference locally in a few lines. The API key is used to access your fine-tunes and for billing telemetry. All inference runs locally on your hardware.
See the full documentation for more details.
pip install moondream

import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

# Caption
print(model.caption(image)["caption"])

# Visual question answering
print(model.query(image, "What's in this image?")["answer"])

# Stream the response
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)

Features
What's under the hood
Everything you need to serve Moondream in production, built into the engine.
Streaming
Real-time token streaming for query and caption tasks.
All Moondream skills
Captioning, visual Q&A, pointing, object detection, and segmentation.
Fine-tune support
Load your Moondream fine-tunes by ID, pulled automatically from the cloud.
Automatic batching
Batches incoming requests transparently without adding per-request latency.
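One way to batch without adding latency is to group only the requests that are already waiting, instead of pausing to collect more. The toy below illustrates that idea with `asyncio`. It is a sketch of the general technique, not Photon's scheduler, and `Batcher`, `submit`, and `run_batch` are hypothetical names.

```python
import asyncio

class Batcher:
    """Toy request batcher: groups whatever is already queued, never waits for more."""

    def __init__(self, run_batch, max_batch_size=8):
        self.run_batch = run_batch            # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded for inspection only

    async def submit(self, item):
        """Enqueue one request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def worker(self):
        while True:
            # Block until at least one request exists.
            batch = [await self.queue.get()]
            # Drain requests that are already waiting; do not sleep for stragglers.
            while len(batch) < self.max_batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            outputs = self.run_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)

async def main():
    # The lambda stands in for a model forward pass over a batch of prompts.
    batcher = Batcher(run_batch=lambda prompts: [p.upper() for p in prompts])
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    task.cancel()
    return results, batcher.batch_sizes

results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)
```

A lone request is processed as soon as the worker is free, so batching costs it nothing. Requests that arrive while the worker is busy are picked up together in the next batch.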
Prefix caching
Caches repeated prompts and images so subsequent requests skip redundant work.
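As a rough illustration of the idea, the sketch below reuses saved state for the longest prefix of a request that has been seen before and processes only the remaining tokens. It is a simplified model under stated assumptions, not Photon's cache: `PrefixCache`, `prefill`, and `step` are hypothetical, the image is represented by a single placeholder token, and the "state" is a plain tuple standing in for real attention state.

```python
class PrefixCache:
    """Toy prefix cache: maps a tuple of tokens to the state reached after them."""

    def __init__(self):
        self.states = {}

    def longest_prefix(self, tokens):
        """Return (length, state) for the longest cached prefix of tokens."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.states:
                return end, self.states[key]
        return 0, None

    def store(self, tokens, state):
        self.states[tuple(tokens)] = state

def prefill(tokens, cache, step):
    """Process tokens, skipping the part already covered by a cached prefix."""
    start, state = cache.longest_prefix(tokens)
    processed = 0
    for i in range(start, len(tokens)):
        state = step(state, tokens[i])
        processed += 1
        # Storing every prefix keeps the toy simple; a real cache would be bounded.
        cache.store(tokens[: i + 1], state)
    return state, processed

# Stand-in for the model: the state is just the tuple of tokens seen so far.
step = lambda state, token: (state or ()) + (token,)

cache = PrefixCache()
first = ["<image:abc>", "Describe", "this"]
second = ["<image:abc>", "Count", "cars"]

_, processed_first = prefill(first, cache, step)
state_second, processed_second = prefill(second, cache, step)
_, processed_repeat = prefill(first, cache, step)
print(processed_first, processed_second, processed_repeat)
```

The second request shares only the image token with the first, so it skips that one step. The repeated request is fully cached and does no work at all.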
Paged KV cache
Memory-efficient attention cache for handling many concurrent requests.
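The general idea behind a paged cache is to hand out memory in fixed-size blocks on demand, rather than reserving one maximum-length buffer per request. The sketch below models only the bookkeeping, under the assumption of a single shared pool. It is not Photon's implementation, and `PagedKVCache`, `append_token`, and `release` are hypothetical names.

```python
class PagedKVCache:
    """Toy block allocator: sequences borrow fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of block indices
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return (block index, offset in block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Allocate a new block only when every block the sequence holds is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("req-1")   # 20 tokens need two 16-token blocks
for _ in range(5):
    cache.append_token("req-2")   # 5 tokens fit in one block

blocks_req1 = len(cache.block_tables["req-1"])
free_before = len(cache.free_blocks)
cache.release("req-1")
free_after = len(cache.free_blocks)
print(blocks_req1, free_before, free_after)
```

Because a short request holds only the blocks it has filled, many concurrent requests can share a pool that would be too small if each one reserved its worst-case length up front.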
FAQ