
Photon

Run Moondream wicked fast.
Any hardware, any scale.

Get Started

Performance

Fast by design

Moondream was architected for fast inference. Photon takes that further with a custom inference engine: optimized scheduling, native image processing, and purpose-built CUDA kernels. Realtime on server hardware. Responsive on edge devices.

[Throughput benchmarks (req/s) for H100, A100, L40S, A10, L4, and Jetson]

Moondream 2 · Full benchmarks

Compatibility

Runs on everything you deploy to

Photon runs on NVIDIA GPUs from Ampere through Blackwell, from embedded devices to multi-GPU servers.

Server

Cloud inference, batch processing, and high-throughput APIs.

  • H100 · 80 GB
  • A100 · 80 GB
  • L40S · 48 GB
  • A10 · 24 GB
  • L4 · 24 GB

Desktop

Local development, prototyping, and on-prem workloads.

  • RTX 4090 · 24 GB
  • RTX 4080 · 16 GB
  • RTX 3090 · 24 GB
  • RTX 3060 · 12 GB

Any Ampere or newer GPU

Edge

Cameras, robots, drones, and embedded systems.

  • Jetson AGX Orin · 32 / 64 GB
  • Jetson Orin NX · 16 GB
  • Jetson Orin Nano · 8 GB

JetPack 6.0+
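If you're unsure whether a GPU meets the Ampere-or-newer requirement, compute capability is the thing to check: Ampere is 8.x, Ada is 8.9, Hopper is 9.0, and Blackwell is 10.x. A minimal sketch, assuming `nvidia-smi` is on your PATH and your driver supports the `compute_cap` query field:

```python
# Sketch: check the Ampere-or-newer requirement (compute capability >= 8.0).
# The nvidia-smi query below assumes a reasonably recent driver; older
# drivers may not expose the compute_cap field.
import subprocess

def is_supported(compute_capability: str) -> bool:
    """Ampere (8.x), Ada (8.9), Hopper (9.0), and Blackwell (10.x) qualify."""
    major = int(compute_capability.split(".")[0])
    return major >= 8

def gpu_compute_capability() -> str:
    """Return the compute capability of the first visible GPU, e.g. '8.6'."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]

# print(is_supported(gpu_compute_capability()))
```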

Quickstart

How do I use it?

Install the Python package and start running inference locally in a few lines. The API key is used to access your fine-tunes and for billing telemetry. All inference runs locally on your hardware.

See the full documentation for more details.

pip install moondream
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)

image = Image.open("photo.jpg")

# Caption
print(model.caption(image)["caption"])

# Visual question answering
print(model.query(image, "What's in this image?")["answer"])

# Stream the response
for chunk in model.caption(image, stream=True)["caption"]:
    print(chunk, end="", flush=True)
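Detection and pointing follow the same pattern as caption and query. One practical detail: Moondream's detection boxes come back with coordinates in the 0-1 range (check the docs for your version), so you'll often want to scale them to pixels. A small sketch, with the hypothetical helper `to_pixels` doing the scaling:

```python
# Sketch: converting Moondream's normalized detection boxes to pixel
# coordinates. Assumes detect() returns boxes with x_min/y_min/x_max/y_max
# normalized to 0-1; to_pixels is an illustrative helper, not part of the SDK.

def to_pixels(box, width, height):
    """Scale a normalized box dict to integer pixel coordinates."""
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )

# objects = model.detect(image, "face")["objects"]
# for box in objects:
#     print(to_pixels(box, image.width, image.height))
```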

Features

What's under the hood

Everything you need to serve Moondream in production, built into the engine.

  • Streaming

    Real-time token streaming for query and caption tasks.

  • All Moondream skills

    Captioning, visual Q&A, pointing, object detection, and segmentation.

  • Fine-tune support

    Load your Moondream fine-tunes by ID, pulled automatically from the cloud.

  • Automatic batching

    Batches incoming requests transparently without adding per-request latency.

  • Prefix caching

    Caches repeated prompts and images so subsequent requests skip redundant work.

  • Paged KV cache

    Memory-efficient attention cache for handling many concurrent requests.
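The prefix-caching idea above can be sketched in a few lines. This is an illustrative toy, not Photon's actual implementation: key the expensive encode step by a hash of the input, so a repeated prompt or image is served from cache instead of being recomputed.

```python
# Illustrative sketch of prefix caching (not Photon's engine code):
# cache the expensive encode step keyed by a content hash, so repeated
# inputs skip redundant work.
import hashlib

class PrefixCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0

    def get_or_compute(self, data: bytes, encode):
        """Return the cached encoding of data, computing it on first sight."""
        key = hashlib.sha256(data).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = encode(data)
        return self._cache[key]

cache = PrefixCache()
cache.get_or_compute(b"same image bytes", lambda d: len(d))
cache.get_or_compute(b"same image bytes", lambda d: len(d))  # served from cache
```

In a real engine the cached unit is the attention KV state for a shared prompt/image prefix rather than a raw encoding, but the lookup structure is the same.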

FAQ

Frequently asked questions