Moondream's mission is simple: production vision AI that runs everywhere. Most cloud VLMs take seconds to respond, which is too slow for interactive applications and for systems that run on-device or at the edge. So we built the full stack ourselves: our own models, a fine-tuning service (Lens), and an inference engine (Photon). Today's Photon update, shipping in the Moondream 1.2.0 release, makes it faster still.
Getting Started
To install:
```bash
pip install moondream
```

Then run locally by setting `local=True`:

```python
import moondream as md
from PIL import Image

# local=True runs inference on this machine via Photon
model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")
print(model.caption(image)["caption"])
```
That `local=True` flag is the important part. It tells Moondream to run inference on your machine using Photon instead of sending the request to the hosted API. With Photon 1.2.0, local Moondream inference now supports:
| Platform | What's new |
|---|---|
| Apple Silicon | Native inference on M-series Macs |
| Windows x86_64 | Native CUDA inference (no WSL or Linux containers required) |
| NVIDIA Blackwell | Support for B200 and RTX PRO 6000 |
| NVIDIA Jetson Thor | Edge inference on JetPack 7 / CUDA 13 |
| Existing NVIDIA GPUs | Faster prefill, MoE, dispatch, and tail latency |
The result: Moondream is now easier to deploy across laptops, workstations, edge devices, and production GPU servers.
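Captioning is only one skill. The same local model object can also answer questions and locate objects; the sketch below follows the `query` and `detect` calls documented in the moondream SDK (the prompt and the "face" label are placeholders):

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

# Visual question answering: the result dict carries an "answer" key.
print(model.query(image, "How many people are in the image?")["answer"])

# Open-vocabulary detection: bounding boxes come back under "objects".
for obj in model.detect(image, "face")["objects"]:
    print(obj)
```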
Why This Release Matters
Production vision AI depends on more than model quality. It needs to:
- be fast enough for real applications.
- run on the hardware teams already use.
- work outside a single cloud or GPU environment.
- be simple enough to install and ship.
Photon 1.2.0 improves Moondream across all of those dimensions. It expands native hardware support, reduces setup complexity, improves single-request latency, and increases throughput on both new and existing GPUs.
That matters for applications like:
| Use case | What Photon improves |
|---|---|
| Interactive image apps | Faster answers from a single request |
| Production APIs | Higher request throughput |
| Robotics and inspection | Local inference without cloud round trips |
| Desktop tools | Native Mac and Windows support |
| Edge devices | Vision AI where network latency or privacy matters |
| Private workflows | Images can stay on-device |
Native Moondream Inference on Apple Silicon
Photon now runs on Apple M-series Macs starting with macOS 13 Ventura. No NVIDIA GPU is required: any developer can `pip install moondream` on the Mac they already use and run the same Moondream models locally that power our production cloud. Photon uses native Metal kernels across the decode path, including paged attention, rotary embeddings, KV cache management, MoE routing, sampling, and layer norm. KV cache sizing is automatically tuned to the Mac's unified memory.
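To give a feel for what "tuned to unified memory" means, here is the basic KV cache arithmetic. The layer count, head configuration, and memory budget below are placeholder values, not Moondream's architecture or Photon's actual heuristic:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer: 2 * layers * kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical transformer config (NOT Moondream's actual architecture).
per_token = kv_cache_bytes_per_token(n_layers=24, n_kv_heads=8, head_dim=64)

# Suppose the engine budgets 8 GB of unified memory for the KV cache.
budget_bytes = 8 * 1024**3
max_cached_tokens = budget_bytes // per_token
print(f"{per_token} bytes/token -> room for ~{max_cached_tokens:,} cached tokens")
```

The larger the unified memory, the more tokens the cache can hold before eviction, which is why cache sizing is worth tuning per machine.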
Reference performance on ChartQA, batch size 4, direct mode:
| Hardware | Moondream 2 | Moondream 3 |
|---|---|---|
| MacBook Pro, M5 Max, 48 GB | 7.26 requests/sec | 4.58 requests/sec |
| Mac mini, M2, 24 GB | 0.79 requests/sec | 0.55 requests/sec |
| Mac mini, M4 base, 16 GB | 0.84 requests/sec | — |
Apple Silicon support makes local Moondream development much more practical: demos, prototypes, desktop apps, and privacy-sensitive workflows can run directly on a Mac.
Native Windows Support
Photon now supports native Windows x86_64 inference. This is not a Linux wrapper. Photon's kernel-loading runtime has been rebuilt to support Windows directly, including MSVC compatibility, Windows DLL loading semantics, and cross-platform library naming across the kernel stack. Windows systems now run the same CUDA kernels as Linux x86_64.
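To illustrate what cross-platform library naming means in practice, here is the generic pattern such a loader has to handle (a sketch, not Photon's actual loader; `photon_kernels` is a made-up library stem):

```python
import platform

def shared_lib_filename(stem: str) -> str:
    # Each OS has its own shared-library naming convention.
    system = platform.system()
    if system == "Windows":
        return f"{stem}.dll"        # Windows DLL
    if system == "Darwin":
        return f"lib{stem}.dylib"   # macOS dynamic library
    return f"lib{stem}.so"          # Linux/other ELF shared object

# Hypothetical kernel library stem. A real Windows loader also has to
# respect DLL search-path semantics (e.g. os.add_dll_directory) before
# handing the name to ctypes.CDLL.
print(shared_lib_filename("photon_kernels"))
```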
Low-Latency and High-Throughput Inference on Blackwell
Photon 1.2.0 adds support for NVIDIA Blackwell, including B200 data-center GPUs and RTX PRO 6000 workstation GPUs. B200 is now the fastest hardware Photon supports:
| Hardware | Model | Single-request latency | Batch 64 throughput |
|---|---|---|---|
| NVIDIA B200 | Moondream 2 | ~23 ms | 93.61 requests/sec |
| NVIDIA B200 | Moondream 3 | ~30 ms | 71.27 requests/sec |
Single-request latency is derived from batch size 1 performance; it is the number that matters when an application needs an answer immediately. The batch 64 figure shows high-volume throughput, the number that matters when a system is serving many requests at once (the sketch after the next table makes the trade-off concrete). At batch size 64, B200 is:
| Model | Speedup vs. H100 |
|---|---|
| Moondream 2 | 1.49× faster |
| Moondream 3 | 1.23× faster |
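A quick back-of-the-envelope using the B200 Moondream 2 numbers above shows how the two figures relate (a sketch, not a benchmark methodology):

```python
# B200 / Moondream 2 figures from the tables above
single_request_latency_s = 0.023     # ~23 ms at batch size 1
batch64_throughput_rps = 93.61       # requests/sec at batch size 64

# Serving one request at a time caps throughput at 1/latency.
batch1_throughput_rps = 1 / single_request_latency_s              # ~43.5 req/s

# Batching roughly doubles aggregate throughput...
throughput_gain = batch64_throughput_rps / batch1_throughput_rps  # ~2.2x

# ...but each request in a batch of 64 now shares a ~0.68 s window.
batch_completion_s = 64 / batch64_throughput_rps

print(f"batch 1:  {batch1_throughput_rps:.1f} req/s, "
      f"{single_request_latency_s * 1000:.0f} ms per request")
print(f"batch 64: {batch64_throughput_rps} req/s, "
      f"batch completes in ~{batch_completion_s:.2f} s")
print(f"aggregate throughput gain from batching: ~{throughput_gain:.1f}x")
```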
Photon also supports RTX PRO 6000, which reaches 39.3 requests/sec on Moondream 2 and 39.7 requests/sec on Moondream 3 at batch size 64. Under the hood, this release includes Blackwell-specific MoE kernels and dedicated Blackwell flash-attention kernels for both decode and prefill. The practical result is lower latency for interactive workloads and higher throughput for production serving.
Edge Inference on Jetson Thor
Photon now also supports NVIDIA Jetson AGX Thor 64 GB on JetPack 7.
This brings Moondream to a new class of edge deployments: robotics, inspection systems, kiosks, vehicles, cameras, and embedded vision products where cloud inference may add latency, cost, or privacy concerns.
Reference performance:
| Hardware | Model | Single-request latency | Batch 64 throughput |
|---|---|---|---|
| Jetson AGX Thor | Moondream 2 | ~152 ms | 14.53 requests/sec |
| Jetson AGX Thor | Moondream 3 | ~147 ms | 12.05 requests/sec |
That means Moondream can run locally on Jetson Thor and return vision-language answers in well under a second.
Photon now also ships a multi-CUDA Linux aarch64 wheel. The same install works across Jetson Thor, Jetson Orin, and GH200 systems. Photon selects the correct CUDA build automatically: CUDA 13 for Thor on JetPack 7, CUDA 12 for Jetson Orin and GH200 systems on JetPack 6.
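This kind of selection amounts to asking the CUDA runtime for its version and loading the matching build. A generic sketch of the pattern (`cudaRuntimeGetVersion` is the real CUDA runtime API call; the build names and dispatch are illustrative, not Photon's actual mechanism):

```python
import ctypes

def cuda_runtime_version() -> int:
    # cudaRuntimeGetVersion reports e.g. 13000 for CUDA 13.0, 12060 for 12.6.
    # The exact soname/path of libcudart varies by install.
    cudart = ctypes.CDLL("libcudart.so")
    version = ctypes.c_int()
    status = cudart.cudaRuntimeGetVersion(ctypes.byref(version))
    if status != 0:
        raise RuntimeError(f"cudaRuntimeGetVersion failed: {status}")
    return version.value

# Illustrative dispatch on the CUDA major version; the "cu12"/"cu13"
# build names are made up, not Photon's internal layout.
major = cuda_runtime_version() // 1000
build = "cu13" if major >= 13 else "cu12"
print(f"CUDA {major} detected -> selecting the {build} kernel build")
```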
Faster on Existing NVIDIA GPUs
Photon 1.2.0 also improves performance on existing NVIDIA hardware, including L40S, RTX 4090, Jetson Orin, A100, A10/A10G, L4, and RTX 6000.
The main improvements are:
| Improvement | Impact |
|---|---|
| Faster FP8 prefill on Ada and Jetson Orin | Better performance for FP8 KV cache deployments |
| New native paged flash-attention kernels | Faster prefill and decode paths |
| Faster MoE inference | Better Moondream 3 performance across GPUs |
| Lower per-call dispatch overhead | Faster batch 1 and small-batch inference |
| More consistent tail latency | More predictable application performance |
Small-batch performance is especially important in real applications. When a user asks a question about an image, batch size 1 latency determines how fast the answer comes back. Photon 1.2.0 reduces overhead in that path while also improving throughput for larger production batches.
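To see the batch 1 improvement on your own hardware, timing a single call is enough. A minimal sketch reusing the quickstart setup from above (the warmup call is a standard benchmarking convention, not a Photon requirement):

```python
import time

import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

model.caption(image)  # warmup: the first call pays one-time initialization costs

start = time.perf_counter()
model.caption(image)  # a single request, i.e. batch size 1
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"batch 1 latency: {elapsed_ms:.1f} ms")
```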
Conclusion
Photon 1.2.0 expands where Moondream can be deployed and improves how fast it responds. Full benchmark details, including additional batch sizes and chain-of-thought mode results, are available in PERFORMANCE.md.
With Moondream, you don't have to compromise. You get sophisticated visual reasoning at near-real-time speeds, and it runs everywhere. Got a production-level vision challenge? Contact us; we'd love to talk.