Moondream's mission is simple: production vision AI that runs everywhere. Most cloud VLMs take seconds to respond, which is too slow for interactive applications and for systems that run on-device or at the edge. So we built the full stack ourselves: our own models, a fine-tuning service (Lens), and an inference engine (Photon). Today's Photon update, shipping in the Moondream 1.2.0 release, makes it faster still.
Getting Started
To install:
```bash
pip install moondream
```

Then run locally by setting `local=True`:

```python
import moondream as md
from PIL import Image

# local=True runs inference on this machine via Photon
model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")
print(model.caption(image)["caption"])
```
That `local=True` flag is the important part. It tells Moondream to run inference on your machine using Photon instead of sending the request to the hosted API. With Photon 1.2.0, local Moondream inference now supports:
| Platform | What's new |
|---|---|
| Apple Silicon | Native inference on M-series Macs |
| Windows x86_64 | Native CUDA inference (no WSL or Linux containers required) |
| NVIDIA Blackwell | Support for B200 and RTX PRO 6000 |
| NVIDIA Jetson Thor | Edge inference on JetPack 7 / CUDA 13 |
| Existing NVIDIA GPUs | Faster prefill, MoE, dispatch, and tail latency |
The result: Moondream is now easier to deploy across laptops, workstations, edge devices, and production GPU servers.
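Captioning is only one skill. The same local model object can also answer questions and locate objects; the sketch below follows the `query` and `detect` calls documented in the moondream SDK (the prompt and the "face" label are placeholders):

```python
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

# Visual question answering: the result dict carries an "answer" key.
print(model.query(image, "How many people are in the image?")["answer"])

# Open-vocabulary detection: bounding boxes come back under "objects".
for obj in model.detect(image, "face")["objects"]:
    print(obj)
```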
Why This Release Matters
Production vision AI depends on more than model quality. It needs to:
- be fast enough for real applications.
- run on the hardware teams already use.
- work outside a single cloud or GPU environment.
- be simple enough to install and ship.
Photon 1.2.0 improves Moondream across all of those dimensions. It expands native hardware support, reduces setup complexity, improves single-request latency, and increases throughput on both new and existing GPUs.
That matters for applications like:
| Use case | What Photon improves |
|---|---|
| Interactive image apps | Faster answers from a single request |
| Production APIs | Higher request throughput |
| Robotics and inspection | Local inference without cloud round trips |
| Desktop tools | Native Mac and Windows support |
| Edge devices | Vision AI where network latency or privacy matters |
| Private workflows | Images can stay on-device |
Native Moondream Inference on Apple Silicon
Photon now runs on Apple M-series Macs starting with macOS 13 Ventura. No NVIDIA GPU is required: any developer can `pip install moondream` on the Mac they already use and run the same Moondream models locally that power our production cloud. Photon uses native Metal kernels across the decode path, including paged attention, rotary embeddings, KV cache management, MoE routing, sampling, and layer norm. KV cache sizing is automatically tuned to the Mac's unified memory.
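To give a feel for what "tuned to unified memory" means, here is the basic KV cache arithmetic. The layer count, head configuration, and memory budget below are placeholder values, not Moondream's architecture or Photon's actual heuristic:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer: 2 * layers * kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical transformer config (NOT Moondream's actual architecture).
per_token = kv_cache_bytes_per_token(n_layers=24, n_kv_heads=8, head_dim=64)

# Suppose the engine budgets 8 GB of unified memory for the KV cache.
budget_bytes = 8 * 1024**3
max_cached_tokens = budget_bytes // per_token
print(f"{per_token} bytes/token -> room for ~{max_cached_tokens:,} cached tokens")
```

The larger the unified memory, the more tokens the cache can hold before eviction, which is why cache sizing is worth tuning per machine.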
Reference performance on ChartQA, batch size 4, direct mode:
| Hardware | Moondream 2 | Moondream 3 |
|---|---|---|
| MacBook Pro, M5 Max, 48 GB | 7.26 requests/sec | 4.58 requests/sec |
| Mac mini, M2, 24 GB | 0.79 requests/sec | 0.55 requests/sec |
| Mac mini, M4 base, 16 GB | 0.84 requests/sec | — |
Apple Silicon support makes local Moondream development much more practical: demos, prototypes, desktop apps, and privacy-sensitive workflows can run directly on a Mac.
Native Windows Support
Photon now supports native Windows x86_64 inference. This is not a Linux wrapper. Photon's kernel-loading runtime has been rebuilt to support Windows directly, including MSVC compatibility, Windows DLL loading semantics, and cross-platform library naming across the kernel stack. Windows systems now run the same CUDA kernels as Linux x86_64.
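To illustrate what cross-platform library naming means in practice, here is the generic pattern such a loader has to handle (a sketch, not Photon's actual loader; `photon_kernels` is a made-up library stem):

```python
import platform

def shared_lib_filename(stem: str) -> str:
    # Each OS has its own shared-library naming convention.
    system = platform.system()
    if system == "Windows":
        return f"{stem}.dll"        # Windows DLL
    if system == "Darwin":
        return f"lib{stem}.dylib"   # macOS dynamic library
    return f"lib{stem}.so"          # Linux/other ELF shared object

# Hypothetical kernel library stem. A real Windows loader also has to
# respect DLL search-path semantics (e.g. os.add_dll_directory) before
# handing the name to ctypes.CDLL.
print(shared_lib_filename("photon_kernels"))
```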
Low-Latency and High-Throughput Inference on Blackwell
Photon 1.2.0 adds support for NVIDIA Blackwell, including B200 data-center GPUs and RTX PRO 6000 workstation GPUs. B200 is now the fastest hardware Photon supports:
| Hardware | Model | Single-request latency | Batch 64 throughput |
|---|---|---|---|
| NVIDIA B200 | Moondream 2 | ~23 ms | 93.61 requests/sec |
| NVIDIA B200 | Moondream 3 | ~30 ms | 71.27 requests/sec |
Single-request latency is derived from batch size 1 performance; it is the number that matters when an application needs an answer immediately. The batch 64 figure shows high-volume throughput, the number that matters when a system is serving many requests at once (the sketch after the next table makes the trade-off concrete). At batch size 64, B200 is:
| Model | Speedup vs. H100 |
|---|---|
| Moondream 2 | 1.49× faster |
| Moondream 3 | 1.23× faster |
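A quick back-of-the-envelope using the B200 Moondream 2 numbers above shows how the two figures relate (a sketch, not a benchmark methodology):

```python
# B200 / Moondream 2 figures from the tables above
single_request_latency_s = 0.023     # ~23 ms at batch size 1
batch64_throughput_rps = 93.61       # requests/sec at batch size 64

# Serving one request at a time caps throughput at 1/latency.
batch1_throughput_rps = 1 / single_request_latency_s              # ~43.5 req/s

# Batching roughly doubles aggregate throughput...
throughput_gain = batch64_throughput_rps / batch1_throughput_rps  # ~2.2x

# ...but each request in a batch of 64 now shares a ~0.68 s window.
batch_completion_s = 64 / batch64_throughput_rps

print(f"batch 1:  {batch1_throughput_rps:.1f} req/s, "
      f"{single_request_latency_s * 1000:.0f} ms per request")
print(f"batch 64: {batch64_throughput_rps} req/s, "
      f"batch completes in ~{batch_completion_s:.2f} s")
print(f"aggregate throughput gain from batching: ~{throughput_gain:.1f}x")
```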
Photon also supports RTX PRO 6000, which reaches 39.3 requests/sec on Moondream 2 and 39.7 requests/sec on Moondream 3 at batch size 64. Under the hood, this release includes Blackwell-specific MoE kernels and dedicated Blackwell flash-attention kernels for both decode and prefill. The practical result is lower latency for interactive workloads and higher throughput for production serving.
Edge Inference on Jetson Thor
Photon now also supports NVIDIA Jetson AGX Thor 64 GB on JetPack 7.
This brings Moondream to a new class of edge deployments: robotics, inspection systems, kiosks, vehicles, cameras, and embedded vision products where cloud inference may add latency, cost, or privacy concerns.
Reference performance:
| Hardware | Model | Single-request latency | Batch 64 throughput |
|---|---|---|---|
| Jetson AGX Thor | Moondream 2 | ~152 ms | 14.53 requests/sec |
| Jetson AGX Thor | Moondream 3 | ~147 ms | 12.05 requests/sec |
That means Moondream can run locally on Jetson Thor and return vision-language answers in well under a second.
Photon now also ships a multi-CUDA Linux aarch64 wheel. The same install works across Jetson Thor, Jetson Orin, and GH200 systems. Photon selects the correct CUDA build automatically: CUDA 13 for Thor on JetPack 7, CUDA 12 for Jetson Orin and GH200 systems on JetPack 6.
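This kind of selection amounts to asking the CUDA runtime for its version and loading the matching build. A generic sketch of the pattern (`cudaRuntimeGetVersion` is the real CUDA runtime API call; the build names and dispatch are illustrative, not Photon's actual mechanism):

```python
import ctypes

def cuda_runtime_version() -> int:
    # cudaRuntimeGetVersion reports e.g. 13000 for CUDA 13.0, 12060 for 12.6.
    # The exact soname/path of libcudart varies by install.
    cudart = ctypes.CDLL("libcudart.so")
    version = ctypes.c_int()
    status = cudart.cudaRuntimeGetVersion(ctypes.byref(version))
    if status != 0:
        raise RuntimeError(f"cudaRuntimeGetVersion failed: {status}")
    return version.value

# Illustrative dispatch on the CUDA major version; the "cu12"/"cu13"
# build names are made up, not Photon's internal layout.
major = cuda_runtime_version() // 1000
build = "cu13" if major >= 13 else "cu12"
print(f"CUDA {major} detected -> selecting the {build} kernel build")
```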
Faster on Existing NVIDIA GPUs
Photon 1.2.0 also improves performance on existing NVIDIA hardware, including L40S, RTX 4090, Jetson Orin, A100, A10/A10G, L4, and RTX 6000.
The main improvements are:
| Improvement | Impact |
|---|---|
| Faster FP8 prefill on Ada and Jetson Orin | Better performance for FP8 KV cache deployments |
| New native paged flash-attention kernels | Faster prefill and decode paths |
| Faster MoE inference | Better Moondream 3 performance across GPUs |
| Lower per-call dispatch overhead | Faster batch 1 and small-batch inference |
| More consistent tail latency | More predictable application performance |
Small-batch performance is especially important in real applications. When a user asks a question about an image, batch size 1 latency determines how fast the answer comes back. Photon 1.2.0 reduces overhead in that path while also improving throughput for larger production batches.
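To see the batch 1 improvement on your own hardware, timing a single call is enough. A minimal sketch reusing the quickstart setup from above (the warmup call is a standard benchmarking convention, not a Photon requirement):

```python
import time

import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

model.caption(image)  # warmup: the first call pays one-time initialization costs

start = time.perf_counter()
model.caption(image)  # a single request, i.e. batch size 1
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"batch 1 latency: {elapsed_ms:.1f} ms")
```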
Conclusion
Photon 1.2.0 expands where Moondream can be deployed and improves how fast it responds. Full benchmark details, including additional batch sizes and chain-of-thought mode results, are available in PERFORMANCE.md.
With Moondream, you don't have to compromise. You get sophisticated visual reasoning at near-real-time speeds, and it runs everywhere. Got a production-level vision challenge? Contact us; we'd love to talk.