Photon 1.2.0: Faster Inference, Now on Mac, Windows, Blackwell, and Jetson Thor

Production vision AI that runs everywhere — now faster, on more hardware.

May 1, 2026

Moondream's mission is simple: production vision AI that runs everywhere. Most cloud VLMs take seconds to respond, which doesn't work for systems that need to be fast and often run on-device or at the edge. So we built the full stack ourselves: our own models, a fine-tune service (Lens), and an inference engine (Photon). Today's Photon update in the Moondream 1.2.0 release makes it faster still.

Getting Started

To install:

pip install moondream

Then run locally by setting local=True:

import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY", local=True)
image = Image.open("photo.jpg")

print(model.caption(image)["caption"])

That local=True flag is the important part. It tells Moondream to run inference on your machine using Photon instead of sending the request to the hosted API. With Photon 1.2.0, local Moondream inference now supports:

| Platform | What's new |
| --- | --- |
| Apple Silicon | Native inference on M-series Macs |
| Windows x86_64 | Native CUDA inference (no WSL required) or Linux containers |
| NVIDIA Blackwell | Support for B200 and RTX PRO 6000 |
| NVIDIA Jetson Thor | Edge inference on JetPack 7 / CUDA 13 |
| Existing NVIDIA GPUs | Faster prefill, MoE, dispatch, and tail latency |

The result: Moondream is now easier to deploy across laptops, workstations, edge devices, and production GPU servers.

Why This Release Matters

Production vision AI depends on more than model quality. It needs to:

  • be fast enough for real applications.
  • run on the hardware teams already use.
  • work outside a single cloud or GPU environment.
  • be simple enough to install and ship.

Photon 1.2.0 improves Moondream across all of those dimensions. It expands native hardware support, reduces setup complexity, improves single-request latency, and increases throughput on both new and existing GPUs.

That matters for applications like:

| Use case | What Photon improves |
| --- | --- |
| Interactive image apps | Faster answers from a single request |
| Production APIs | Higher request throughput |
| Robotics and inspection | Local inference without cloud round trips |
| Desktop tools | Native Mac and Windows support |
| Edge devices | Vision AI where network latency or privacy matters |
| Private workflows | Images can stay on-device |

Native Moondream Inference on Apple Silicon

Photon now runs on Apple M-series Macs starting with macOS 13 Ventura. No NVIDIA GPU required — any developer can pip install moondream on the Mac they already use and run the same Moondream models locally that power our production cloud. Photon uses native Metal kernels across the decode path, including paged attention, rotary embeddings, KV cache management, MoE routing, sampling, and layer norm. KV cache sizing is automatically tuned to the Mac's unified memory.

Reference performance on ChartQA, batch size 4, direct mode:

| Hardware | Moondream 2 | Moondream 3 |
| --- | --- | --- |
| MacBook Pro, M5 Max, 48 GB | 7.26 requests/sec | 4.58 requests/sec |
| Mac mini, M2, 24 GB | 0.79 requests/sec | 0.55 requests/sec |
| Mac mini, M4 base, 16 GB | 0.84 requests/sec | |
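The reference numbers above come from Moondream's own benchmark harness. If you want a rough sanity check on your own Mac, timing repeated calls is enough for a first estimate. This is a stdlib-only sketch; `measure_throughput` and the call counts are our own names, and a sequential loop will not reproduce the batch-4 harness figures exactly.

```python
import time

def measure_throughput(fn, n=20):
    """Time n sequential calls to fn and return requests/sec.

    A rough, sequential estimate; the reference table above was
    produced at batch size 4, so expect somewhat different numbers.
    """
    start = time.perf_counter()
    for _ in range(n):
        fn()
    elapsed = time.perf_counter() - start
    return n / elapsed

# Example with the snippet from Getting Started:
# model = md.vl(api_key="YOUR_API_KEY", local=True)
# image = Image.open("photo.jpg")
# print(measure_throughput(lambda: model.caption(image), n=10))
```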

Apple Silicon support makes local Moondream development much more practical: demos, prototypes, desktop apps, and privacy-sensitive workflows can run directly on a Mac.

Native Windows Support

Photon now supports native Windows x86_64 inference. This is not a Linux wrapper. Photon's kernel-loading runtime has been rebuilt to support Windows directly, including MSVC compatibility, Windows DLL loading semantics, and cross-platform library naming across the kernel stack. Windows systems now run the same CUDA kernels as Linux x86_64.

Low-Latency and High-Throughput Inference on Blackwell

Photon 1.2.0 adds support for NVIDIA Blackwell, including B200 data-center GPUs and RTX PRO 6000 workstation GPUs. B200 is now the fastest hardware Photon supports:

| Hardware | Model | Single-request latency | Batch 64 throughput |
| --- | --- | --- | --- |
| NVIDIA B200 | Moondream 2 | ~23 ms | 93.61 requests/sec |
| NVIDIA B200 | Moondream 3 | ~30 ms | 71.27 requests/sec |

Single-request latency is derived from batch size 1 performance; it is the number that matters when an application needs an answer immediately. The batch 64 figure shows high-volume throughput, which matters when a system is serving many requests at once. At batch size 64, B200 is:

| Model | Speedup vs. H100 |
| --- | --- |
| Moondream 2 | 1.49× faster |
| Moondream 3 | 1.23× faster |
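As a worked check on how the two kinds of numbers relate (our arithmetic, not additional benchmark data): a ~23 ms single-request latency implies roughly 43 requests/sec when serving one request at a time, so the 93.61 requests/sec batch-64 figure corresponds to about a 2.15× gain from batching on B200 with Moondream 2.

```python
# Back-of-the-envelope arithmetic from the B200 / Moondream 2 figures above.
single_request_latency_s = 0.023                  # ~23 ms at batch size 1
batch1_throughput = 1 / single_request_latency_s  # ≈ 43.5 requests/sec
batch64_throughput = 93.61                        # reported batch 64 figure

batching_gain = batch64_throughput / batch1_throughput
print(f"batch-1 ≈ {batch1_throughput:.1f} req/s, batching gain ≈ {batching_gain:.2f}x")
```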

Photon also supports RTX PRO 6000, which reaches 39.3 requests/sec on Moondream 2 and 39.7 requests/sec on Moondream 3 at batch size 64. Under the hood, this release includes Blackwell-specific MoE kernels and dedicated Blackwell flash-attention kernels for both decode and prefill. The practical result is lower latency for interactive workloads and higher throughput for production serving.

Edge Inference on Jetson Thor

Photon now also supports NVIDIA Jetson AGX Thor 64 GB on JetPack 7.

This brings Moondream to a new class of edge deployments: robotics, inspection systems, kiosks, vehicles, cameras, and embedded vision products where cloud inference may add latency, cost, or privacy concerns.

Reference performance:

| Hardware | Model | Single-request latency | Batch 64 throughput |
| --- | --- | --- | --- |
| Jetson AGX Thor | Moondream 2 | ~152 ms | 14.53 requests/sec |
| Jetson AGX Thor | Moondream 3 | ~147 ms | 12.05 requests/sec |

That means Moondream can run locally on Jetson Thor and return vision-language answers in well under a second.

Photon also now ships a multi-CUDA Linux aarch64 wheel. The same install works across Jetson Thor, Jetson Orin, and GH200 systems. Photon selects the correct CUDA build automatically: CUDA 13 for Thor on JetPack 7, CUDA 12 for Jetson Orin and GH200 systems on JetPack 6.
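Photon performs this selection internally; the sketch below only illustrates the shape of the decision, written from the JetPack/CUDA pairings stated above. The function name and string labels are ours, not Photon's internals.

```python
def pick_cuda_build(cuda_major: int) -> str:
    """Illustrative only: mirrors the pairing described above
    (CUDA 13 for Jetson Thor on JetPack 7, CUDA 12 for Jetson Orin
    and GH200 on JetPack 6). Photon does this automatically."""
    if cuda_major >= 13:
        return "cuda13"  # Jetson Thor, JetPack 7
    return "cuda12"      # Jetson Orin and GH200, JetPack 6

print(pick_cuda_build(13))  # → cuda13
```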

Faster on Existing NVIDIA GPUs

Photon 1.2.0 also improves performance on existing NVIDIA hardware, including L40S, RTX 4090, Jetson Orin, A100, A10/A10G, L4, and RTX 6000.

The main improvements are:

| Improvement | Impact |
| --- | --- |
| Faster FP8 prefill on Ada and Jetson Orin | Better performance for FP8 KV cache deployments |
| New native paged flash-attention kernels | Faster prefill and decode paths |
| Faster MoE inference | Better Moondream 3 performance across GPUs |
| Lower per-call dispatch overhead | Faster batch 1 and small-batch inference |
| More consistent tail latency | More predictable application performance |

Small-batch performance is especially important in real applications. When a user asks a question about an image, batch size 1 latency determines how fast the answer comes back. Photon 1.2.0 reduces overhead in that path while also improving throughput for larger production batches.
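Since the release calls out more consistent tail latency, it is worth evaluating with percentiles rather than averages: p99 captures the slow requests users actually feel. A stdlib-only sketch (the helper name is ours, not part of the moondream package):

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p99) from a list of per-request latencies in ms.

    The median can look fine while the p99 tail diverges sharply,
    which is exactly what tail-latency work targets.
    """
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return cuts[49], cuts[98]                       # p50, p99

# Mostly-fast workload with a few slow outliers:
samples = [20.0] * 97 + [90.0, 95.0, 120.0]
p50, p99 = latency_percentiles(samples)
print(f"p50={p50:.0f} ms, p99={p99:.0f} ms")
```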

Conclusion

Photon 1.2.0 expands where Moondream can be deployed and improves how fast it responds. Full benchmark details, including additional batch sizes and chain-of-thought mode results, are available in PERFORMANCE.md.

With Moondream, you don't have to compromise. You can get sophisticated visual reasoning at near-realtime speeds, and it runs everywhere. Got a production-level vision challenge? Contact us; we'd love to talk.