Popping the GPU Bubble

How do you make an AI model run as fast as possible? This is a question we obsess over at Moondream HQ. The GPU handles all the math involved in model inference, so at first glance it doesn't seem like there's much to it: just tell it what to do and wait for the answer. But if you start looking at how it actually works under the hood, you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.

When a typical AI model generates text, it produces one token at a time (a token is a chunk of text, roughly a few characters). Each token depends on the tokens before it, a property called autoregressive, so generation is sequential. You can't compute the third token before you have the second. This decode loop involves a round trip between the CPU and GPU. The GPU does most of the heavy lifting to run the actual model, performing billions of arithmetic operations to produce the next token. But there's also a surprising amount of work done by the CPU. It selects which requests to run next, sets up the metadata the GPU needs for them, picks the actual token out of the model's output and records it, and more.

The challenge is that one token's worth of GPU work is small, while the CPU housekeeping is a fixed cost paid on every trip. If the GPU has to wait for that housekeeping before it can start the next token, it sits idle for part of every loop. This is why we get GPU bubbles.

In this post we're going to dive into how Photon hides these bubbles using a technique called pipelined decoding. The idea is to overlap the two kinds of work: we start GPU work on the next token while the CPU is still finishing the last one.

The bubble

Here's the shape of the problem.

Blocking vs pipelined decode timelines

In the blocking version (top), every step is a baton pass. The CPU plans and launches a forward, the GPU runs it, then the CPU synchronizes, waits for the results to land, commits them, and only then starts planning the next step. This is because the plan depends on the token we select. For example, if the model indicates it has finished answering, then we need to schedule a new pending request from our queue. The GPU sits idle waiting for the CPU to finish its commit-plan-launch work.

The fix is to pipeline the loop. Launch the next forward while the current step's token is still coming back and being committed. That's the pipelined version (bottom): the forwards run back-to-back, and the CPU work is overlapped underneath them.

The reason we can is that the token we just sampled doesn't have to leave the GPU. The next forward reads it straight from GPU memory as its input. We still want a copy on the CPU eventually, to detokenize it, stream it, and decide whether the request is done, but that is bookkeeping we can do a moment later, in the background, while the next forward already runs. Not waiting on that copy is the move that removes the bubble.

Making it safe requires three things, that we cover in the rest of this post: keeping step buffers from colliding (ping-pong slots), getting the sampling order right for constrained decoding (forward now, sample later), and cleaning up after a request finishes (zombies).

Mechanism 1: ping-pong slots

To run a decode step, the GPU needs a working set of buffers: a place to stage the input (the last generated token and its position in the sequence), a place for the model to write its output (the logits, one score per word in the vocabulary), a place to land the sampled token, and some bookkeeping the attention kernel needs to find each sequence's cached keys and values (its KV cache). We keep pinned (page-locked) host buffers on both ends, so the copies on and off the GPU run as background DMA (direct memory access) transfers instead of blocking the CPU.

These buffers are allocated once and reused on every step. We work hard to avoid performing GPU memory allocations at runtime, because they can cause device synchronization and introduce bubbles. Fixed buffer addresses are also needed for capturing the decode step once as a CUDA graph and replaying it, reducing kernel launch overhead. We call this bundle a DecodeSlot.

This works, but introduces a blocker for pipelining. The buffers stay in use until the step is done, so we cannot start the next step until the current one finishes. To overlap two steps, the second step needs its own working set, otherwise it can overwrite the results of the first step before the CPU has read them. So we keep two slots and alternate between them, ping-pong style.

Ping-pong slots

One thing to note about launch: we don't execute kernels the instant we issue a launch from CPU. Instead, we enqueue them onto a stream -- an ordered queue that the GPU drains in order. Work on the same stream runs sequentially, while work on separate streams can overlap. Both slots put their forwards onto the same compute stream. The slots are not for GPU parallelism. They only exist so the CPU can process one slot's results while the GPU runs the other slot's forward.

The forwards all share that one compute stream, but the copies do not. Each step's device-to-host copy, the one that brings the sampled token back for bookkeeping, goes on a separate copy stream, so it can run while the GPU is busy with the next forward. That is what lets us not wait for it. We anchor the copy to an event recorded the instant the step's outputs are written, so it waits on exactly that step's work and nothing queued behind it.

The copy runs in the background

A slot only becomes free once its results have been read, not just once the GPU is done with it. Its pinned host buffer is the landing site for a copy that may still be in flight, so handing the slot to a new step too early would overwrite a copy mid-transfer, creating a hard-to-debug corruption bug. So the slot stays reserved through the commit that reads it, and is released only once that commit has finished.

Mechanism 2: forward now, sample later

The next forward can run ahead because it doesn't depend on anything the CPU does with the last token. But two things about the next step do depend on the last step's committed result. One is which sequences are still in the batch: if a request just finished, it shouldn't be in the next forward. That is the next section (zombies). The other is what tokens the next step is even allowed to sample, and that one is this section.

It comes from constrained decoding. Moondream's spatial skills return structured output instead of free text: point returns a coordinate, detect returns boxes, segment returns an outline. We get those from the same decode loop by restricting which tokens the model may produce at each step: we force the scores (the logits) of the disallowed ones to negative infinity before we sample. A point step has to emit a coordinate, a detect request walks an x, y, size cycle, and so on. Which tokens are allowed, the mask, depends on what has been produced so far, so the mask for step t+1 depends on the token we sampled at t.

The dependency is in sampling, not in the forward.

The forward needs no mask; only sampling does

Each scheduler tick goes through three phases: launch, commit, and finalize:

Launch the forward for t+1. It doesn't depend on the mask, so it goes immediately.
Commit step t: wait on the in-flight copy and advance the request's decode state. That is needed to decide the mask for t+1.
Finalize sampling for t+1: with the state current, build the mask and sample.

Sampling t+1 lands after committing t because the commit is what makes t+1's mask correct. We call this "commit-before-finalize" ordering. The GPU runs the t+1 forward through steps 2 and 3, so the commit disappears from the critical path.

For plain text there is no mask, so forward and sampling can both run a step ahead. For constrained sequences the forward still runs ahead, but sampling waits on the previous commit, which caps how far ahead we get with no special-casing. One loop handles both.

Mechanism 3: zombies: finalize early, release late

Back in forward now, sample later we flagged two ways the next step depends on the last step's committed result. The sampling mask was one. Batch membership is the other, and it takes a bit of care to handle right.

To launch step t+1 we first decide its batch, which sequences are in it, and we do that before committing step t. So what happens when a sequence hits its stop token at t, but is already baked into t+1's forward? You can't un-launch GPU work. The sequence is finished, yet still physically present in a batch that's executing.

Photon calls these zombies, and instead of bolting on cancellation logic, it lets the behavior emerge from two per-sequence fields:

finalized: True after the sequence has hit EOS or its length cap.
inflight_refs: the number of in-flight steps that still reference this sequence (0, 1, or 2).

A finished sequence rides step t+1 as a zombie

When step t commits and detects EOS, the sequence is marked finalized and its result is emitted — but it isn't torn down, because inflight_refs is still nonzero (step t+1 references it). At step t+1's commit, the sequence is already finalized, so the commit is skipped: no token is appended, no state mutates. The zombie was harmlessly along for the ride — it occupied its slot and wrote some KV that nobody will read. Only when inflight_refs finally hits 0 are its KV pages and LoRA slot released.

This finalize-early, release-late dance is a small amount of refcounting that replaces what would otherwise be a thicket of "cancel this row mid-flight" special cases.

Prefill rides the same pipeline

So far this has all been about decode steps, but a real serving loop is constantly doing two different kinds of work: prefill (processing a new request's prompt + image, the expensive one-shot forward over many tokens) and decode (one token at a time for everyone already running).

Photon doesn't separate them. A prefill is just another kind="prefill" launch in the same two-slot pipeline. Because the pipeline only cares that a slot is free, not what kind of work last used it, a prefill forward can be launched into one slot while a decode step from the other slot is still being committed, and vice versa. The expensive prefill forward runs on the GPU while the CPU commits decode results; the next decode forward runs while the CPU finishes admitting the just-prefilled request. The same commit ordering (and the same inflight_refs bookkeeping) keeps everything correct across the two kinds, so none of the zombie or constrained-decode logic needs a special case for "what if a prefill is in flight."

This matters most when outputs are short. A request that emits three tokens spends almost all of its life in prefill and admission, not decode, so a workload of many short requests is really a stream of prefills with a little decode sprinkled in. Sharing one pipeline is what lets that stream overlap its own CPU bookkeeping instead of serializing prefill behind decode and back again.

A cost model for the bubble

How much should pipelining actually buy you? You can predict it from the parts of a decode step, and then check the prediction against measurement.

A decode step is three pieces of work:

forward: the heavy GPU matmuls. At decode this is memory-bandwidth bound: every token streams the whole weight set through the cores, so it has a floor near weight_bytes / memory_bandwidth. It shrinks as memory gets faster or as the model gets smaller.
sampling: turning the scores into a committed token: the constrained-decode mask, the argmax/sample, the spatial (grounding) decode, and the device→host copy of the result. All GPU work.
bookkeeping: the CPU around it. Choose the next batch (plan), launch the graph (launch), commit the previous step (commit).

A blocking loop runs the three in series, so the GPU sits idle through the bookkeeping — that idle is the bubble. Pipelining slides the bookkeeping of one step underneath the forward + sampling of the next, so the period collapses toward forward + sampling and the bubble disappears. Measured per step, pipelined, that's exactly what we see — the GPU is busy for essentially the whole period (steady-state medians, moondream2, ms):

	forward (ms)	sampling (ms)	period (ms)
3090 · 1 stream	4.87	0.20	5.10
8 streams	6.66	0.27	6.97
32 streams	10.24	0.26	10.52
B200 · 1 stream	2.45	0.14	2.63
8 streams	3.12	0.14	3.30
32 streams	3.80	0.14	3.98

forward + sampling ≈ period; the leftover GPU idle is under 0.05 ms. So what was hiding it worth? It comes down to a tug-of-war between two things — how much of a step you manage to tuck away, against a small penalty for running ahead:

speedup  =   T_block / T_pipe    ×      (1 − z)
            └─ bubble hidden ─┘     └─ zombie tax ─┘

Two symbols, two ideas. The first term is the win, and it's the whole GPU-speed story: how long a step takes blocking (T_block) over how long it takes pipelined (T_pipe) — i.e. how much faster the step runs once the bookkeeping is tucked underneath it.

The second, z, is the price of running ahead — the zombie tax from Mechanism 3. Launch step t+1 before committing t, and a sequence that just finished still has a forward in flight: a wasted step. On a single stream that's one wasted forward for every L tokens the request generated, so about 1% at L ≈ 110. Pack a batch, though, and it nearly vanishes — the zombie is just one more row in a step that's already paying full price to stream the weights, so it rides along almost free. The tax bites hardest at one stream and fades exactly where throughput lives, which is why predicting it needs both L and the batch size.

Here's that step, measured both ways — blocking idles each step while the CPU commits the last token and re-launches; pipelining runs that work (and the async mask upload) underneath the forward, so the forwards never stop:

Blocking vs pipelined decode, measured per-step on a B200

Now put real numbers in it. Measure each piece on its own — the two step times and L — and the model's prediction should land on what the benchmark actually delivers (depth-1 blocking vs depth-2 pipelined, nothing else changed):

	blocking (ms)	pipelined (ms)	L	predicted	observed
3090 · 1 stream	5.44	5.10	104	+5.7%	+6.5%
8 streams	7.52	6.97	113	+7.6%	+7.8%
32 streams	11.74	10.52	113	+11.1%	+11.6%
B200 · 1 stream	3.11	2.63	115	+17.2%	+17.6%
8 streams	4.04	3.30	115	+22.2%	+21.9%
32 streams	5.55	3.98	104	+39.1%	+35.4%

Three things to read out of it:

The win grows with GPU speed. Same workload, +12% on a 3090 but +35% on a B200 at 32 streams. The bookkeeping is GPU-speed-independent, so as the forward shrinks — faster memory, or a smaller model — the bubble is a bigger share of the step. Pipelining is insurance against the GPU getting faster, which for us is the same thing as the model getting smaller.
The zombie tax is real but small, and it amortizes. At one stream the zombie is a whole wasted forward — about 1% at L≈110. At batch it's one extra row in a step that's memory-bound on the weights, not the row count, so it costs almost nothing: at 32 streams the 3090's observed +11.6% lands right on the no-zombie per-step ratio. The tax bites at a single stream and fades exactly where throughput lives. (The B200's 32-stream row sits a few points under prediction for a duller reason — at ~4 ms/step the whole run is under half a second, so prefill and the end-of-run batch ramp-down are a visible slice of the wall.)
It only pays once the bubble is actually hideable. (This is how we caught a bug, in fact: the pipelined numbers came out at blocking speed, traced to an accidental synchronous copy while building the constrained-decode mask. Moving it to the copy stream was worth +11% on the 3090 and +34% on the B200.)

It's never just one thing

That's the whole technique: ping-pong slots so two steps don't collide, a forward/sampling split so even constrained decoding can run ahead, and a little zombie refcounting so finished requests tear down cleanly. The GPU stops waiting on the CPU, and you get back anywhere from a few percent to a third; more the faster your accelerator/model is.

But Photon isn't fast because of this one technique, or any single technique. It's fast because dozens of these details compound across the serving stack: how we resize and tile images on the way in, the kernels that run the model, the scheduler ordering here, and the synchronization points we remove from the hot path. No one piece is the whole story; the stack gets fast when enough of them line up.

We'll keep writing these up, one corner of the stack at a time. Follow us on Twitter so you don't miss the next one. And keep an eye out for Photon 2.0, coming soon: we can't share details yet, but it's a big one.

Moondream Engineering