How Claude Code Fast Mode Works
Toggle /fast and Claude answers about 2.5× quicker — for 2 to 6× the price. Same model, same answer. Here's the mechanism, and when it's worth turning on.
Fast mode isn't a smaller or dumber model. Anthropic is explicit: "Fast mode is not a different model." Same weights, same intelligence, the same answer it would have given anyway. The only thing that changes is how quickly the tokens come out — and what you pay for them.
To see why those two move together, you need to know how Claude answers a question at all.
Two phases: prefill and decode
Every response has two phases with completely different physics.
Prefill reads everything you sent — your message, the files, the whole conversation — in one forward pass. It's quick, and it scales well because all your input tokens go through in parallel.
Decode generates the answer one token at a time, each token conditioned on every token before it. It's sequential, and it's where nearly all the wall-clock goes.
Why decode is the slow part

Each decode step streams the model's entire weight set out of GPU memory (HBM) to compute a single token, then does it again for the next. For a frontier model that's hundreds of gigabytes moved per token.
The key fact: decode is memory-bandwidth-bound, not compute-bound. The arithmetic for one token is trivial next to the cost of moving the weights, so the GPU's compute units sit mostly idle, stalled on memory. Prefill is the opposite: it pushes all your input tokens through one weight load in parallel, which keeps the compute units saturated.
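A back-of-the-envelope version of that claim, as a minimal sketch. The model size, memory bandwidth, and peak compute below are made-up round numbers chosen for illustration, not Anthropic's hardware or Opus's real dimensions. It compares the time to read every weight once against the time to do one token's arithmetic.

```python
# Rough estimate of one decode step for a single sequence.
# Every constant here is a hypothetical round number, not a measured figure.

def weight_read_seconds(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on a decode step: each weight byte is read from HBM once."""
    return weight_bytes / hbm_bytes_per_s

def token_math_seconds(params: float, flops_per_s: float) -> float:
    """Arithmetic for one token: roughly 2 FLOPs per parameter."""
    return 2 * params / flops_per_s

PARAMS = 200e9              # hypothetical dense model, 200B parameters
WEIGHT_BYTES = PARAMS * 2   # assuming 16-bit weights (2 bytes each)
HBM_BW = 3e12               # hypothetical 3 TB/s memory bandwidth
PEAK_FLOPS = 1e15           # hypothetical 1 PFLOP/s peak compute

memory_time = weight_read_seconds(WEIGHT_BYTES, HBM_BW)
math_time = token_math_seconds(PARAMS, PEAK_FLOPS)

print(f"weight read per step:  {memory_time * 1e3:.1f} ms")
print(f"math for one token:    {math_time * 1e3:.2f} ms")
print(f"compute busy at batch 1: {math_time / memory_time:.1%}")
```

Under these assumptions the weight read takes a few hundred times longer than the math, which is the gap the rest of this piece is about.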
That idle compute during decode is the slack every inference provider is built to exploit.
The fix: batching

One sequence can't keep the GPU busy during decode. So providers run many users' sequences through the same forward pass — load the weights once, apply them to dozens of sequences at the same time. The expensive part, moving the weights, gets amortized across the whole batch.
That's the entire economics of serving a frontier model. Bigger batch means more tokens per weight-load, which means higher throughput and lower cost per token. It's why Opus is affordable at all.
But batching trades latency for throughput. Your tokens come out at the batch's shared cadence, and as the batch grows, each step has more sequences to process and more KV-cache memory to move, so any one user's tokens decode more slowly. Standard Opus runs large batches: cheap, high-throughput, slower for you.
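The trade-off can be sketched with a toy cost model. It assumes each decode step costs a fixed weight load plus a small per-sequence term. The function names and both timing constants are hypothetical, picked only to show the shape of the curve, not taken from any real serving stack.

```python
# Toy model of batched decode: one fixed weight load per step,
# plus a little extra work for every sequence in the batch.
# The default timings are illustrative assumptions.

def step_time_s(batch: int, weight_load_s: float, per_seq_s: float) -> float:
    """Duration of one decode step for the whole batch."""
    return weight_load_s + batch * per_seq_s

def summarize(batch: int, weight_load_s: float = 0.020, per_seq_s: float = 0.0005) -> dict:
    step = step_time_s(batch, weight_load_s, per_seq_s)
    return {
        "batch": batch,
        "per_user_tps": 1.0 / step,             # each user gets one token per step
        "total_tps": batch / step,              # the whole batch's output rate
        "gpu_seconds_per_token": step / batch,  # proxy for cost per token
    }

small = summarize(8)
large = summarize(64)

for row in (small, large):
    print(
        f"batch {row['batch']:>3}: "
        f"{row['per_user_tps']:5.1f} tok/s per user, "
        f"{row['total_tps']:7.1f} tok/s total, "
        f"{row['gpu_seconds_per_token'] * 1e3:.2f} GPU-ms per token"
    )
```

In this model the large batch produces far more tokens per GPU-second, while each individual user sees fewer tokens per second. The same two numbers move in opposite directions when the batch shrinks.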
What fast mode does differently
Fast mode puts you in a much smaller batch — likely on reserved capacity. Fewer sequences share each forward pass, so your tokens decode faster: higher output tokens per second (OTPS), up to 2.5×.
The cost is the same lever from the other side. That per-step weight-load now amortizes across a handful of sequences instead of a full batch, so your share of each GPU-second jumps. You're not buying more work — the tokens are identical to standard Opus — you're buying a less-shared slice of the GPU. That's the premium: 2× on Opus 4.8, 6× on 4.7.
(The 2.5× is Anthropic's stated number, not independently measured. Smaller batch size is the most likely mechanism, given that output speed rises while time to first token (TTFT) stays flat; Anthropic hasn't confirmed internals.)
The catch: it finishes faster, it doesn't start faster

Fast mode speeds up decode. It doesn't touch prefill, so the wait before the first token (time to first token, or TTFT) is unchanged.
Which means fast mode only helps when there's a lot to write. On a short reply, both speeds finish almost together — you paid 2–6× to save a blink. On a long generation, fast pulls clear. The longer the answer, the more the premium buys you.
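To put rough numbers on that, here is a minimal sketch that models a response as TTFT plus output tokens divided by decode speed. The 2-second TTFT and the 50 tokens-per-second baseline are assumptions for illustration; only the 2.5× multiplier comes from the text above.

```python
# Response time = wait for first token + time to decode the rest.
# TTFT_S and STANDARD_OTPS are hypothetical; 2.5x is the stated speedup.

def response_seconds(output_tokens: int, ttft_s: float, otps: float) -> float:
    """Wall-clock time for a reply of the given length."""
    return ttft_s + output_tokens / otps

TTFT_S = 2.0                      # assumed, identical in both modes
STANDARD_OTPS = 50.0              # assumed baseline decode speed
FAST_OTPS = STANDARD_OTPS * 2.5   # "up to 2.5x" faster decode

for n in (50, 500, 5000):
    std = response_seconds(n, TTFT_S, STANDARD_OTPS)
    fast = response_seconds(n, TTFT_S, FAST_OTPS)
    print(
        f"{n:>5} tokens: standard {std:6.1f}s, fast {fast:6.1f}s, "
        f"saved {std - fast:5.1f}s, end-to-end speedup {std / fast:.2f}x"
    )
```

With these inputs a 50-token reply saves well under a second, while a 5,000-token generation saves about a minute and approaches the full 2.5× end to end.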
When to use it
Turn it ON for:
Long generations where decode dominates — refactors, scaffolding, anything that emits thousands of tokens
Hands-on work where you're waiting on Claude and your time is the expensive part
Time-boxed work where finishing sooner is worth real money
Leave it OFF for:
Short turns — questions, confirmations, one-liners (TTFT-bound; fast barely helps)
Background or autonomous runs nobody is watching
Batch jobs and CI (it's not available there anyway)
The one trap: don't flip it on mid-session
Fast and standard keep separate prompt caches. The moment you switch, your warm standard cache is useless to the fast path — the entire conversation prefix gets re-processed as a cache write at the fast rate, before Claude writes a token.

The deeper into a session you flip it, the bigger the one-time hit:
| Context when you toggle | Re-bill (Opus 4.8) | Re-bill (Opus 4.7) |
|---|---|---|
| 50K tokens | $0.62 | $1.88 |
| 100K tokens | $1.25 | $3.75 |
| 200K tokens | $2.50 | $7.50 |
| 260K (measured) | $3.26 | $9.77 |
I measured the 260K row live: flipping /fast on a 260K-token session re-billed 260,591 tokens as a fast-rate cache write. That is $3.26 on Opus 4.8, charged once, before any useful output.
The fix: turn fast mode on at the start of a session, not at turn 80. Toggling off and back on later doesn't repeat the charge — the fast cache is already built.
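The re-bill figures in the table are consistent with a flat per-token charge on the whole context. The sketch below reproduces them using rates of $12.50 and $37.50 per million tokens. Those two rates are inferred from the table rows themselves, not quoted from a price list, so treat them as assumptions.

```python
# One-time re-bill when toggling fast mode mid-session:
# the full context is charged once as a fast-rate cache write.
# Rates are back-calculated from the table above, not official pricing.

FAST_CACHE_WRITE_USD_PER_MTOK = {
    "Opus 4.8": 12.50,  # inferred: 100K tokens -> $1.25
    "Opus 4.7": 37.50,  # inferred: 100K tokens -> $3.75
}

def rebill_usd(context_tokens: int, usd_per_mtok: float) -> float:
    """Cost of rewriting the whole context into the fast-mode cache."""
    return context_tokens * usd_per_mtok / 1_000_000

for tokens in (50_000, 100_000, 200_000, 260_591):
    costs = ", ".join(
        f"{model}: ${rebill_usd(tokens, rate):.2f}"
        for model, rate in FAST_CACHE_WRITE_USD_PER_MTOK.items()
    )
    print(f"{tokens:>7,} tokens -> {costs}")
```

The charge scales linearly with context size, which is why the same toggle costs cents at the start of a session and dollars at turn 80.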
A few edges
On a subscription, fast mode bills from usage credits only, at the fast rate from the first token.
Hit the fast rate limit and it quietly drops back to standard speed and price.
Fast mode is a research preview; pricing and availability may change.
That's the deal: fast mode is the same model on a less-shared slice of GPU. Faster and pricier are one fact, not two.
