Update, March 13, 2026: Anthropic removed the long-context pricing premium for Opus 4.6 and Sonnet 4.6. The 1M window is now GA at standard rates — no 2x multiplier. The pricing cliff below no longer exists for these models.

New recommendation: use 1M whenever you need it. It's free to select and costs nothing extra per token.

I Measured Claude's 1M Context Window. For Long Sessions, Stick to 200K.

Claude Opus 4.6 and Sonnet 4.6 now support a 1 million token context window — roughly 750,000 words, a mid-sized codebase, or a 200-turn session without ever hitting compaction.

I sent requests at increasing context sizes (50K, 100K, 200K, 400K, and 600K tokens) and measured retrieval accuracy, latency, cost, and cache behavior at each step. I planted needles in haystacks.

The 1M window is genuinely useful, but not for the reasons most people think, and not for most sessions.

The short version: For most daily Claude Code sessions, you don't need the 1M window. The one case where it genuinely helps is feeding a large codebase or document in a single request. For long conversations, it costs 3x more and the quality tradeoffs are real. The rest of this post is the evidence.

How context windows work (the short version)

Every time you send a message in Claude Code, the entire conversation gets re-sent to the API. Turn 1 sends the system prompt, tools, CLAUDE.md, and your message. Turn 50 sends all of that plus 49 rounds of conversation. I covered this in detail in the prompt caching deep dive.
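To make that concrete, here's a minimal sketch of the loop using the anthropic Python SDK (the model name and prompts are placeholders): every turn appends to the same messages list, and the whole list goes back over the wire.

import anthropic

client = anthropic.Anthropic()
messages = []  # the full conversation; re-sent in its entirety on every request

for user_input in ["Read src/auth.py", "Now fix the token refresh bug"]:
    messages.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-opus-4-6",        # placeholder model name
        max_tokens=1024,
        system="You are a coding assistant.",
        messages=messages,              # turn N carries all previous turns
    )
    messages.append({"role": "assistant", "content": response.content})
    print(response.usage.input_tokens)  # this number climbs every turn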

The context window is the hard limit on how much the model can hold in a single request. At 200K tokens, you hit that limit around turn 50-80, depending on how much tool output accumulates. When you hit it, Claude Code fires auto-compaction: it summarizes the conversation, drops the full history, and continues from the summary.

At 1M tokens, you have 5x the space. In theory, you'd never compact. Everything stays in memory. No summaries, no lost context, no quality degradation from forgotten instructions.

Why the 1M window took this long to build

If a 1M context window is so useful, why didn't models have it three years ago? The answer isn't that anyone was holding back. It's that the core algorithm inside every transformer makes long sequences genuinely expensive — and getting around that took real engineering work.

The O(n²) problem

Every transformer processes text by having each token compare itself to every other token. It's called attention, and it's what makes these models good at reasoning about relationships across long text.

The problem: it scales quadratically. Double the sequence length, quadruple the memory. That's the classic O(n²) curve that every engineer learns to avoid.
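To put numbers on the quadratic, here's the back-of-the-envelope size of the raw n x n score matrix, per attention head, per layer, at fp16, ignoring every optimization real systems use:

for n in (4_000, 100_000, 1_000_000):
    gigabytes = n * n * 2 / 1e9          # 2 bytes per fp16 entry
    print(f"{n:>9,} tokens -> {gigabytes:>8,.2f} GB")
# 4,000 -> 0.03 GB; 100,000 -> 20.00 GB; 1,000,000 -> 2,000.00 GB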

Early models weren't limited by imagination. They were limited by hardware. In 2020, training GPT-3 at 4K context already pushed what was affordable. A 100K context would have cost hundreds of times more per training run.

The unlock: an algorithmic fix (2022)

In 2022, Stanford researchers published a paper called FlashAttention. The insight wasn't new math (the result is identical to standard attention) but the observation that attention is slow not because of the computation itself, but because of how much data gets shuttled between fast on-chip memory and slow off-chip memory.

If you've ever optimized a tight loop by improving cache locality, the idea is the same. FlashAttention tiles the computation so it stays in fast memory longer. Memory usage drops from O(n²) to O(n). Training long sequences becomes tractable.
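Here's a toy illustration of the tiling idea in NumPy. This is not the real kernel (FlashAttention is a fused GPU kernel, and the real version tiles queries too); it just shows the shape of the trick: process keys in blocks, keep a running softmax, and never materialize the full n x n score matrix.

import numpy as np

def tiled_attention(Q, K, V, block=128):
    """Matches standard attention up to floating-point error, but only one
    (n, block) tile of scores exists at a time instead of the full (n, n) matrix."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                       # one tile of scores
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(running_max - new_max)           # rescale earlier accumulators
        p = np.exp(s - new_max)
        out = out * scale + p @ Vb
        running_sum = running_sum * scale + p.sum(axis=-1, keepdims=True)
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 1024, 64))
print(tiled_attention(Q, K, V).shape)   # (1024, 64), same output as standard attention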

This is the single biggest reason 100K, 200K, and 1M contexts became possible. The hardware didn't change much. The algorithm did.

Context window evolution: the timeline

FlashAttention opened the door. What followed was a rapid expansion — each major jump enabled by a combination of better training infrastructure and the willingness to spend compute on longer sequences.

Claude 2's jump to 100K in 2023 was the first major inflection point: a 25x leap from GPT-3.5 that nobody saw coming. Gemini 1.5's jump to 1M in 2024 was the second, and critically, Google built that model with 1M in mind from the start of pretraining.

That distinction matters, as we'll see next.

Having a big window and being able to use it are different things

A model doesn't just need to receive a long context. It needs to have learned how to use one.

During pretraining, the model sees sequences up to some maximum length and learns to handle them. Feed it anything longer, and it hits positions it has never seen before. Performance drops, sometimes sharply.

Think of it like asking a developer who's only ever reviewed 100-line diffs to suddenly review a 10,000-line diff. The skill doesn't automatically transfer.

On Anthropic's MRCR retrieval benchmark (more on this below), Opus 4.6 scores 93% at 256K and 76% at 1M, while Sonnet 4.5 drops to 18.5% at 1M. Just having a 1M window doesn't mean the model can use it.

Why the middle of your context gets ignored

One more effect worth knowing: models don't pay equal attention to everything in their context window. They consistently over-attend to the beginning and end, and under-attend to the middle. Researchers call this "lost in the middle."

The code review analogy holds here: if you're reviewing a 500-line PR, the first few files grab your attention, the last few are fresh in your mind, but the middle files tend to get skimmed. Models have the same bias — and it's baked into the weights during pretraining.

In practice: if you're using 1M context and have critical information the model must use, put it at the beginning or end of your context. Middle placement is a gamble.
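If you're assembling a long prompt yourself, the placement advice looks something like this. The document chunks, the critical instruction, and the question are all made up for illustration:

document_chunks = ["<chunk 1 of the contract>", "<chunk 2>", "<chunk 3>"]   # the ~500K-token bulk
critical = "Only cite figures from the FY2025 audited report; flag anything else as unverified."

prompt = "\n\n".join([
    critical,               # beginning: high-attention zone
    *document_chunks,       # the middle, where attention is weakest
    critical,               # restated at the end, the other high-attention zone
    "Question: Which subsidiaries changed auditors this year?",
])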

How to enable it

In Claude Code

Use the /model command with the [1m] suffix:

/model opus[1m]
/model sonnet[1m]
/model claude-opus-4-6[1m]
/model claude-sonnet-4-6[1m]

If your account supports 1M context, these options appear in the model picker.

Billing doesn't change just because you selected a 1M model. You pay standard rates until your context exceeds 200K tokens. After that, premium rates kick in.

The pricing cliff

This is the most important thing to understand about 1M context. The pricing is not linear. It's a step function with a hard threshold at 200K input tokens.

Why long context costs more to serve

Before the pricing table: the 2x multiplier isn't arbitrary. There's real infrastructure behind it.

Every token in your context requires the model to store a small chunk of numerical state in GPU memory — think of it like RAM for the model's working memory. At standard context sizes, Anthropic can run many sessions in parallel on the same hardware. Each session's memory footprint is manageable.

At 1M tokens, each active session needs hundreds of gigabytes of GPU memory just to hold that state. That memory isn't shared with other users; it's reserved for your session alone, for as long as you're in it. Fewer sessions fit per server, so each session costs more.
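That per-token state is the model's KV cache, and the math gets large quickly. Anthropic doesn't publish Opus's architecture, so the dimensions below are illustrative guesses, but they show how you get to hundreds of gigabytes:

layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2       # assumed dimensions, fp16
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value    # keys + values
print(per_token / 1e3, "KB per token")                            # ~328 KB
print(1_000_000 * per_token / 1e9, "GB at 1M tokens")             # ~330 GB of GPU memory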

Then there's the cold start problem. When you first open a 1M session with no cache, the model processes every token from scratch. At 1M tokens, that's roughly 25x more compute than a 200K cold start (5x the tokens, and attention cost grows with the square), and it all happens on Anthropic's servers before your first token arrives.

The cache changes the picture significantly once it's warm. Cached reads skip the recomputation; that's why TTFT drops from 35 seconds to 3.5 seconds at 500K context (see Experiment 3). But the memory cost doesn't go away. Even cached sessions are holding all that state in GPU RAM.

The 2x premium is Anthropic passing through the infrastructure math. More GPU memory per session + more compute for cold prefill = higher cost per request.

Rate                 Standard (up to 200K input)   Long context (over 200K input)   Multiplier
──────────────────────────────────────────────────────────────────────────────────────────────
Opus 4.6 input       $5.00/M                       $10.00/M                         2x
Opus 4.6 output      $25.00/M                      $37.50/M                         1.5x
Sonnet 4.6 input     $3.00/M                       $6.00/M                          2x
Sonnet 4.6 output    $15.00/M                      $22.50/M                         1.5x

The critical detail: When you cross 200K, ALL tokens get premium pricing. Not just the ones above the threshold. A request with 201K input tokens pays the premium rate on all 201K tokens.

199,000 input tokens on Opus:  199K × $5.00/M  = $0.995
201,000 input tokens on Opus:  201K × $10.00/M = $2.010

Cost to cross the threshold: $1.015 for 2,000 extra tokens.
Effective rate for those 2K tokens: $507.50/M
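The same step function as a tiny helper, if you want to sanity-check your own sessions (Opus 4.6 input rates from the table above):

def opus_input_cost_usd(input_tokens: int) -> float:
    """Premium rate applies to ALL input tokens once the request crosses 200K."""
    rate = 10.00 if input_tokens > 200_000 else 5.00   # $ per million tokens
    return input_tokens / 1_000_000 * rate

print(opus_input_cost_usd(199_000))   # $0.995
print(opus_input_cost_usd(201_000))   # $2.01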

This matters for prompt caching too. Cache reads and cache writes both get the premium multiplier once you're in long-context territory:

Category              Standard      Long context
─────────────────────────────────────────────────
Opus cache read       $0.50/M       $1.00/M
Opus cache write      $6.25/M       $12.50/M
Sonnet cache read     $0.30/M       $0.60/M
Sonnet cache write    $3.75/M       $7.50/M

The 90% cache discount still applies at long-context rates, but it's 90% off a higher base price. Cache reads on Opus go from $0.50/M to $1.00/M.

One exception: fast mode. Fast mode pricing ($30/$150 per MTok for input/output) applies uniformly across the full 1M context window — no additional long-context surcharge stacks on top. But fast mode is always billed as extra usage, even on Max subscriptions. It's not included in your plan's rate limits. You're trading the 2x long-context multiplier for the 6x fast mode multiplier.

The experiments

I wrote a script that makes real API calls and measures what happens as context grows. Here's what I found.

Experiment 1: The pricing cliff in practice

I sent requests at increasing context sizes (50K, 100K, 150K, 199K, and 250K tokens) and logged the usage fields from the API response.

The first four requests stay under 200K and pay standard rates. The fifth crosses the threshold.
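A compressed sketch of that measurement, using the anthropic Python SDK. The padding construction and the model name are placeholders; the usage fields are what matter:

import anthropic

client = anthropic.Anthropic()
SENTENCE = "The operations committee reviewed the quarterly procedure document in detail. "

for target_tokens in (50_000, 100_000, 150_000, 199_000, 250_000):
    padding = SENTENCE * (target_tokens // 14)      # ~14 tokens per sentence, rough
    response = client.messages.create(
        model="claude-opus-4-6",                    # placeholder; assumes 1M context is enabled
        max_tokens=64,
        system=[{"type": "text", "text": padding,
                 "cache_control": {"type": "ephemeral"}}],   # cache the bulk of the context
        messages=[{"role": "user", "content": "Reply with OK."}],
    )
    u = response.usage
    print(target_tokens, u.input_tokens,
          u.cache_creation_input_tokens, u.cache_read_input_tokens)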

Between 199K and 250K, the cache read cost jumps from $0.099 to $0.250 — a 2.5x increase for 25% more context. The step function is real and abrupt.

Experiment 2: Needle in a haystack

I planted a specific fact (a made-up employee ID and bonus amount) at various positions within increasingly large contexts filled with procedural corporate text. Then I asked the model to retrieve it.

The "haystack" is synthetic business documents: quarterly reports, HR policies, engineering specs. The "needle" is one sentence: "Employee Sarah Chen (ID: EMP-7429) received a performance bonus of $8,750 in Q3."

The script tests 50K, 100K, and 200K by default. The 400K and 600K rows require tier 4 access (RUN_LARGE=1) and cost $5+ each at Opus long-context rates.
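The core of that script looks roughly like this. Filler text, sizing, and the answer check are simplified, and the model name is a placeholder:

import anthropic

NEEDLE = "Employee Sarah Chen (ID: EMP-7429) received a performance bonus of $8,750 in Q3."
FILLER = "The engineering review board approved the updated deployment checklist. " * 5000

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.25, 0.50, 0.75) of the filler."""
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + needle + "\n" + filler[cut:]

client = anthropic.Anthropic()
for depth in (0.25, 0.50, 0.75):
    response = client.messages.create(
        model="claude-opus-4-6",        # placeholder model name
        max_tokens=128,
        messages=[{"role": "user", "content": build_haystack(FILLER, NEEDLE, depth)
                   + "\n\nWhat bonus did employee EMP-7429 receive, and in which quarter?"}],
    )
    answer = response.content[0].text
    print(depth, "8,750" in answer and "Q3" in answer)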

Opus 4.6 results:

Context    Needle at 25%    Needle at 50%    Needle at 75%
──────────────────────────────────────────────────────────
  50K      ✓ correct        ✓ correct        ✓ correct
 100K      ✓ correct        ✓ correct        ✓ correct
 200K      ✓ correct        ✓ correct        ✓ correct
 400K*     ✓ correct        ✓ correct        ✓ correct
 600K*     ✓ correct        ✓ correct        ~ partial

* Requires tier 4 access (RUN_LARGE=1)

Opus 4.6 nails it up to 400K with perfect accuracy. At 600K, retrieval at 75% depth starts getting fuzzy — it returns the correct employee name but sometimes hallucinates the bonus amount or misattributes it.

Sonnet 4.6 results:

Context    Needle at 25%    Needle at 50%    Needle at 75%
──────────────────────────────────────────────────────────
  50K      ✓ correct        ✓ correct        ✓ correct
 100K      ✓ correct        ✓ correct        ✓ correct
 200K      ✓ correct        ✓ correct        ~ partial
 400K*     ~ partial        ✗ missed         ✗ missed
 600K*     ✗ missed         ✗ missed         ✗ missed

* Requires tier 4 access (RUN_LARGE=1)

Sonnet degrades much faster. By 400K, it's unreliable. By 600K, it's guessing.

Anthropic's own MRCR v2 numbers confirm this. On the 8-needle 1M variant, as published in Anthropic's Opus 4.6 announcement:

Model           MRCR Score
───────────────────────────
Opus 4.6        76.0%
Gemini 3 Pro    26.3%
Sonnet 4.5      18.5%

A note on the Gemini number: Gemini 3 Pro is a newer model than Gemini 1.5 Pro — the one that was specifically pretrained for 1M context from scratch and achieved near-perfect recall in Google's own tests. These are different generations of the same family; the 26.3% is Gemini 3 Pro's score as measured by Anthropic on their benchmark, not a reflection of what Gemini 1.5 Pro achieved.

Opus scores 93% at 256K and 76% at 1M. Sonnet 4.5 drops to 18.5% at 1M. Anthropic hasn't published MRCR scores for Sonnet 4.6 yet; it may perform better than 4.5, but until we have numbers, caution is warranted. Just having a 1M window doesn't mean the model can use it.

Researchers call this "context rot": the well-documented pattern where attention quality degrades as sequences grow longer. Every transformer has it to some degree. Opus 4.6 resists it much better than previous models.

Experiment 3: Latency

I measured time to first token (TTFT) at increasing context sizes using streaming requests. For each size, I ran three calls: a cache-priming write, a cached read (warm), and an uncached cold request with a unique system prompt to prevent cache reuse.
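The timing itself is simple: start a streaming request and record how long until the first text chunk arrives. A sketch, with the model name as a placeholder:

import time
import anthropic

client = anthropic.Anthropic()

def time_to_first_token(context_text: str) -> float:
    """TTFT for one streaming request; call with and without a warm cache to compare."""
    start = time.monotonic()
    with client.messages.stream(
        model="claude-opus-4-6",        # placeholder model name
        max_tokens=64,
        system=context_text,
        messages=[{"role": "user", "content": "Reply with OK."}],
    ) as stream:
        for _ in stream.text_stream:    # first chunk of generated text
            return time.monotonic() - start
    return time.monotonic() - start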

The script tests 50K, 100K, and 200K by default. The 300K and 500K rows require tier 4 access (RUN_LARGE=1). Run the experiment yourself to get numbers for your network conditions — the table below shows the pattern, not universal constants.

Context     TTFT (cached)    TTFT (cold)
────────────────────────────────────────
  50K        ~0.8s            ~2s
 100K        ~1.1s            ~4s
 200K        ~1.6s            ~9s
 300K*       ~2.2s            ~16s
 500K*       ~3.5s            ~35s

* Requires tier 4 access (RUN_LARGE=1)
  cold = no cache, model processes all tokens from scratch
  cached = cache warm from previous request

Two things stand out:

  1. Cold prefill scales super-linearly. The more context, the disproportionately longer the wait. At 500K cold tokens, you're waiting 30+ seconds before the model starts responding. At 1M, extrapolating from this curve (power law exponent ~1.24), you're likely waiting 60-90 seconds before the model starts responding.

  2. Cached prefill is fast. With a warm cache, even 500K context only adds a few seconds. The model loads its saved state instead of reprocessing everything from scratch. This is what prompt caching is under the hood — the model stores the processed state of your context so that subsequent turns don't have to re-read it all. (If you haven't read the prompt caching deep dive, this is worth understanding before going deep on 1M context.)

The practical implication: your first message in a 1M context session is slow. Subsequent messages are fine — as long as the cache stays warm (5-minute TTL, reset on each use).

If you go AFK for 6 minutes mid-session and the cache expires, your next message at 500K context will take 30+ seconds to start responding. That's the cold restart penalty.

When to use the 1M window

Use it for:

Single-shot large document analysis — feed an entire codebase, contract, or research corpus in one request. The model reads everything once, reasons over it, and responds. Context rot is minimal because there's no multi-turn conversation diluting attention. This is the use case it was built for.

Deep debugging sessions where you can't afford to lose context. When you're chasing a bug across 15 files and the full stack trace, reproduction steps, and failed hypotheses all matter. Compaction would destroy exactly the information you need.

Agent teams with large shared state. When multiple agents are reading files, sharing findings, and building on each other's work, the accumulated context grows fast. The 1M window lets a team lead hold all agent reports without compacting.

Compliance and audit work where exact quotes matter. When you need the model to cite specific passages from a 300-page contract or a large policy document, 1M lets you put the whole thing in context at once rather than chunking it into pieces and hoping nothing falls through the gap.

Don't use it for:

Routine Claude Code sessions. My data showed most sessions peak at 80-120K context before compaction. They never approach 200K, let alone need 1M. You'd be selecting a 1M model and paying standard rates anyway — same as the 200K model.

Long sessions where you'd benefit from a fresh start anyway. After 80+ turns, the model often benefits from a clean slate. Stale context from early turns can actively hurt. The model wastes attention on irrelevant early exploration instead of focusing on the current task. A /clear + fresh start is often better than both compaction and 1M context.

Sonnet 4.5 at 1M. Sonnet 4.5's MRCR score is 18.5% at 1M — you'd be paying 2x premium rates for context the model can barely use. Sonnet 4.6 may be better (no published MRCR scores yet), but until we have numbers, default to Opus for long-context retrieval tasks.

Sessions where you go AFK frequently. The 5-minute cache TTL means a cold restart at 500K context takes 30+ seconds. At 1M, extrapolating from the latency data, you're looking at 60-90 seconds. If you're context-switching between tasks and coming back after coffee breaks, the latency penalty is painful.

The decision framework

Five things worth knowing before you flip the switch

  1. Selecting opus[1m] costs nothing extra while your context stays under 200K. There's no harm in having it available — the 1M model behaves identically to the standard model until you cross the threshold.

  2. Once you cross 200K, the premium is 2x on all tokens including cached ones. A session at 400K context costs roughly 5x per turn compared to a post-compaction session at 80K. The compaction tax ($0.21) is cheap; the long-context tax is not.

  3. Use Opus, not Sonnet, for long-context work. Opus 4.6 scores 76% on MRCR at 1M tokens; Sonnet 4.5 scores 18.5%. You're paying premium rates either way — Opus can actually use the context.

  4. The sweet spot is single-shot analysis. Feed a large codebase or document corpus in one request. Context rot is minimal, you pay premium rates once instead of per-turn, and there's no cache management to worry about.

  5. For most coding tasks, better session management beats a bigger window. Intentional /clear boundaries, subagents for exploration, the dump-and-clear pattern — these keep context small, caches warm, and costs low. A 50K context with sharp attention will outperform a 500K context with diluted attention on most coding tasks.
