Anthropic's Claude Opus 4.7 migration guide says the new tokenizer uses "roughly 1.0 to 1.35x as many tokens" as 4.6. I measured 1.47x on technical docs. 1.45x on a real CLAUDE.md file. The top of Anthropic's range is where most Claude Code content actually sits, not the middle.
Same sticker price. Same quota. More tokens per prompt. Your Max window burns through faster. Your cached prefix costs more per turn. Your rate limit hits sooner.
So Anthropic must be trading this for something. What? And is it worth it?
I ran two experiments. The first measured the cost. The second measured what Anthropic claimed you'd get back. Here's where it nets out.
What does it cost?
To measure the cost, I used POST /v1/messages/count_tokens — Anthropic's free, no-inference token counter. Same content, both models, one number each per model. The difference is purely the tokenizer.
Two batches of samples.
First: seven samples of real content a Claude Code user actually sends — a CLAUDE.md file, a user prompt, a blog post, a git log, terminal output, a stack trace, a code diff.
Second: twelve synthetic samples spanning content types — English prose, code, structured data, CJK, emoji, math symbols — to see how the ratio varies by kind.
The core loop is three lines of Python:
from anthropic import Anthropic
client = Anthropic()
for model in ["claude-opus-4-6", "claude-opus-4-7"]:
r = client.messages.count_tokens(
model=model,
messages=[{"role": "user", "content": sample_text}],
)
print(f"{model}: {r.input_tokens} tokens")
Real-world Claude Code content
Seven samples pulled from real files a Claude Code user actually sends:
Content type | chars | 4.6 tokens | 4.7 tokens | ratio |
|---|---|---|---|---|
CLAUDE.md (real file, 5KB) | 5,000 | 1,399 | 2,021 | 1.445 |
User prompt (typical Claude Code task) | 4,405 | 1,122 | 1,541 | 1.373 |
Blog post excerpt (Markdown) | 5,000 | 1,209 | 1,654 | 1.368 |
Git commit log | 2,853 | 910 | 1,223 | 1.344 |
Terminal output (pytest run) | 2,210 | 652 | 842 | 1.291 |
Python stack trace | 5,255 | 1,736 | 2,170 | 1.250 |
Code diff | 4,540 | 1,226 | 1,486 | 1.212 |
Weighted ratio across all seven: 1.325x (8,254 → 10,937 tokens).
Content-type baseline (12 synthetic samples)
For comparison across well-defined content types:
Content type | chars | 4.6 | 4.7 | ratio |
|---|---|---|---|---|
Technical docs (English) | 2,541 | 478 | 704 | 1.47 |
Shell script | 2,632 | 1,033 | 1,436 | 1.39 |
TypeScript code | 4,418 | 1,208 | 1,640 | 1.36 |
Spanish prose | 2,529 | 733 | 986 | 1.35 |
Markdown with code blocks | 2,378 | 604 | 812 | 1.34 |
Python code | 3,182 | 864 | 1,112 | 1.29 |
English prose | 2,202 | 508 | 611 | 1.20 |
JSON (dense) | 48,067 | 13,939 | 15,706 | 1.13 |
Tool definitions (JSON Schema) | 2,521 | 738 | 826 | 1.12 |
CSV (numeric) | 9,546 | 5,044 | 5,414 | 1.07 |
Japanese prose | 993 | 856 | 866 | 1.01 |
Chinese prose | 750 | 779 | 789 | 1.01 |
English-and-code subset, weighted: 1.345x. CJK subset: 1.01x on both.
What changed in the tokenizer
Three patterns in the data:
CJK, emoji, and symbol content moved 1.005–1.07x. A wholesale new vocabulary would shift these more uniformly. That didn't happen. Consistent with the non-Latin portions of the vocabulary changing less than the Latin. Token counts don't prove which specific slots were preserved.
English and code moved 1.20–1.47x on natural content. Consistent with 4.7 using shorter or fewer sub-word merges for common English and code patterns than 4.6 did.
Code is hit harder than unique prose (1.29–1.39x vs 1.20x). Code has more repeated high-frequency strings — keywords, imports, identifiers — exactly the patterns a Byte-Pair Encoding trained on code would collapse into long merges.
Chars-per-token on English dropped from 4.33 to 3.60. TypeScript dropped from 3.66 to 2.69. The vocabulary is representing the same text in smaller pieces.
That's a hypothesis, not a proof. Counting tokens doesn't tell you which specific entries in Anthropic's proprietary vocabulary changed.
I write about Claude Code internals every week - context windows, hooks, MCPs, how things actually work.
Why ship a tokenizer that uses more tokens
Anthropic's migration guide: "more literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another."
Smaller tokens force attention over individual words. That's a documented mechanism for tighter instruction following, character-level tasks, and tool-call precision. Partner reports (Notion, Warp, Factory) describe fewer tool errors on long runs.
The tokenizer is one plausible contributor. Weights and post-training also changed. Token counts can't separate them.
Does 4.7 actually follow instructions better?
That's the cost, measured. Now the question: what did Anthropic trade for it?
Their pitch is "more literal instruction following." Plausible, but the token-count data doesn't prove it. I ran a direct test.
IFEval (Zhou et al., Google, 2023) is a benchmark of prompts with verifiable constraints. "Respond in exactly N words." "Include the word X twice." "No commas." "All uppercase." Each constraint has a Python grader. Binary pass/fail.
IFEval ships 541 prompts. I sampled 20 with a fixed seed, ran each through both models, and graded with IFEval's published checker.
The results:
Metric | 4.6 | 4.7 | Delta |
|---|---|---|---|
Strict, prompt-level (all passed) | 17/20 (85%) | 18/20 (90%) | +5pp |
Strict, instruction-level | 25/29 (86%) | 26/29 (90%) | +4pp |
Loose, prompt-level | 18/20 (90%) | 18/20 (90%) | 0 |
Loose, instruction-level | 26/29 (90%) | 26/29 (90%) | 0 |
A small but directionally consistent improvement on strict instruction following. Loose evaluation is flat. Both models already follow the high-level instructions — the strict-mode gap comes down to 4.6 occasionally mishandling exact formatting where 4.7 doesn't.
Only one instruction type moved materially: change_case:english_capital (0/1 → 1/1). Everything else tied. The one prompt that actually separated the models was a four-constraint chain where 4.6 fumbled one and 4.7 got all four.
A few caveats worth naming:
N=20. IFEval has 541 prompts. A 20-prompt sample is enough to see direction, not enough to be confident about size. A +5pp delta at N=20 is consistent with anything from "no real difference" to "real +10pp improvement."
This measures the net effect of 4.6 → 4.7. Tokenizer, weights, and post-training all changed. I can't isolate which one drove the +5pp. The causal link between "smaller tokens" and "better instruction following" remains a hypothesis.
Single generation per prompt. Multiple runs per prompt would tighten the estimate.
So: 4.7 follows strict instructions a few points better than 4.6 on this subset. Small effect, small sample. Not the "dramatic improvement" framing Anthropic's partners used in launch quotes — at least not on this benchmark.
The extra tokens bought something measurable. +5pp on strict instruction-following. Small. Real. So: is that worth 1.3–1.45x more tokens per prompt? Here's the cost, session by session.
Dollar math for one Claude Code session
Imagine a long Claude Code session — 80 turns of back-and-forth on a bug fix or refactor.
The setup (what's in your context each turn):
Static prefix: 2K CLAUDE.md + 4K tool definitions = 6K tokens, same every turn
Conversation history: grows ~2K per turn (500-token user message + 1,500-token reply), reaches ~160K by turn 80
User input: ~500 fresh tokens per turn
Output: ~1,500 tokens per turn
Cache hit rate: ~95% (typical within the 5-minute TTL)
One thing to explain upfront: the average cached prefix across the 80 turns is ~86K tokens, not 6K. The static 6K is tiny; the average history across all turns (0 at turn 1, 160K at turn 80, average ~80K) dominates. Since most of the cache-read cost happens in late turns where the history is huge, that ~86K average is what actually gets billed per turn.
4.6 session cost
Line item | Math | Cost |
|---|---|---|
Turn 1 cache-write | 8K × $6.25/MTok | $0.05 |
Turns 2–80 cache reads | 79 × 86K × $0.50/MTok | $3.40 |
Fresh user input | 79 × 500 × $5/MTok | $0.20 |
Output | 80 × 1,500 × $25/MTok | $3.00 |
Total | ~$6.65 |
Cache reads dominate input cost. Output dominates overall.
4.7 session cost
Every token in the prefix scales by its content ratio:
CLAUDE.md: 1.445x → 2K becomes 2.9K
Tool defs: 1.12x → 4K becomes 4.5K
Conversation history (mostly English and code): 1.325x → 160K becomes 212K by turn 80, averaging ~106K across the session
User input: 1.325x → 500 becomes ~660
Average cached prefix on 4.7: ~115K tokens (up from 86K). Output tokens are a wildcard — roughly the same as 4.6, up to ~30% higher if Claude Code's new xhigh default produces more thinking tokens.
Line item | Math | Cost |
|---|---|---|
Turn 1 cache-write | 10K × $6.25/MTok | $0.06 |
Turns 2–80 cache reads | 79 × 115K × $0.50/MTok | $4.54 |
Fresh user input | 79 × 660 × $5/MTok | $0.26 |
Output | 80 × 1,500–1,950 × $25/MTok | $3.00–$3.90 |
Total | ~$7.86–$8.76 |
The delta
~$6.65 → ~$7.86–$8.76. Roughly 20–30% more per session.
The per-token price didn't change. The per-session cost did, because the same session packs more tokens.
For Max-plan users hitting rate limits instead of dollars: your 5-hour window ends sooner by roughly the same ratio on English-heavy work. A session that ran the full window on 4.6 probably doesn't on 4.7.
How this hits the prompt cache
Prompt caching is the architecture Claude Code runs on.
The 4.7 tokenizer change interacts with caching in three ways:
First 4.7 session starts cold. Anthropic's prompt cache is partitioned per model — switching from 4.6 to 4.7 invalidates every cached prefix, the same way switching between Opus and Sonnet does. The tokenizer change doesn't cause this, but it makes the cold-start more expensive: the prefix you're writing to the new cache is 1.3–1.45x larger than the 4.6 equivalent.
Cache volume grows by the token ratio. 1.445x more tokens in the CLAUDE.md portion means 1.445x more tokens paying cache-write once, and 1.445x more paying cache-read every turn after. The mechanism still works. There's just more of it to pay for.
Same transcript, different count. Re-run a 4.6 session on 4.7 and your logs show a different number. If you baseline billing or observability off historical token counts, expect a step-change the day you flip the model ID.
Objections
"Input is mostly cache reads. The per-token cost barely changed."
Legitimate. In a session that stays within the 5-minute TTL, 96% of input is cache reads at $0.50/MTok — already 90% off nominal. A 1.325x ratio on the cached portion is a smaller dollar impact than on fresh input.
But Max plans count all tokens toward rate limits, not dollars. And several patterns hit uncached territory: first session after a TTL expiry, every cache-bust event (CLAUDE.md edits, tool-list changes, model switches), and every compaction event that rewrites the prefix. On those turns you pay the full ratio on the cache-write. The steady-state is a bright spot. The edges got noisier.
"Anthropic documented 1.0–1.35x as a range, not a hard ceiling."
Agreed. The real-world weighted ratio (1.325x) lands near the top of their range. Individual file types exceed it — CLAUDE.md at 1.445x, technical docs at 1.473x. That's the useful finding: the top of the documented range is where most Claude Code content sits, not the middle. Plan around the upper range, not the average.
So: tokens are 1.3–1.45x more expensive on English and code. Anthropic bought you +5pp on strict instruction following. The sticker price didn't change. The effective per-session cost did.
Is it worth it? That depends on what you send. You're paying ~20–30% more per session for a small but real improvement in how literally the model follows your prompt.
