Claude Opus 4.7 reads images up to 2576 pixels on the long edge. Earlier models capped at 1568. A reader asked what image tokenization means and whether the higher ceiling is worth it. Short answer: it depends on what you're doing with the image. The rest of this post is how to tell.

How image tokenization works

When you send Claude an image, the API doesn't read pixels directly. It converts the image into tokens, the same unit of measurement Claude uses for text, and bills you for them.

The public formula is simple:

tokens ≈ (width × height) / 750

A 1000×1000 image is ~1,334 tokens. A 2000×2000 image is ~5,334 tokens. Pay-per-pixel, roughly.
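That arithmetic is easy to sanity-check. A minimal sketch of the documented estimate, rounding up since billing can't charge fractional tokens:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Documented approximation; actual billed counts can drift slightly."""
    return math.ceil(width * height / 750)

print(estimate_image_tokens(1000, 1000))  # 1334
print(estimate_image_tokens(2000, 2000))  # 5334
```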

Two details matter.

First, every Claude model has a max image resolution. Before 4.7, that limit was 1568 pixels on the long edge. Opus 4.7 raised it to 2576 pixels.

Second, if your image is larger than the max, the API downscales it before counting tokens. From Anthropic's vision docs:

If your image's long edge is more than 1568 pixels, or your image is more than ~1,600 tokens, it is first scaled down, preserving aspect ratio.

This means on Claude 4.6 and earlier, every image above ~1.15 megapixels arrived at the model's vision encoder at the same reduced resolution, no matter how big you uploaded it. The bill capped around 1,560 tokens per image. Upload a 6MP screenshot, get billed for the 1.15MP version.
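The downscale is reproducible from the two documented caps. A sketch of the pre-4.7 resize, assuming the 1568px long-edge limit and the ~1,600-token ceiling from the docs; the exact rounding behavior is my guess:

```python
import math

def downscaled_dims(width: int, height: int,
                    max_long_edge: int = 1568,
                    max_tokens: int = 1600) -> tuple[int, int]:
    # Fit the long-edge cap, preserving aspect ratio.
    scale = min(1.0, max_long_edge / max(width, height))
    # Also fit the token cap, using tokens ≈ (w × h) / 750.
    scale = min(scale, math.sqrt(max_tokens * 750 / (width * height)))
    return round(width * scale), round(height * scale)

# A 6MP screenshot arrives at roughly 1.15MP before tokenization.
w, h = downscaled_dims(3000, 2000)
print(w, h, math.ceil(w * h / 750))
```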

I measured this directly. Four images, two long-edge targets (1568 and 2576), both models, sixteen count_tokens calls. The 4.6 column was flat:

| Image | 4.6 at 1568px | 4.6 at 2576px |
|---|---|---|
| Code/UI screenshot | 1,242 | 1,242 |
| Dense UI dashboard | 1,558 | 1,558 |
| Technical diagram | 1,560 | 1,560 |
| Unsplash photo | 1,561 | 1,561 |

Same image, 64% more pixels on the long edge, ~2.7x the pixel count, identical token count. The downscale is doing exactly what the docs say.
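The measurement loop is cheap to reproduce with the count_tokens endpoint. A sketch of the request payload; the model id and file name are placeholders:

```python
import base64

def image_message(path: str, media_type: str = "image/png") -> dict:
    """Build a user message carrying one base64-encoded image."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [{
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }],
    }

# With the anthropic SDK:
#   client = anthropic.Anthropic()
#   count = client.messages.count_tokens(
#       model="claude-opus-4-6",  # placeholder model id
#       messages=[image_message("screenshot.png")],
#   )
#   print(count.input_tokens)
```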

What changed in Opus 4.7

Three things.

First, the max resolution went from 1568 to 2576 pixels on the long edge. That's about 2.7x the pixel count at the new ceiling.

Second, coordinate mapping is now 1:1 with actual pixels. If you ask 4.7 to point at something in an image, the coordinates it returns are the coordinates in your original image — no scale-factor math needed. This is new in 4.7 and genuinely useful for computer-use agents that need pixel-accurate clicks.

Third, the vision benchmarks jumped. Anthropic's release notes report ScreenSpot Pro moving from ~70 to 87.6%, visual reasoning 69.1 → 82.1%, and document-analysis scores climbing across the board. The model actually sees better, not just bigger.

The tokenization approach itself didn't change. Same (w × h) / 750 formula, same patch-based vision encoder (as best I can tell from the outside). What changed is the cap on how many pixels it gets to work with.

The cost

Removing the cap means a full-resolution image at 2576px on the long edge now bills for the pixels you send, not the 1.15MP version of them. Here's what that looks like across the same four images:

| Image | 4.6 tokens | 4.7 tokens | Ratio |
|---|---|---|---|
| Code/UI screenshot | 1,242 | 3,234 | 2.60x |
| Dense UI dashboard | 1,558 | 4,739 | 3.04x |
| Technical diagram | 1,560 | 4,766 | 3.06x |
| Unsplash photo | 1,561 | 4,745 | 3.04x |

At Opus input pricing of $15 per million tokens, a typical screenshot jumps from $0.023 to $0.071. A session with 50 images goes from $1.17 to $3.55 in image input alone. Ship vision at scale and the bill has roughly tripled.
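The arithmetic behind those numbers, as a sketch at the stated $15-per-million input price:

```python
PRICE_PER_INPUT_TOKEN = 15 / 1_000_000  # Opus input pricing, $/token

def image_input_cost(tokens_per_image: int, n_images: int = 1) -> float:
    return tokens_per_image * n_images * PRICE_PER_INPUT_TOKEN

print(round(image_input_cost(1_558), 3))      # dashboard on 4.6: 0.023
print(round(image_input_cost(4_739), 3))      # same image on 4.7: 0.071
print(round(image_input_cost(4_739, 50), 2))  # 50-image session: 3.55
```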

This lands in a moment where token burn is already the dominant Claude Code conversation. Max 20x users burning through monthly caps in days, prompt-cache regressions, effort-level defaults that burn tokens through retry loops. If you're already watching your bills, image costs are another thing to track.

Does the higher resolution actually help?

The new ceiling costs more. The question is whether it does more.

I ran two capability tests.

First, text extraction on a dense analytics dashboard — 42 distinct metrics, labels, percentages, micro-copy, and a disclaimer footer. Two runs per resolution.

  • 1568px (2,724 input tokens): 41 unique items

  • 2576px (4,817 input tokens): 42 unique items

  • Intersection: 41 items, identical

The one item that appeared only at 2576 was "hi abhishek 🙂", a greeting the 1568 pass had merged into an adjacent block. Not fine print the lower resolution failed to read.

That's one image. Fair critique: maybe the dashboard didn't have text small enough to break 1568. So I built a second test.

A 2576×1800 canvas with 12 target strings at six font sizes (100, 48, 24, 16, 12, 8 px) and two content types: English words ("ZIRCONIUM", "BASALT") and random codes ("9K7-QZ4", "D7F-2LB"). The random codes matter — a language model can reconstruct "BASALT" from partial visual signal using priors, but it can't do that for "D7F-2LB." Downscale to 1568 long edge and 8px source text renders at ~4.9 effective pixels, which is where OCR should break.
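The canvas is simple to regenerate. A sketch with Pillow; the font path and the word/code pairs beyond the four named above are placeholders:

```python
from PIL import Image, ImageDraw, ImageFont

# Six font sizes from the test. Only the first two word/code pairs
# appear in the post; the rest are placeholders.
CELLS = [
    (100, "ZIRCONIUM", "9K7-QZ4"),
    (48,  "BASALT",    "D7F-2LB"),
    (24,  "GRANITE",   "X3M-8PT"),
    (16,  "OBSIDIAN",  "R5W-1KD"),
    (12,  "FELDSPAR",  "J2N-6VH"),
    (8,   "QUARTZITE", "T8C-4GY"),
]

def load_font(px):
    try:
        return ImageFont.truetype("DejaVuSans.ttf", px)  # font is an assumption
    except OSError:
        return ImageFont.load_default()  # fallback ignores the size

canvas = Image.new("RGB", (2576, 1800), "white")
draw = ImageDraw.Draw(canvas)
y = 100
for px, word, code in CELLS:
    font = load_font(px)
    draw.text((100, y), word, fill="black", font=font)
    draw.text((1400, y), code, fill="black", font=font)
    y += px + 140

canvas.save("ocr-probe-2576.png")
small = canvas.copy()
small.thumbnail((1568, 1568), Image.LANCZOS)  # the 1568 condition
small.save("ocr-probe-1568.png")
```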

Across 2 resolutions × 2 runs × 12 cells = 48 extractions:

| Font size at source | 1568px correct | 2576px correct |
|---|---|---|
| 100px (word + code) | 2/2 | 2/2 |
| 48px (word + code) | 2/2 | 2/2 |
| 24px (word + code) | 2/2 | 2/2 |
| 16px (word + code) | 2/2 | 2/2 |
| 12px (word + code) | 2/2 | 2/2 |
| 8px (word + code) | 2/2 | 2/2 |

Every cell passed. Even 8px random codes read clean after the 1568 downscale. For OCR-flavored tasks — reading text off images — 1568px saturates. The extra pixels don't help.

That doesn't mean they're not helping somewhere.

Where the extra pixels matter

Anthropic's 4.7 release notes name three workloads for the high-resolution support: computer use, screenshot/artifact/document understanding, and the new 1:1 pixel-to-coordinate mapping. Those aren't marketing flourishes — they're the workloads where the extra pixels pay off.

Computer-use agents navigate live interfaces through screenshots. They need to click a specific button, drag a specific element, read a specific timestamp in the corner of a tab. The 1:1 pixel mapping means an agent can say "click (847, 312)" and the coordinate is in the actual image — no scale math. Combined with the higher resolution, agents reading dense UIs can now see individual pixels that a 1568px downscale blurs together. This is a real capability gain for any agent workflow that operates through a browser or an app.
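To see what the 1:1 mapping removes: before 4.7, a coordinate returned against the downscaled image had to be mapped back to original pixels. A sketch of that old correction step; the function name and numbers are illustrative:

```python
def to_original_coords(x: int, y: int, orig_long_edge: int,
                       model_long_edge: int = 1568) -> tuple[int, int]:
    """Pre-4.7: rescale model-space coordinates to the original image."""
    scale = max(1.0, orig_long_edge / model_long_edge)
    return round(x * scale), round(y * scale)

# A click at (847, 312) against the downscaled view of a 2576px-wide shot
# lands around (1392, 513) in the original. On 4.7, the returned
# coordinates already are original-image pixels, so no correction step.
print(to_original_coords(847, 312, 2576))
```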

Claude Design, which Anthropic shipped the morning after 4.7, generates prototypes, slides, and UI mockups from uploaded images, documents, and codebases. Its web capture tool grabs UI elements from live sites so generated prototypes match the real product. It applies a team's design system by reading design files and code. That workload needs pixel-level typography fidelity: the difference between 14px and 13px body text, kerning, grid alignment. A 1568 downscale does not preserve those on a 4K design reference.

Document analysis covers dense PDFs, multi-column reports, technical drawings with annotation layers, and charts with small labels. Anthropic's OfficeQA Pro score jumping from 57.1% to 80.6% suggests the resolution boost is doing real work on paper-style documents.

None of those three workloads is "sending a screenshot to a coding agent." And that's the practical split.

What to do

If you're building computer-use agents, using Claude Design, or doing document analysis on dense PDFs — the 2576px ceiling is a real gain. Pay the bill.

If you're sending dashboards, code, terminal output, or web pages to a coding agent for Q&A or summarization — downscale to 1568 long edge first. You'll pay roughly a third of the input tokens per image, and across every OCR test I ran, the model extracted the same content either way.

In Python:

from PIL import Image

img = Image.open("screenshot.png")
img.thumbnail((1568, 1568), Image.LANCZOS)
img.save("screenshot-1568.png")

Session replay on 8,373 of my own ~/.claude/projects/ sessions says 3% of eligible coding sessions include images at all. Most Claude Code users aren't feeling this yet. It lands immediately on anyone building visual workflows on the API.

Caveats

The capability tests cover OCR — reading text off images. They don't cover design-fidelity tasks: kerning, pixel-grid alignment, color-accuracy diffs, sub-grid iconography. Those are where 2576px plausibly pays off, and the post doesn't test them.

The public (w × h) / 750 formula holds at 1568px (drift under 4%) but breaks at 2576px, overshooting by up to 36% on near-square images. A 4:1 aspect-ratio image at matched pixel count bills 20% fewer tokens than a square one. Numbers in the cost tables are measured, not predicted.

Token counts come from count_tokens, not billed inference. Capability probes are billed inference at default settings. Scripts and data are in the repo.

The short version

Claude 4.6 downscaled every image above 1.15MP to fit a ~1,560-token budget. Claude 4.7 raises that ceiling and charges for the pixels you send — roughly 3x more per image at max resolution. The extra pixels matter for computer-use agents navigating UIs by screenshot, for Claude Design reading design references, and for document analysis on dense PDFs. They don't matter for OCR on dashboards, code, or terminal output — 1568 is already plenty.

Figure out which workload you're in, and send images at the resolution that matches.
