When people talk about Claude Code, they usually talk about the model — how smart it is, which version they're using, whether Opus is worth the cost over Sonnet.

But Claude the model just generates text. It can't read your files, run your tests, or edit your code. The thing that makes Claude Code useful is the harness: the software that wraps around the model, gives it tools, and runs whatever it asks for.

The model decides what to do. The harness makes it happen. Tool use is the protocol between them.

Every file read, every code edit, every grep search in Claude Code is a tool call. The product is a while loop around the tool use API. There's real complexity around that loop (permissions, context compaction, the memory system, plan mode), but the loop is what everything else sits on.

Once you understand tool use, a lot of Claude Code's behavior starts making sense. How it navigates a codebase it's never seen. Why a simple question costs four API calls and a complex one costs thirty. Why it outperforms RAG-based approaches without pre-indexing anything. In each case, the model picks the right tools in the right order, and the harness runs them.

I ran four experiments and traced live Claude Code sessions to see the mechanics, the token costs, and how different models use tools differently.

What is a tool?

On its own, an LLM can't do anything. It can't open files, run commands, or search your code. A tool gives it a way to act. You define a function (like "read this file" or "run this shell command"), tell the model it exists, and the model can ask to use it.

The key word is "ask." The model never executes anything directly. It says "I'd like to read src/auth.py" and your code decides whether to allow it, runs the read, and sends the contents back. The model sees the result and decides what to do next.

A script follows a fixed sequence. Tool use lets the model decide which tools to call, in what order, based on what it finds. The model drives the process.

How tool use works (the API)

Quick primer if you need it: a "token" is roughly a word or piece of a word. Claude's "context window" is its working memory - 200K tokens, or roughly 150K words. Everything the model needs to think about (your conversation, tool definitions, file contents) has to fit in that window. When it fills up, things get dropped. Tokens also drive cost. You pay per token in and out.

When you send a message to Claude through the API, you can include a list of tools the model is allowed to call. Each tool has a name, a description, and the parameters it accepts.
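
Concretely, a tool definition in the Messages API is just a name, a description, and a JSON Schema for the inputs. A minimal calculator tool (an illustrative definition, not Claude Code's actual one):

```python
# A minimal tool definition in the Anthropic Messages API format.
# The model sees the name, description, and input schema; the
# implementation lives entirely on your side of the wire.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "An arithmetic expression, e.g. '1984135 * 9343116'",
            }
        },
        "required": ["expression"],
    },
}
```

You pass a list of these in the `tools` parameter of the request; nothing about the implementation ever reaches the model.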

The model doesn't execute anything. It returns a structured request saying "I want to call this tool with these arguments." Your code executes the tool, sends the result back, and the model continues.

Here's the full cycle:

The tool use loop - send tools, model calls one, you execute, send result back, repeat until end_turn

1. You send: tools[grep] + user message
2. Claude responds: stop_reason="tool_use", content=[{type: "tool_use", name: "grep", input: {pattern: "auth"}}]
3. You execute: grep("auth") → "src/auth.py:7: def authenticate(user)..."
4. You send back: {type: "tool_result", tool_use_id: "...", content: "src/auth.py:7: ..."}
5. Claude responds: either another tool_use (back to step 2) or stop_reason="end_turn" (done)

That's it. Steps 2-4 repeat in a loop until the model decides it has enough information and returns end_turn. The model picks which tool to call, what arguments to pass, and when to stop. Your code just executes what it asks for and feeds results back.
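
The whole loop fits on one screen. This is a sketch, not production code: `call_model` is a stub standing in for the real API call (`client.messages.create`), but the response shapes mirror the Messages API, so the loop body is the same one you'd write against the real thing.

```python
# A minimal agentic loop. `call_model` is a stub standing in for the
# Anthropic API; it answers the first turn with a tool_use request and
# the second with end_turn, which is enough to exercise the loop.

TOOLS = {"grep": lambda pattern: f"src/auth.py:7: def authenticate(user)  # matched {pattern!r}"}

def call_model(messages):
    # Stub: once a tool_result (a user message with list content) exists,
    # the "model" answers and stops; before that, it asks for a grep.
    seen_result = any(m["role"] == "user" and isinstance(m["content"], list)
                      for m in messages)
    if not seen_result:
        return {"stop_reason": "tool_use",
                "content": [{"type": "tool_use", "id": "toolu_1",
                             "name": "grep", "input": {"pattern": "auth"}}]}
    return {"stop_reason": "end_turn",
            "content": [{"type": "text", "text": "Auth lives in src/auth.py."}]}

def agent_loop(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":      # end_turn: we're done
            return response["content"][0]["text"]
        results = []                                   # execute each requested tool
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = TOOLS[block["name"]](**block["input"])
                results.append({"type": "tool_result",
                                "tool_use_id": block["id"], "content": output})
        messages.append({"role": "user", "content": results})

print(agent_loop("Where is authentication handled?"))
```

Swap the stub for a real `messages.create` call with a `tools` list and this is the skeleton of every agent harness.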

Claude Code's entire architecture is this loop with 23 tools.

The experiments

I wrote a script that makes real API calls and shows you tool use behavior step by step. Then I traced live Claude Code sessions with a PostToolUse hook to see what the product actually does.

Experiment 1: seeing the loop

One tool (a calculator). One question: "What is 1,984,135 times 9,343,116?"

Step 1: Send request with tool definition + user message

Step 2: Claude responds with stop_reason="tool_use"
  Tool call: calculator({"expression": "1984135 * 9343116"})

Step 3: We execute the tool locally
  Result: 18538003464660

Step 4: Send tool_result back to Claude

Step 5: Claude responds with stop_reason="end_turn"
  "The result of 1,984,135 x 9,343,116 is 18,538,003,464,660."

Token usage:
  Request 1: 621 in / 60 out
  Request 2: 698 in / 33 out
  Total:     1,319 in / 93 out
  API calls: 2

Two API calls. The first to get the tool call, the second to get the answer. The model chose to use the calculator rather than attempt the multiplication itself (which it would likely get wrong).

This is the smallest possible agentic loop. One tool, one call, done.

Experiment 2: building a mini Claude Code

Three tools: list_files, search_code, and read_file. A simulated codebase with four Python files and one planted bug — a login function that compares a plaintext password directly against a hash. I asked the model: "Find the authentication bug."

Turn  1: list_files(".")        → src/
Turn  2: list_files("src")      → auth.py, database.py, api.py, utils.py
Turn  3: read_file("auth.py")   → 25 lines
         read_file("database.py") → 26 lines
         read_file("api.py")    → 22 lines
         read_file("utils.py")  → 20 lines
Turn  4: search_code("encode()") → auth.py:13
Turn  5: search_code("hashed_password") → 6 matches
...
Turn 13: [end_turn] → "The comparison operator is backwards in login()"

Summary:
  Tool calls: 15
  API turns:  13
  Tokens:     35,971 in / 2,458 out
  Cost:       $0.14

Thirteen API round-trips. Fifteen tool calls. The model listed files, read all four, then ran targeted searches to confirm the bug. It followed the same pattern Claude Code uses: orient (list files) → read (get context) → search (narrow down) → report.

Look at turn 3: the model called read_file on all four files in parallel. When the model returns multiple tool_use blocks in a single response, you execute them all and send all the results back together.
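
Handling that case is mechanical: collect every tool_use block from the response, run them (concurrently is safe here, since these tools are read-only), and package all the results into one user message. A sketch, assuming the Messages API response shapes:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(response_content, tools):
    """Run all tool_use blocks from one assistant response concurrently
    (safe when the tools are read-only) and package the results as a
    single user message, preserving the original order."""
    calls = [b for b in response_content if b["type"] == "tool_use"]
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so results line up with calls
        outputs = list(pool.map(lambda b: tools[b["name"]](**b["input"]), calls))
    return {"role": "user",
            "content": [{"type": "tool_result",
                         "tool_use_id": b["id"],
                         "content": out}
                        for b, out in zip(calls, outputs)]}
```

Each `tool_result` carries the `tool_use_id` of the request it answers, which is how the model matches results back to the calls it made.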

$0.14 to find a bug in a 4-file toy codebase. This is a controlled experiment to show the mechanics — real codebases are larger and costs scale with the number of files read and searches run.

Experiment 3: model comparison

Same tools. Same codebase. Same bug. Three models: Claude Haiku 3.5, Claude Sonnet 4, and Claude Opus 4. I wanted to see how the approach changes with the model.

Three models, same bug — Haiku searches first, Sonnet reads first, Opus does both multiple times

Model      Tool Calls    API Turns    Input Tokens    Cost       Time
──────────────────────────────────────────────────────────────────────
Haiku      13            8            15,883          $0.02      16s
Sonnet     8             5            7,402           $0.04      28s
Opus       38            22           116,645         $0.73      115s

(These experiments send isolated API calls with no shared conversation history, so prompt caching doesn't kick in. In a real Claude Code session, the system prompt and tool definitions stay the same across turns and get cached automatically, charged at 10% of the normal rate. Real sessions cost significantly less. The numbers here are for comparing model behavior, not estimating your bill.)

Sonnet made 8 tool calls across 5 turns. Read all four files, ran two targeted searches, done. The most efficient path to the answer.

Haiku made 13 tool calls across 8 turns. Searched first (login, password, authenticate), then read the files. More searching, less reading. It costs $0.02 — a penny per bug.

Opus made 38 tool calls across 22 turns. It read every file, searched for password patterns, then re-read files — possibly to verify against new hypotheses it had formed. It searched for schemas, test files, configuration, migration scripts — none of which existed. Then it searched again. It found the same bug the others did, but it spent $0.73 and two minutes being thorough about it.

This matches what the Claude Code team has said publicly: Opus is more thorough. It checks edge cases, looks for secondary bugs, and validates its findings more aggressively. Whether that's worth 37x Haiku's cost depends on what you're doing.

The tool call sequences show the difference:

Haiku:  list → list → search×3 → read×4 → search → end
Sonnet: list → list → read×4 → search×2 → end
Opus:   list → list → read×4 → search×5 → list → search×3 →
        read → read → read → search×2 → search×3 → read →
        search×2 → search → search → search → list → list →
        list → end

Sonnet reads first, then searches to confirm. Haiku searches first, then reads what it finds. Opus does both, multiple times.

Experiment 4: the token tax

Every tool you define costs tokens — it takes up space in Claude's working memory. I measured the overhead by sending the same "Hello" message to Claude Sonnet 4 with 0, 1, 5, 10, and 20 tools defined (each with a simple schema of comparable complexity).

Token overhead per tool — ~150 tokens each at scale, plus MCP server impact

Tools    Input Tokens    Overhead
──────────────────────────────────
0        10              baseline
1        633             +623
5        1,137           +225/tool
10       1,767           +176/tool
20       3,027           +151/tool

Claude Code defines 23 built-in tools. That's roughly 3,000 tokens of overhead on every API call, before you type a single character.

Now add MCP servers (plugins that give Claude Code extra capabilities). Each one adds its own tool definitions. Thariq Shihipar estimated that a typical setup with 5 MCP servers can add ~55K tokens — that's 28% of the 200K context window consumed by tool definitions alone. (The actual overhead depends on which servers and how many tools each exposes.)

This is why Thariq advocates "Bash Is All You Need" for many integrations. The gh CLI for GitHub operations: 0 tokens in tool definitions, because the model already knows how to use gh. A GitHub MCP server: ~26K tokens. Same capability, 26K difference.

Claude Code mitigates this at scale with Tool Search — when MCP tools exceed 10% of context, it loads lightweight stubs and lets the model discover full schemas through a search tool. Overhead drops from 77K to 8.7K tokens. 85% reduction.
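
The mechanics can be sketched roughly like this (my reconstruction of the idea, not Claude Code's implementation): ship name-and-summary stubs to the model, keep the full schemas on your side, and expose one search tool that returns a full definition on demand.

```python
def make_stubs(full_tools):
    """Replace full tool definitions with lightweight stubs, plus a
    search function the harness can expose as a tool so the model
    fetches full schemas only when it needs them. (A sketch of the
    idea behind Tool Search, not Claude Code's actual code.)"""
    stubs = [{"name": t["name"],
              # keep just the first sentence of the description
              "description": t["description"].split(".")[0] + ".",
              "input_schema": {"type": "object"}}   # full schema omitted
             for t in full_tools]
    registry = {t["name"]: t for t in full_tools}

    def tool_search(query):
        # Return full definitions for any tool whose name matches the query.
        return [registry[name] for name in registry if query in name]

    return stubs, tool_search
```

The saving comes from the asymmetry: every stub costs a few tokens on every call, while a full schema costs hundreds but only when the model actually asks for it.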

Tracing real sessions

I wanted to see what Claude Code actually does with these tools. Not the API experiments — the real product, doing real tasks.

Claude Code has hooks — lifecycle events you can tap into. I wrote a PostToolUse hook that logs every tool call to a JSONL file: which tool, what arguments, when.
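
The hook itself is only a few lines. A minimal Python equivalent (Claude Code pipes each event to the hook's stdin as JSON; I'm relying on the documented tool_name and tool_input fields):

```python
#!/usr/bin/env python3
# A minimal PostToolUse hook: log every tool call as one JSONL line.
# Claude Code pipes the event to stdin as JSON; tool_name and tool_input
# are the fields the hooks documentation describes.
import json
import sys  # used by the stdin entry point below
import time

def log_event(event, path="/tmp/claude-tool-trace.jsonl"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "tool": event.get("tool_name"),
        "input_summary": str(event.get("tool_input"))[:200],  # truncate big inputs
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Entry point when Claude Code invokes the hook:
#   log_event(json.load(sys.stdin))
```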

I ran four tasks through claude -p (headless mode) with the hook active.

Task: "Create a Python script that prints fibonacci numbers, then run it."

1. Write   → /tmp/fibonacci.py
2. Bash    → python3 /tmp/fibonacci.py

Two tool calls. Write the file, run it. No searching, no reading. The model knows what to write.

Task: "Read scripts/tool-use-experiments.py, find the auth bug, fix it."

1. Read    → scripts/tool-use-experiments.py
2. Edit    → scripts/tool-use-experiments.py

Two tool calls. Read the file, edit the bug. When the model knows exactly which file to look at, it skips search entirely.

Task: "How many commits in the last 7 days? Who are the authors?"

1. Bash    → git log --since="7 days ago" --oneline --format="%h %an %s" &&
             git log --since="7 days ago" --format="%an" | sort | uniq -c | sort -rn

One tool call. A single chained bash command that answers both questions. The model composed two git commands with pipes rather than making separate calls.

Task: "Find all Python files that import anthropic and tell me which models they use."

1. Grep    → pattern: "import anthropic"
2. Grep    → pattern: "model="
3. Grep    → pattern: "model=" (different file filter)
4. Grep    → pattern: "^MODEL\s*="
5. Grep    → pattern: "^MODEL\s*=" (different file filter)

Five searches. The model started broad (import anthropic), then narrowed to model= parameters, then tried case variations. Each Grep result refined the next query.

Compare these four traces:

Task                 Pattern                             Tool Calls
───────────────────────────────────────────────────────────────────
Create + run         Write → Bash                        2
Fix known bug        Read → Edit                         2
Git analysis         Bash (chained)                      1
Cross-file search    Grep → Grep → Grep → Grep → Grep    5

The model adapts its strategy to the task. When it knows where to look, it goes straight there. When it doesn't, it searches iteratively. When the answer requires execution, it composes shell commands. No fixed playbook.

Why this beat RAG for code

Before tool use existed, if you wanted a model to answer questions about your code, you built a RAG pipeline. Split your code into chunks, create embeddings (numerical representations of meaning), store them in a vector database, retrieve relevant chunks for each question, stuff them into the prompt.

Claude Code threw that away. It gives the model grep and says "find it yourself."

Grep alone doesn't beat RAG. Grep inside a loop does, because the model can refine queries, read full files, cross-reference imports, and hold 200K tokens of context at once.

The simplest reason it works: there's no infrastructure. No embedding pipeline, no vector database, no sync jobs, no re-indexing when files change.

It also reads actual files, not chunks. RAG chunks can be stale, out of context, or split across semantic boundaries. Tool use reads the file as it exists right now on disk, so there's never stale data. In a fast-moving codebase, that matters.

The bigger reason is that tool use is adaptive. My traces show the pattern: the model searches, reads a file, notices an import, searches for that import, reads the imported file. RAG gives you one-shot retrieval. Tool use gives you an iterative conversation with the filesystem.

Context windows made this possible. When context was 4K tokens, RAG was the only way to give a model access to a large codebase. At 200K, the model can hold dozens of full files simultaneously. My Experiment 2 read all four files in a single turn.

And the model understands code well enough to write its own search queries. It can write regex patterns for function definitions, class hierarchies, import chains, because it knows the language grammar.

The honest tradeoff

Tool use burns more tokens per exploration. Each search result goes into context. A Grep returning 50 matches across 20 files consumes 5-10K tokens easily. My Experiment 3 shows Opus burning 116K tokens to find a single bug.

Agentic search also has real failure modes. In monorepos with 100K+ files, grep returns too many results and context fills up before the model can narrow down effectively. When functions have non-obvious names (xz_process_data instead of authenticate), text search won't find them — but embeddings that capture semantic similarity might. Cross-language codebases add another layer of difficulty.

The hybrid approaches are already showing results. Milvus built "Claude Context" with vector search alongside Claude Code and claims 40% token reduction. Cursor's research shows combining grep with semantic search improves accuracy by 12.5% over grep alone.

But Claude Code shows you don't need the complexity for most codebases. And there's a deeper bet: as models get smarter, agentic search gets better for free. The model learns to write better grep queries without any infrastructure changes on your end. RAG pipelines also benefit from better models (better embeddings, better chunking), but they still require the infrastructure. Agentic search improves with zero maintenance.

Programmatic tool calling: the loop inside the loop

Anthropic is already working on fixing the token cost problem.

Programmatic tool calling (PTC) changes the architecture I described above. Instead of every tool result returning to Claude's context window, Claude writes code that orchestrates multiple tool calls inside a container. The intermediate results stay in the code. Only the final output reaches Claude.

Here's the difference. In standard tool use, three search queries means three round-trips through context:

Turn 1: Claude → web_search("query A") → result A added to context
Turn 2: Claude → web_search("query B") → result B added to context
Turn 3: Claude → web_search("query C") → result C added to context
Turn 4: Claude reasons over all three results in context

With PTC, Claude writes a program that runs all three searches, filters the results, and returns only what matters:

Turn 1: Claude writes code →
          result_a = await web_search("query A")
          result_b = await web_search("query B")
          result_c = await web_search("query C")
          return summarize(result_a, result_b, result_c)
        → only the summary reaches Claude's context

The tool calls still cross the sandbox boundary. Your tool handlers still run on your side — they can inspect, reject, log, or queue for human approval. But the raw results don't bloat the context window.

The while loop is still there. But there's now a second loop inside it that doesn't pay the context tax.

Claude Code's 23 tools

The full set, verified by looking at the tool definitions in a live session:

Category           Tools                                                 Count
──────────────────────────────────────────────────────────────────────────────
File ops           Read, Edit, Write, NotebookEdit                       4
Search             Glob, Grep                                            2
Execution          Bash                                                  1
Web                WebFetch, WebSearch                                   2
Orchestration      Task, TaskOutput                                      2
Task management    TaskCreate, TaskGet, TaskList, TaskUpdate, TaskStop   5
Teams              TeamCreate, TeamDelete, SendMessage                   3
User interaction   AskUserQuestion                                       1
Mode control       EnterPlanMode, ExitPlanMode                           2
Extensions         Skill                                                 1

7 of these — Read, Edit, Write, Glob, Grep, Bash, Task — handle roughly 95% of coding work. The rest is orchestration.

Vikash Rungta's reverse engineering nailed it: "~50 lines of loop logic + a shell gives you infinite surface area. Don't build 100 tools."

Bash alone is worth every other integration tool combined. The model knows gh, curl, docker, kubectl, jq, git. Each one adds zero tokens to the tool definitions because it's just a bash command. A GitHub MCP server adds 26K tokens to give you the same capabilities.

Why these things are tools and not bash commands

If bash can do almost anything, why have 23 tools at all? Thariq Shihipar laid out the reasoning: promoting an action to a dedicated tool makes sense when you need a control surface around it.

AskUserQuestion is a tool because the harness needs to catch it and render a modal. A bash command can't trigger a UI element. The Edit tool runs a staleness check, verifying the file hasn't changed since the model last read it. Bash sed would just overwrite whatever's there. Read-only tools like Grep and Glob can run in parallel safely because the harness knows they're read-only from the tool declaration. It can't infer that from an arbitrary bash command.
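
The staleness check is easy to sketch: record a content hash at Read time and refuse the Edit if the file no longer matches. (My reconstruction of the idea, not Claude Code's actual code.)

```python
import hashlib
from pathlib import Path

_read_hashes = {}  # path -> content hash recorded when the model last read it

def _digest(text):
    return hashlib.sha256(text.encode()).hexdigest()

def read_file(path):
    """Read tool: return the file and remember what it looked like."""
    text = Path(path).read_text()
    _read_hashes[path] = _digest(text)
    return text

def edit_file(path, old, new):
    """Edit tool: refuse to touch a file that changed since the last Read."""
    text = Path(path).read_text()
    if _read_hashes.get(path) != _digest(text):
        raise RuntimeError(f"{path} changed since it was last read; re-read it first")
    updated = text.replace(old, new, 1)
    Path(path).write_text(updated)
    _read_hashes[path] = _digest(updated)  # stay fresh for follow-up edits
```

`sed` has no way to express "only if nothing changed since I looked"; a dedicated Edit tool can make that check before every write.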

There's also observability: when every file read is a Read tool call, you can measure latency, count usage, and log what the model looked at. My trace hook works because tool calls are structured events. You can't hook into cat the same way. And tools that can be undone (like Edit, which preserves the old content) can be approved more freely, because the harness groups actions by risk level based on which tool was called.

Bash is the default. You promote something to a tool when you need the harness to see it, control it, or render it. Everything else stays in bash.

Try it yourself

Trace your own Claude Code sessions:

# 1. Make the hook executable
chmod +x scripts/tool-use-trace-hook.sh

# 2. Add to .claude/settings.local.json under "hooks":
#    "PostToolUse": [{
#      "matcher": "",
#      "hooks": [{ "type": "command", "command": "/path/to/tool-use-trace-hook.sh" }]
#    }]

# 3. Use Claude Code normally, then view the trace:
jq -r '[.timestamp, .tool, .input_summary] | @tsv' /tmp/claude-tool-trace.jsonl

# 4. Count tool usage:
jq -r '.tool' /tmp/claude-tool-trace.jsonl | sort | uniq -c | sort -rn

Run the API experiments:

export ANTHROPIC_API_KEY=sk-ant-...

# See the basic loop
python scripts/tool-use-experiments.py 1

# Watch an agent search a codebase
python scripts/tool-use-experiments.py 2

# Compare Haiku vs Sonnet vs Opus (costs ~$0.80)
python scripts/tool-use-experiments.py 3

# Measure tool definition overhead (no API calls)
python scripts/tool-use-experiments.py 4

The whole thing is a while loop. The model calls a tool, you execute it, feed the result back, repeat until it says end_turn. Everything else in Claude Code, from the permission system to context compaction, is built around that loop.

The numbers worth remembering: each tool definition costs ~150 tokens, so 5 MCP servers can eat 28% of your context window before you type anything. Sonnet finds bugs in 5 turns for $0.04, Opus finds the same bugs in 22 turns for $0.73, and Haiku does it in 8 turns for $0.02. For most codebases, grep inside this loop beats an embedding pipeline because it's simpler, always current, and improves as models improve. For very large codebases (50K+ files), hybrid approaches that add semantic search are worth considering.

Programmatic tool calling is the next step. Claude orchestrates multiple tool calls in code, and only the final result hits context. Early web search benchmarks show 11% better accuracy with 24% fewer tokens. How much that helps for code tasks is still an open question.
