June 6, 2026 · Updated June 12, 2026 · 5 min read

Token optimization for AI coding agents: the complete guide

Token optimization for AI coding agents starts from one measured fact: agents spend most of their tokens reading your code, not writing it. Published research puts read-type operations at 76.1% of total agent tokens; the working range across studies is 60–80%. Every optimization that ignores this (cheaper models, trimmed prompts, cached prefixes) optimizes the minority of the bill. This guide covers the full mechanics: where tokens go, why costs grow quadratically, what caching does and doesn't fix, and the levers ranked by impact.

Why do AI coding agents burn so many tokens?

Because an agent is a loop, not a chat. Each task spawns dozens to hundreds of model roundtrips — search, read, think, edit, test, repeat — and the model is stateless, so every roundtrip re-sends the entire accumulated context. Tokens scale with loop length × context size, and both grow as the task proceeds.

A chat question costs what it costs once. An agent task at step 50 is carrying everything from steps 1–49. That difference, the loop, is why a $0.02 question and an $8 agent task use the same API. The four meters that price every loop (roundtrips, model, cache, context size) are broken down on our LLM API optimization page, with an interactive model you can set to your own workload.

Where do the tokens actually go?

To navigation and reading. The SWE-Pruner measurements on real agent workloads: file and directory inspection 76.1% of tokens, code execution 12.1%, editing 11.8%. The agent's main activity is rediscovering your codebase — list, open, grep, re-open.

| Where agent tokens go | Share |
| --- | --- |
| Reading & navigating code | ~76% |
| Executing / testing | ~12% |
| Editing code | ~12% |

This is the stat that should set your optimization priorities. It's also the one we benchmarked: serving the same information needs from a structural map of the code instead of raw file reads cut navigation tokens 86% and read tokens 90%, fidelity-gated with a real tokenizer.
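
To see why the share matters more than the size of any single cut, here is a minimal sketch of the arithmetic. It assumes the lower of the two benchmark reductions (86%) applies uniformly to the whole 76.1% reading share, which the cited figures do not establish for your workload; the function name is illustrative.

```python
# Token shares from the measurements quoted above (fractions of total agent tokens).
SHARES = {"reading": 0.761, "executing": 0.121, "editing": 0.118}


def overall_saving(share: float, reduction: float) -> float:
    """Fraction of the total token bill removed when one category shrinks by `reduction`."""
    return share * reduction


# Assumption: an 86% cut applies uniformly across the whole reading share.
reading_saving = overall_saving(SHARES["reading"], 0.86)

# Ceiling for editing: even eliminating it entirely removes only its own share.
editing_ceiling = overall_saving(SHARES["editing"], 1.00)

print(f"86% cut to reading removes {reading_saving:.1%} of total tokens")
print(f"100% cut to editing removes {editing_ceiling:.1%} of total tokens")
```

Under these assumptions, a partial cut to the largest category beats a perfect cut to either of the small ones by a wide margin.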

Why does cost grow quadratically over a session?

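Because stateless APIs re-send the full conversation history on every call, cumulative input tokens grow with the square of session length. A 20-step loop at 1,000 tokens per step bills about 210,000 cumulative input tokens, not 20,000: step k re-sends all k steps so far, so the total is 1,000 × (1 + 2 + … + 20). That is a 10.5× multiplier over the tokens actually added. At 50 steps the multiplier is 25.5×, and it passes 30× at 60 steps.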
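
The arithmetic above can be sketched in a few lines. This is a simplified model, assuming every step adds the same number of tokens and nothing is compacted or cached; the function names are illustrative.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens billed when step k re-sends all k steps so far."""
    # Step k carries k * tokens_per_step, so the total is a triangular number.
    return tokens_per_step * steps * (steps + 1) // 2


def multiplier(steps: int) -> float:
    """Billed input tokens divided by the tokens actually added to the session."""
    return (steps + 1) / 2


for n in (20, 50):
    print(n, cumulative_input_tokens(n, 1_000), multiplier(n))
```

The closed form is `tokens_per_step × n(n + 1) / 2`, which is where the quadratic growth comes from.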

This is the mechanism behind every "why is my bill so high" thread. The bill isn't high because the agent did a lot — it's high because everything the agent did travels with every subsequent request. It's also why the cheapest intervention is often the dumbest-sounding one: start a fresh session. And it's why every full file the agent reads taxes not just that step but every step after it.
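
A rough sketch of why a fresh session helps, using the same simplified model of 1,000 tokens added per step. It assumes a new session starts with no carried-over context, which overstates the saving when the agent has to re-read files to get its bearings; the function names are illustrative.

```python
def session_tokens(steps: int, tokens_per_step: int = 1_000) -> int:
    """Cumulative input tokens for one session that re-sends its history each step."""
    return tokens_per_step * steps * (steps + 1) // 2


def split_tokens(total_steps: int, sessions: int, tokens_per_step: int = 1_000) -> int:
    """Same number of steps divided into near-equal fresh sessions."""
    base, extra = divmod(total_steps, sessions)
    lengths = [base + 1] * extra + [base] * (sessions - extra)
    return sum(session_tokens(n, tokens_per_step) for n in lengths)


one_long = split_tokens(40, 1)
two_fresh = split_tokens(40, 2)
print(one_long, two_fresh)
```

In this model, 40 steps in one session bill 820,000 cumulative input tokens, while the same 40 steps split into two fresh sessions bill 420,000.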

Does prompt caching fix it?

Partially — and it breaks exactly when coding agents work hardest. Cache reads are cheap (0.10× input on Anthropic, 50% off on OpenAI), but a cache hit requires an unchanged prefix — and coding agents change their context constantly.

Every file edit invalidates the cache for that file and everything after it in the prompt. Adding or changing a tool mid-session invalidates the entire prefix (arXiv: "Don't Break the Cache"). So caching rescues the stable head of the prompt (system prompt, instructions) and quietly fails on the moving tail — which is where agent work happens. Caching is a real lever; it is not the structural fix. The structural fix is sending less context in the first place — see prompt caching vs context trimming for when each applies.
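
The prefix rule can be illustrated with a toy model. This is a sketch, not any provider's billing logic: it treats the prompt as an ordered list of segments, assumes tools come first in the prompt, uses the 0.10× cache-read rate quoted above, and ignores cache-write surcharges and expiry. The segment names and token counts are made up.

```python
CACHE_READ_RATE = 0.10  # cached tokens billed at 0.10x the input price (figure quoted above)

Segment = tuple[str, int]  # (content identifier, token count); hypothetical representation


def cached_prefix_tokens(previous: list[Segment], current: list[Segment]) -> int:
    """Tokens in the leading run of segments that are identical in both prompts."""
    total = 0
    for prev_seg, curr_seg in zip(previous, current):
        if prev_seg != curr_seg:
            break  # the first changed segment ends the cacheable prefix
        total += curr_seg[1]
    return total


def billed_input(previous: list[Segment], current: list[Segment]) -> float:
    """Input cost of `current`, in full-price token equivalents."""
    cached = cached_prefix_tokens(previous, current)
    fresh = sum(tokens for _, tokens in current) - cached
    return cached * CACHE_READ_RATE + fresh


base = [("tools-v1", 1_500), ("system-v1", 2_000), ("auth.py@rev1", 4_000), ("history", 6_000)]
appended = base + [("turn-2", 500)]                                 # nothing earlier changed
edited = base[:2] + [("auth.py@rev2", 4_000), ("history", 6_000), ("turn-2", 500)]
retooled = [("tools-v2", 1_600)] + base[1:] + [("turn-2", 500)]     # tool set changed

for name, prompt in (("appended", appended), ("edited", edited), ("retooled", retooled)):
    print(name, billed_input(base, prompt))
```

In this toy model the appended prompt pays almost nothing for its history, the edited file forfeits the cache for everything after it, and the changed tool set forfeits all of it.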

Should you just use a cheaper model?

Sometimes. The price spread is real (Haiku $1/$5, Sonnet $3/$15, Opus $5/$25 per million input/output tokens: a 5× spread from Haiku to Opus on input and on output alike, and a larger absolute gap on output, $20 versus $4 per million tokens). But cost per token is the wrong metric; cost per successful task is the right one. Frontier models finish complex multi-file tasks in ~60% of the loops smaller models need, and a cheap model that fails midway wastes every token it spent.

The honest framing: model choice is the widest price lever and an unreliable cost lever. Downgrade the model for tasks where failure is cheap and verifiable (formatting, boilerplate, simple edits); keep the frontier model where a failed attempt costs you the whole loop. Routing without a quality signal just relocates the spend.
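
A minimal sketch of the cost-per-successful-task comparison. The prices are the ones quoted above and the 30-versus-50 loop counts mirror the ~60% figure; the success rates, the per-loop token counts, and the assumption that failed attempts are retried independently are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    loops: int           # roundtrips per attempt (hypothetical)
    success_rate: float  # probability an attempt succeeds (hypothetical)


def attempt_cost(r: Route, input_per_loop: int = 20_000, output_per_loop: int = 500) -> float:
    """USD for one attempt, with flat per-loop token counts (a simplification)."""
    input_cost = r.loops * input_per_loop * r.input_price / 1_000_000
    output_cost = r.loops * output_per_loop * r.output_price / 1_000_000
    return input_cost + output_cost


def cost_per_success(r: Route) -> float:
    """Expected USD per completed task if failed attempts are retried independently."""
    return attempt_cost(r) / r.success_rate


cheap = Route("Haiku", 1.0, 5.0, loops=50, success_rate=0.4)
frontier = Route("Opus", 5.0, 25.0, loops=30, success_rate=0.9)

for route in (cheap, frontier):
    print(route.name, round(attempt_cost(route), 4), round(cost_per_success(route), 4))
```

With these made-up inputs the cheaper model still wins, but a 5× per-token gap shrinks to roughly 1.3× per successful task, and it reverses if the cheap model's success rate drops below about 30%.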

What does AI coding actually cost per developer?

The published range is wide because workloads are. Anthropic reports ~$6/developer/day average for Claude Code, with enterprise deployments at $150–250/month per developer. Heavy agentic users commonly run $400–1,500/month, with extremes past $4,000 in days. A 10-person team using agents heavily can clear $100K/year.

Org-level, 37% of enterprises now spend over $250K/year on LLM APIs and 72% expect bills to climb. The per-agent specifics differ — Claude Code's plans and limits, Cursor's credit pools, Copilot's 2026 move to usage-based AI Credits — but the direction is uniform: every major agent now bills, directly or indirectly, by tokens. Token efficiency stopped being a nice-to-have and became the price of the tool.

The playbook — and what standard advice misses

Ranked by what it attacks:

  1. Fix navigation (attacks the 76%). Give the agent a structural way to know the codebase — the function, not the file; the callers, not a grep dump. This is unerr's lever: −86% navigation / −90% read tokens, measured, served locally to every MCP-compatible agent.
  2. Keep loops short (attacks the quadratic). Fresh sessions per task; scoped context; compact before the auto-trigger.
  3. Cache what's stable (attacks the prefix). Stable system prompts and tool sets; don't churn tools mid-session.
  4. Route models with a quality gate (attacks the rate). Cheap models only where failure is cheap.
  5. Measure per task, not per token. Attribute spend by repo/workflow/team and find the structural leaks — the FAQ covers how unerr reports this at team level.

Standard advice stops at items 2–4 because those are things you can do with settings. The 76% needs tooling, which is why it stays unfixed on most teams, and why it's the first thing to check when estimating your own waste.
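
Item 5 in the list above can be sketched as a small aggregation. The record fields and values here are hypothetical, not a real export format from any agent or from unerr; the prices are the Sonnet figures quoted earlier.

```python
from collections import defaultdict

INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0  # USD per million tokens (Sonnet figures quoted earlier)

# Hypothetical usage records, one per task; field names are illustrative.
records = [
    {"task": "T-1", "repo": "api", "input_tokens": 400_000, "output_tokens": 8_000, "succeeded": True},
    {"task": "T-2", "repo": "api", "input_tokens": 900_000, "output_tokens": 12_000, "succeeded": False},
    {"task": "T-3", "repo": "web", "input_tokens": 200_000, "output_tokens": 4_000, "succeeded": True},
]


def record_cost(record: dict) -> float:
    """USD spent on one task."""
    return (record["input_tokens"] * INPUT_PRICE
            + record["output_tokens"] * OUTPUT_PRICE) / 1_000_000


def cost_per_successful_task(rows: list[dict], key: str) -> dict[str, float]:
    """Total spend per group divided by that group's successful tasks."""
    spend: dict[str, float] = defaultdict(float)
    wins: dict[str, int] = defaultdict(int)
    for row in rows:
        spend[row[key]] += record_cost(row)
        wins[row[key]] += int(row["succeeded"])
    # Groups with no successes get infinity so they sort to the top of a leak report.
    return {k: (spend[k] / wins[k] if wins[k] else float("inf")) for k in spend}


by_repo = cost_per_successful_task(records, "repo")
print(by_repo)
```

Grouping by `repo` here is arbitrary; the same function works for a workflow or team field, and the failed task's spend is charged to the group rather than hidden.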


Figures cited as of June 2026. Related guides: reduce Claude Code costs · Cursor pricing explained · the benchmark methodology · unerr pricing.
