June 6, 2026 · Updated June 12, 2026 · 3 min read

How to reduce LLM API costs: the four meters that price every call

To reduce LLM API costs, you first need to know what an LLM bill is made of. It is four meters, not one: how many roundtrips you make, which model serves them, whether the prefix hits cache, and how large the context has grown. Every product built on an LLM API pays on all four, they multiply rather than add, and most optimization advice tunes exactly one. This guide walks through the meters and the levers, with the published numbers for each.

What actually determines an LLM API bill?

Cost ≈ roundtrips × (input tokens × input rate + output tokens × output rate), with cached input discounted. Output is the expensive direction (4–8× the input rate across 2026 flagships), but input dominates volume because the full context is re-sent on every roundtrip.
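As a rough sketch, the formula fits in a few lines of Python. The function names, the 10,000-token and 500-token call sizes, the $3/$15 rates, and the default 0.10× cache-read multiplier are illustrative assumptions, not any provider's API; rates are in dollars per million tokens.

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate,
              cached_tokens=0, cache_read_multiplier=0.10):
    """Dollar cost of one roundtrip; rates are dollars per million tokens.

    cache_read_multiplier is an assumed discount applied to cached input.
    """
    fresh_tokens = input_tokens - cached_tokens
    billable_input = fresh_tokens + cached_tokens * cache_read_multiplier
    return (billable_input * input_rate + output_tokens * output_rate) / 1_000_000


def task_cost(roundtrips, **per_call):
    """Cost of a task made of identical roundtrips (a simplification)."""
    return roundtrips * call_cost(**per_call)


# Hypothetical call: 10,000 input tokens, 500 output tokens, $3/$15 per M.
one_call = call_cost(10_000, 500, input_rate=3.0, output_rate=15.0)
input_share = (10_000 * 3.0 / 1_000_000) / one_call
```

With these made-up numbers the call costs about $0.0375, and input accounts for about 80% of it even though the output rate is 5× higher.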

| Meter | What moves it | Typical range |
| --- | --- | --- |
| Roundtrips | Loop/agent design, retries | 1 → hundreds per task |
| Model rate | Tier choice | 5–25× spread (e.g. $1/$5 vs $25/$75 per M) |
| Cache state | Prefix stability | cached input ~0.10–0.5× fresh |
| Context size | What you put in the window | the multiplier on everything |

We keep a fuller, interactive version of this breakdown on the LLM API usage optimization page — including a cost model you can set to your own workload.

How much does prompt caching save?

A lot, when the prefix is stable: Anthropic bills cache reads at 0.10× the input rate (writes cost 1.25–2.0×), and OpenAI applies an automatic 50% discount on prompts past 1,024 tokens. For a chat product with a fixed system prompt, caching is nearly free money.

The fine print: a cache hit requires a byte-stable prefix, per model. Change anything early in the prompt — a tool definition, a document, an edited file — and everything after it re-bills fresh (arXiv: "Don't Break the Cache"). Workloads whose context changes as part of the work (agents, anything editing its own inputs) cache poorly exactly where they spend the most — the tradeoffs are unpacked in prompt caching vs context trimming.
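To see why a churning prefix hurts, here is a simplified model measured in units of one uncached send of the prefix. It assumes the 1.25× write and 0.10× read multipliers quoted above, assumes every changed prefix is billed as a fresh cache write, and ignores cache expiry and minimum cacheable lengths. The function names are hypothetical.

```python
def cached_prefix_units(calls, write_mult=1.25, read_mult=0.10):
    """Cost of sending one unchanged prefix `calls` times.

    The first call writes the cache; every later call reads it.
    """
    return write_mult + (calls - 1) * read_mult


def churned_prefix_units(calls, write_mult=1.25):
    """Cost when the prefix changes before every call.

    Each call pays the write premium and never gets a read.
    """
    return calls * write_mult


calls = 10
uncached = float(calls)                 # baseline: full price every time
stable = cached_prefix_units(calls)
churned = churned_prefix_units(calls)
```

Over ten calls under these assumptions, the stable prefix costs about 2.15 units against 10 uncached, while the prefix that changes every call costs 12.5, which is worse than not caching at all.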

Is switching to a cheaper model the answer?

It's the widest single lever — the tier spread runs 5–25× (Haiku $1/$5 · Sonnet $3/$15 · Opus $5/$25 per million tokens) — and the most commonly misapplied. Cost per token is the wrong metric: cost per successful task is the right one, and a cheap model that fails re-runs the whole job.

The honest version of model routing: downgrade where failure is cheap and verifiable, keep the frontier model where a failed attempt costs the entire loop. Routing without a quality signal relocates spend; it doesn't reduce it.
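A minimal sketch of the cost-per-successful-task arithmetic, assuming failures are detected, attempts are independent, and a failed attempt is retried on the same model until it succeeds. The dollar figures and success rates are hypothetical, chosen only to match a 5× price spread.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend per successful task under retry-until-success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Expected number of attempts is 1 / success_rate.
    return cost_per_attempt / success_rate


def breakeven_success_rate(cheap_cost, frontier_cost, frontier_success_rate):
    """Success rate the cheap model needs just to match the frontier model."""
    return frontier_success_rate * cheap_cost / frontier_cost


# Hypothetical: $1 per attempt at 15% success vs $5 per attempt at 90%.
cheap = cost_per_success(1.00, 0.15)
frontier = cost_per_success(5.00, 0.90)
needed = breakeven_success_rate(1.00, 5.00, 0.90)
```

Here the cheap model works out to roughly $6.67 per success against $5.56 for the frontier model, and it would need to clear an 18% success rate just to break even. Without a quality signal to measure that rate, the comparison cannot be made at all.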

Why do agentic workloads cost so much more than chat?

Because agents multiply the meters. A chat turn is one roundtrip; an agent task is dozens to hundreds, each carrying the full accumulated context, so cumulative input grows with the square of loop length. A 20-step loop at 1,000 tokens/step bills ~210,000 cumulative input tokens, not 20,000: step k re-sends k × 1,000 tokens, and 1,000 × (1 + 2 + … + 20) = 210,000.
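The loop arithmetic can be checked with a short sketch. It assumes every step adds the same number of tokens and that nothing is trimmed or cached; the function names are illustrative.

```python
def cumulative_input_tokens(steps, tokens_per_step):
    """Total input billed when each step re-sends all context so far."""
    return sum(step * tokens_per_step for step in range(1, steps + 1))


def cumulative_input_tokens_closed_form(steps, tokens_per_step):
    """Same total via the triangular-number formula n(n + 1) / 2."""
    return tokens_per_step * steps * (steps + 1) // 2


twenty_steps = cumulative_input_tokens(20, 1_000)
forty_steps = cumulative_input_tokens(40, 1_000)
```

The 20-step loop totals 210,000 tokens. Doubling it to 40 steps gives 820,000, roughly 3.9× as much, which is the quadratic growth in action.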

And for coding agents specifically, ~76% of those tokens are spent reading and navigating code — context that often re-enters the window session after session. That's why coding-agent bills ($400–1,500/month for heavy users; org LLM spend over $250K/year for 37% of enterprises) respond more to context engineering than to model choice. The agent-specific playbook is in token optimization for coding agents.

What's the right order of operations?

Attack the multiplier first, then the rates:

  1. Shrink what enters the context. Serve the slice, not the document; the function, not the file. For code, a structural map makes this exact — measured at −86% navigation / −90% read tokens, fidelity-gated.
  2. Shorten the loops. Fewer roundtrips beat cheaper ones; every step taxes every later step.
  3. Stabilize the prefix. Put the static material first and keep tools steady so caching actually fires.
  4. Route models with a quality gate. Last, because it's the only lever that can silently trade away output quality.
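Step 3 can be illustrated with a small sketch that compares how much of two consecutive prompts is byte-identical from the start. The segment contents and helper names are hypothetical, and real providers match on their own token or block boundaries, so treat the byte count as a proxy rather than a billing figure.

```python
def build_prompt(leading_parts, trailing_parts):
    """Join prompt segments in the given order, one per line."""
    return "\n".join(list(leading_parts) + list(trailing_parts))


def shared_prefix_bytes(previous, current):
    """Length in bytes of the prefix two prompts have in common."""
    count = 0
    for a, b in zip(previous.encode("utf-8"), current.encode("utf-8")):
        if a != b:
            break
        count += 1
    return count


# Hypothetical segments: the system prompt and tool list never change,
# while the timestamp changes on every request.
static = ["SYSTEM: you are a code reviewer", "TOOLS: read_file, grep"]
stable_first = [build_prompt(static, [f"now=10:0{i}"]) for i in (0, 1)]
volatile_first = [build_prompt([f"now=10:0{i}"], static) for i in (0, 1)]

stable_shared = shared_prefix_bytes(*stable_first)
volatile_shared = shared_prefix_bytes(*volatile_first)
```

With the static material first, the two prompts share everything except the final digit. With the timestamp first, they share only 8 bytes, and the identical system prompt and tool list after it can no longer count as a common prefix.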

One meter at a time leaves money on the table because they multiply: halving context while also halving roundtrips quarters the bill. The full mechanism — and how unerr applies it as one loop across the agents your team already runs — is on the optimization page; pricing is flat-rate, so the savings don't bill by usage.
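A quick check of the multiplication, assuming a flat model in which each lever scales the bill independently. The helper name and the 0.5 multipliers are illustrative.

```python
from math import prod


def remaining_fraction(lever_multipliers):
    """Fraction of the baseline bill left after applying every lever."""
    return prod(lever_multipliers)


context_only = remaining_fraction([0.5])    # halve the context
both = remaining_fraction([0.5, 0.5])       # also halve the roundtrips
```

One lever alone leaves half the bill; the two together leave a quarter.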


Rates cited as of June 2026 — provider pricing changes; check the linked docs. Related: token optimization for coding agents · reduce Claude Code costs · Cursor pricing explained · the benchmark.
