June 6, 2026 · Updated June 12, 2026 · 3 min read

Prompt caching vs context trimming: which one actually cuts your agent bill?

Prompt caching and context trimming are the two standard answers to an exploding LLM bill, and they're routinely presented as alternatives. They aren't — they optimize different parts of the prompt, they break in different ways, and on coding agents both leave the dominant cost untouched. Here's the honest comparison.

What does each one actually do?

Prompt caching discounts re-sent tokens. If the prefix of your request is byte-identical to that of a recent request, the provider bills the cached part at a reduced rate: 0.10× the input price on Anthropic, 50% off on OpenAI. You still send everything; the unchanged tokens just cost less.

Context trimming removes tokens before sending — summarizing history, dropping stale files, compacting the window. Nothing discounted; less sent.

| | Prompt caching | Context trimming |
|---|---|---|
| Targets | The stable part of the prompt | The stale part of the prompt |
| Mechanism | Discount on re-sent identical prefix | Don't send it at all |
| Best case | ~90% off the cached portion | 100% off the removed portion |
| Failure mode | Any prefix change re-bills everything after it | Removes something the task still needed |
| Cost of failure | Silent: bill quietly goes up | Loud: agent loses context, re-reads or errs |
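
The two mechanisms can be put side by side as billing arithmetic. What follows is a minimal sketch, not any provider's pricing logic: the function name `billed_input_tokens`, the token counts, and the flat 0.10 cached rate are all illustrative assumptions, and real bills vary by provider and model and may include cache-write charges that are not modeled here.

```python
def billed_input_tokens(total, cached=0, trimmed=0, cached_rate=0.10):
    """Input tokens billed for one request, in full-price-token equivalents.

    total       -- tokens the untrimmed prompt would contain
    cached      -- tokens served from the provider's prompt cache
    trimmed     -- tokens removed before sending
    cached_rate -- assumed price multiplier for cached tokens (hypothetical)
    """
    if min(total, cached, trimmed) < 0 or cached + trimmed > total:
        raise ValueError("cached + trimmed must fit inside total")
    fresh = total - cached - trimmed      # sent and billed at full price
    return fresh + cached * cached_rate   # cached tokens are sent but discounted


# Illustrative numbers only: a 100k-token prompt with a 20k stable head
# and 30k of stale history.
baseline = billed_input_tokens(100_000)
cache_only = billed_input_tokens(100_000, cached=20_000)
trim_only = billed_input_tokens(100_000, trimmed=30_000)
both = billed_input_tokens(100_000, cached=20_000, trimmed=30_000)

for label, cost in [("baseline", baseline), ("cache only", cache_only),
                    ("trim only", trim_only), ("both", both)]:
    print(f"{label:>10}: {cost:,.0f}")
```

In this toy case caching and trimming act on different tokens, so their savings add up rather than compete, which matches the "Targets" row of the table.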

Where does caching break?

On change, which is what coding agents do all day. A cache hit needs an unchanged prefix, and caches are kept per model: edit a file that sits early in the prompt and everything after it is billed fresh; add or modify a tool definition mid-session and the entire cached prefix is invalidated. Caching therefore rescues the static head (system prompt, instructions, rules files) and quietly fails on the moving tail: the files being edited and the growing history. That tail is the part of an agent's prompt that grows fastest.
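
The prefix rule is easy to see in a toy model. The sketch below treats a prompt as a list of segments and counts how many leading segments match the previous request; everything from the first mismatch onward counts as fresh. This is a simplification made for illustration: real providers match on the serialized request at their own granularity, and `rough_tokens` is a crude stand-in for a tokenizer, not a real one.

```python
def shared_prefix_len(previous, current):
    """Number of leading segments that are identical in both prompts."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


def cached_and_fresh(previous, current, count_tokens):
    """Split the current prompt's tokens into (cache-eligible, fresh)."""
    hit = shared_prefix_len(previous, current)
    cached = sum(count_tokens(seg) for seg in current[:hit])
    fresh = sum(count_tokens(seg) for seg in current[hit:])
    return cached, fresh


def rough_tokens(text):
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


turn_1 = [
    "system: you are a coding agent",
    "file a.py: def f(): return 1",
    "file b.py: def g(): return 2",
    "user: fix the bug in f",
]
# The agent edits a.py, which sits early in the prompt, then history grows.
turn_2 = [
    "system: you are a coding agent",
    "file a.py: def f(): return 3",
    "file b.py: def g(): return 2",
    "user: fix the bug in f",
    "assistant: patched f",
]

cached, fresh = cached_and_fresh(turn_1, turn_2, rough_tokens)
print(f"cached={cached} fresh={fresh}")
```

Note that `b.py` and the user message did not change between turns, yet they land in the fresh bucket because they sit after the edited file.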

The trap is its silence. Caching failures don't error; they just bill at full rate, which is why teams "with caching enabled" still see quadratic session costs — the growth lives exactly where the cache can't reach.
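
A back-of-the-envelope model shows why the growth is quadratic and why caching the head barely dents it. This is a sketch under stated assumptions: a fixed head, a tail that grows by the same amount every turn and is never served from cache (the worst case described above), and a flat head discount that ignores cache writes. The function name and all numbers are hypothetical.

```python
def session_billed_tokens(turns, head_tokens, growth_per_turn, head_rate=1.0):
    """Total input tokens billed over a session, in full-price equivalents.

    Each turn re-sends the head plus the whole tail so far. The tail is
    assumed to be billed at full price on every turn.
    """
    total = 0.0
    for turn in range(1, turns + 1):
        tail_tokens = turn * growth_per_turn           # tail grows linearly
        total += head_tokens * head_rate + tail_tokens
    return total


# Illustrative numbers only: 5k-token head, 2k tokens added per turn.
no_cache_50 = session_billed_tokens(50, 5_000, 2_000)
head_cached_50 = session_billed_tokens(50, 5_000, 2_000, head_rate=0.10)
no_cache_100 = session_billed_tokens(100, 5_000, 2_000)

print(f"50 turns, no caching:   {no_cache_50:,.0f}")
print(f"50 turns, head cached:  {head_cached_50:,.0f}")
print(f"100 turns, no caching:  {no_cache_100:,.0f}")
print(f"cost ratio 100 vs 50:   {no_cache_100 / no_cache_50:.2f}x")
```

Because the per-turn tail grows linearly, the summed cost grows with the square of the turn count, so doubling the session length in this model costs close to four times as much, and the head discount only touches the small linear term.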

Where does trimming break?

On fidelity. Trimming is a bet that the removed tokens won't be needed; lose the bet and the agent re-reads what was dropped (paying twice) or proceeds without it (worse). Summarized history keeps conclusions and drops reasoning — early decisions and constraints are the classic casualties, and compaction can fail outright on very long sessions right when the accumulated context was most valuable.

Trimming also doesn't compound with caching as cleanly as the slide-deck version suggests: trimming changes the prompt, which invalidates the cache. Aggressive trimming on every turn can produce a prompt that never caches at all.
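
The interaction can be simulated with the same kind of toy model. The sketch below compares an append-only history with a sliding window that drops the oldest entry every turn, and reports what fraction of each prompt matches the previous turn's prefix. The segment representation, `build_prompts`, and the window size are assumptions made for illustration, not a description of any particular agent or provider.

```python
def shared_prefix_len(previous, current):
    """Number of leading segments that are identical in both prompts."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


def build_prompts(turns, window=None):
    """One prompt per turn: a fixed head plus history.

    window=None keeps all history (append-only); an integer keeps only the
    most recent `window` entries, i.e. trims on every turn once it is full.
    """
    head = ["system", "rules"]
    history = []
    prompts = []
    for t in range(turns):
        history.append(f"turn-{t}")
        if window is not None:
            history = history[-window:]
        prompts.append(head + history)
    return prompts


def prefix_hit_ratio(prompts):
    """Share of segments, from turn 2 onward, that match the prior prefix."""
    hits = total = 0
    for prev, cur in zip(prompts, prompts[1:]):
        hits += shared_prefix_len(prev, cur)
        total += len(cur)
    return hits / total


append_only = build_prompts(6)
sliding = build_prompts(6, window=2)

print(f"append-only hit ratio: {prefix_hit_ratio(append_only):.2f}")
print(f"sliding-window ratio:  {prefix_hit_ratio(sliding):.2f}")
```

Once the window is full, every turn shifts the history by one entry, so the only part that still matches is the head; in a real session, where the history dwarfs the head, that leaves almost nothing cacheable.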

So which one should you use?

Both — on the parts they're good at — and neither as the primary fix. The working split: keep a small, stable head (system prompt, lean rules) and let caching cover it; trim deliberately at task boundaries (fresh session per task) rather than continuously, so the cache survives within a task; and attack the real driver separately.

The real driver, measured: ~76% of agent tokens are reads and navigation. Caching discounts re-reads; trimming evicts them after they were paid for. Only changing what the agent reads in the first place — the slice instead of the file, the answer instead of the grep dump — removes them, which is the approach our benchmark measures at −86%/−90% and the four-meter model puts in context. It's also cache-friendly: a stable map serving small slices churns the prompt far less than full-file reads do — try the difference on your own workload.


