Prompt caching vs context trimming: which one actually cuts your agent bill?

Prompt caching and context trimming are the two standard answers to an exploding LLM bill, and they're routinely presented as alternatives. They aren't — they optimize different parts of the prompt, they break in different ways, and on coding agents both leave the dominant cost untouched. Here's the honest comparison.

What does each one actually do?

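Prompt caching discounts re-sent tokens: if the prefix of your request is byte-identical to a recent one, the provider bills the matching part at a reduced rate. Anthropic charges cache reads at 0.10× the base input price (cache writes carry a surcharge), and OpenAI discounts cached input by 50% or more depending on the model. You still send everything; unchanged tokens just cost less.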

Context trimming removes tokens before sending — summarizing history, dropping stale files, compacting the window. Nothing discounted; less sent.

	Prompt caching	Context trimming
Targets	The stable part of the prompt	The stale part of the prompt
Mechanism	Discount on re-sent identical prefix	Don't send it at all
Best case	~90% off the cached portion	100% off the removed portion
Failure mode	Any prefix change re-bills everything after it	Removes something the task still needed
Cost of failure	Silent — bill quietly goes up	Loud — agent loses context, re-reads or errs
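
To put numbers on the two mechanisms, here is a minimal arithmetic sketch. The function name, the 100,000-token prompt, and the 20,000-token stable or stale portion are made-up illustrative values; the 0.10 multiplier is the cached-read rate quoted above, and cache-write surcharges are ignored for simplicity.

```python
def request_cost(total_tokens, cached_tokens=0, trimmed_tokens=0,
                 price_per_token=1.0, cached_multiplier=0.10):
    """Input cost of one request, in units of price_per_token.

    Hypothetical model: trimmed tokens are never sent, cached tokens are
    sent but billed at cached_multiplier, everything else is full price.
    """
    sent = total_tokens - trimmed_tokens
    cached = min(cached_tokens, sent)   # cannot cache more than is sent
    fresh = sent - cached
    return price_per_token * (fresh + cached * cached_multiplier)


# Illustrative numbers only, not measurements.
baseline = request_cost(100_000)
with_cache = request_cost(100_000, cached_tokens=20_000)   # stable head cached
with_trim = request_cost(100_000, trimmed_tokens=20_000)   # stale part removed

print(baseline, with_cache, with_trim)
```

Under these assumptions the two techniques save a similar amount on the portion they touch, and neither changes the cost of the remaining 80,000 tokens.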

Where does caching break?

On change, which is what coding agents do. A cache hit needs an unchanged prefix, and caches are scoped per model: edit a file that sits early in the prompt and everything after it re-bills fresh; add or modify a tool definition mid-session and the entire cached prefix invalidates. Caching therefore rescues the static head (system prompt, instructions, rules files) and quietly fails on the moving tail: the files being edited, the growing history. That tail is also the part of the prompt that grows fastest in an agent session.
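
The prefix rule above can be sketched at block level. This is a simplification, not a provider's algorithm: real caches match on tokens and have their own minimum lengths and breakpoint rules, and the block names below are hypothetical.

```python
def shared_prefix_len(prev_blocks, next_blocks):
    """Number of leading blocks that are identical in both prompts."""
    count = 0
    for old, new in zip(prev_blocks, next_blocks):
        if old != new:
            break
        count += 1
    return count


# Hypothetical prompt layout: static head first, then files, then history.
turn_1 = ["system", "rules", "file_a v1", "file_b", "history 1"]
turn_2 = ["system", "rules", "file_a v2", "file_b", "history 1", "history 2"]

hit = shared_prefix_len(turn_1, turn_2)
cache_eligible = turn_2[:hit]   # the static head
rebilled = turn_2[hit:]         # everything from the edited file onward

print(cache_eligible)
print(rebilled)
```

Note that `file_b` and `history 1` did not change between turns, yet they fall after the edited block and so sit outside the shared prefix.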

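The trap is its silence. Caching failures don't error; they just bill at full rate. That is why teams "with caching enabled" still see quadratic session costs: every turn re-sends the whole history, so cumulative input grows roughly with the square of the turn count, and that growth lives exactly where the cache can't reach.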
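
A small sketch of where the quadratic term comes from, assuming a fixed head and a tail that grows by a constant amount each turn and is billed at the full rate (the worst case described above). The function name and the token counts are illustrative assumptions, not measurements.

```python
def session_input_tokens(head, per_turn, turns):
    """Cumulative input tokens over a session, split into head and tail.

    Assumes every turn re-sends the fixed head plus a tail that has grown
    by per_turn tokens on each turn so far.
    """
    head_total = 0
    tail_total = 0
    for turn in range(1, turns + 1):
        head_total += head             # linear in the number of turns
        tail_total += per_turn * turn  # sums to per_turn * n * (n + 1) / 2
    return head_total, tail_total


ten = session_input_tokens(head=5_000, per_turn=2_000, turns=10)
twenty = session_input_tokens(head=5_000, per_turn=2_000, turns=20)

print(ten)
print(twenty)
```

With these numbers, doubling the session length doubles the head total but nearly quadruples the tail total, so even a fully discounted head leaves most of the bill in place.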

Where does trimming break?

On fidelity. Trimming is a bet that the removed tokens won't be needed; lose the bet and the agent re-reads what was dropped (paying twice) or proceeds without it (worse). Summarized history keeps conclusions and drops reasoning — early decisions and constraints are the classic casualties, and compaction can fail outright on very long sessions right when the accumulated context was most valuable.

Trimming also doesn't compound with caching as cleanly as the slide-deck version suggests: trimming changes the prompt, and every token after the first change loses its cache hit. Aggressive trimming on every turn can produce a prompt that never caches at all.
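
The interaction can be sketched by comparing an append-only history with a sliding window that drops the oldest message every turn. This is a block-level toy under stated assumptions: the helper names, the one-block head, and the window size of 3 are all hypothetical, and real caches count tokens rather than blocks.

```python
def common_prefix_len(prev_prompt, next_prompt):
    """Number of leading blocks shared by two consecutive prompts."""
    count = 0
    for old, new in zip(prev_prompt, next_prompt):
        if old != new:
            break
        count += 1
    return count


def reusable_fractions(prompts):
    """Share of each new prompt that the previous prompt's prefix covers."""
    return [common_prefix_len(prev, nxt) / len(nxt)
            for prev, nxt in zip(prompts, prompts[1:])]


head = ["system"]
messages = [f"m{i}" for i in range(1, 7)]
window = 3

# Append-only: each prompt extends the previous one.
append_only = [head + messages[:i] for i in range(1, 7)]
# Sliding window: keep only the last `window` messages after the head.
sliding = [head + messages[max(0, i - window):i] for i in range(1, 7)]

print(reusable_fractions(append_only))
print(reusable_fractions(sliding))
```

Once the window is full, every turn shifts the history by one message, so only the head stays reusable. Trimming at task boundaries avoids that shift within a task.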

So which one should you use?

Both — on the parts they're good at — and neither as the primary fix. The working split: keep a small, stable head (system prompt, lean rules) and let caching cover it; trim deliberately at task boundaries (fresh session per task) rather than continuously, so the cache survives within a task; and attack the real driver separately.

The real driver, measured: ~76% of agent tokens are reads and navigation. Caching discounts re-reads; trimming evicts them after they were paid for. Only changing what the agent reads in the first place — the slice instead of the file, the answer instead of the grep dump — removes them, which is the approach our benchmark measures at −86%/−90% and the four-meter model puts in context. It's also cache-friendly: a stable map serving small slices churns the prompt far less than full-file reads do — try the difference on your own workload.
