news

Claude Code Prompt Caching

Sebastien Dubois

14 Apr 2026 — 2 min read

Canonical version: Claude Code Prompt Caching.

Claude Code uses Anthropic's prompt cache to reuse large, stable chunks of context (system prompt, tool definitions, CLAUDE.md, prior turns) across requests instead of resending and re-billing them every turn. Cache reads are cheaper than uncached tokens; cache writes are more expensive than normal tokens. Net savings depend on your usage pattern — context window size, whether the request is the main agent or a subagent, how often a session resumes, etc.

5-minute vs 1-hour cache

Anthropic supports two cache durations. The trade-off is simple: a longer TTL wins when the same cached prefix is reused many times within the window; a longer TTL loses when you pay the higher write price and then never hit the cache again.

5m cache: good for short, interactive bursts. Pays off if the session keeps sending requests with the same prefix within a few minutes.
1h cache: good for long-running or returning sessions. Write is more expensive, so it only pays off when the cached prefix is actually reused during the hour. A single-shot agent invocation with 1h caching is a worst case — you pay the higher write cost and throw the cache away.

Defaults in Claude Code

Claude Code picks the duration per query type, not globally, based on expected reuse patterns:

Subscribers (Pro/Max, logged-in CLI): 1h cache is rolled out as the default in several places where reuse is likely. Subagents and other rarely-resumed queries stay on 5m on purpose, to avoid paying the 1h write premium for caches that will never be reused.
API customers: still default to 5m everywhere. Anthropic wants more real-world data before flipping 1h on by default for API usage.
Coming soon: client-side default is moving to 1h for a few more query types, plus environment variables to force 1h or 5m explicitly. Useful for users who know their workflow better than the heuristic does.

Expected token savings from these heuristics are small — not 12x or anything dramatic. It is a net win on average for the targeted query types, not a step change.

Telemetry interaction

The 1h/5m decision is driven by experiment gates cached client-side. Turning off telemetry (DISABLE_TELEMETRY=1, see Claude Code Configuration) also disables experiment gates — Claude Code does not phone home when telemetry is off — so the client falls back to the hard-coded default, which today is 5m for all query types. If you care about the 1h optimization, keep telemetry enabled. If you care about not phoning home, expect to pay the uncached-prefix cost on queries where 1h would have helped.

Why the nuance matters

Prompt caching is one of the few cost knobs users can actually tune in an agent harness. The same cache duration can be a win, a wash, or a net loss depending on workflow. Main-agent loops with long context windows and many turns benefit most. One-shot subagent calls, short sessions, or workflows that rotate through fresh contexts get less (or negative) value from a longer TTL. Anthropic is rolling 1h out selectively rather than globally for this reason.

Related mechanism at the model level: AI KV Cache (attention-layer cache for generation, different from prompt caching which deduplicates input tokens across requests).

References

Boris Cherny (Claude Code, Anthropic), thread on 1h prompt cache rollout (2026-04-13): https://x.com/bcherny/status/2043715713551212834
Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching