Definition

Prompt caching is an API-level optimization where the provider stores the attention key-value (KV) cache computed for a prompt prefix. When a subsequent request starts with the same prefix (up to the marked breakpoint), the provider skips recomputing attention for those tokens and reads from the cache instead.
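
As a rough mental model (not the provider's actual implementation, which caches per-token attention states server-side rather than strings), a prefix cache behaves like a lookup keyed by the exact prefix: an identical prefix is a hit, anything else is a miss that also writes a new entry. A toy sketch:

class ToyPrefixCache:
    """Toy illustration only; a real cache stores attention KV states, not text."""

    def __init__(self):
        self._seen = set()

    def lookup(self, prefix: str) -> bool:
        """Return True on a cache hit, False on a miss (which also writes the prefix)."""
        if prefix in self._seen:
            return True
        self._seen.add(prefix)
        return False

cache = ToyPrefixCache()
system_prompt = "You are a helpful assistant." * 100  # stands in for a long static prefix
print(cache.lookup(system_prompt))        # False: first request writes the cache
print(cache.lookup(system_prompt))        # True: identical prefix hits the cache
print(cache.lookup(system_prompt + "!"))  # False: any change before the breakpoint misses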

In the Anthropic API, prompt caching is opt-in. Mark the end of a cacheable prefix by setting "cache_control": {"type": "ephemeral"} on the last content block of the prefix. Cache lifetime is currently 5 minutes, refreshed on each cache hit. Cache reads are billed at 10% of the normal input-token price; cache writes carry a surcharge over the base input rate (currently 25% extra for the 5-minute cache).
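
A quick back-of-the-envelope calculation shows how quickly this pays off. The sketch below uses the read and write rates described above; the prefix size and request count are illustrative values, not measurements.

# Break-even sketch for caching a static prefix, using the billing
# multipliers described above (read = 0.10x, write surcharge assumed 1.25x).
PREFIX_TOKENS = 2_000    # size of the cached system prompt (example value)
REQUESTS = 50            # requests within the cache lifetime (example value)
READ_MULTIPLIER = 0.10   # cached tokens cost 10% of normal input tokens
WRITE_MULTIPLIER = 1.25  # assumed surcharge for the initial cache write

# Cost in "input-token equivalents" (multiply by the per-token price for dollars).
without_cache = REQUESTS * PREFIX_TOKENS
with_cache = PREFIX_TOKENS * WRITE_MULTIPLIER + (REQUESTS - 1) * PREFIX_TOKENS * READ_MULTIPLIER

print(f"without caching: {without_cache:,.0f} token-equivalents")
print(f"with caching:    {with_cache:,.0f} token-equivalents")
# For these numbers the cached version costs roughly 12% of the uncached one.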

Typical cacheable prefixes:

  • Long system prompts (tool definitions, instructions, persona).
  • Large static documents attached to every request (codebases, manuals, corpora); see the sketch after this list.
  • Few-shot examples.
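
For large static documents, the breakpoint can sit on the document block itself inside the user message, so the document is cached while each question stays dynamic. A minimal sketch, assuming a plain-text manual read from a local file (the file name and question are hypothetical):

import anthropic

client = anthropic.Anthropic()

# Hypothetical local file standing in for a large static document (> 1,024 tokens).
with open("manual.txt") as f:
    manual = f.read()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": manual,
                # Breakpoint after the document: everything up to here is cacheable.
                "cache_control": {"type": "ephemeral"},
            },
            # The question varies per request and sits after the breakpoint.
            {"type": "text", "text": "Which chapter covers installation?"},
        ],
    }],
)
print(response.usage.cache_read_input_tokens, response.usage.cache_creation_input_tokens)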

Cache hits require an exact byte-for-byte prefix match. Any change to the prefix before the cache breakpoint invalidates the cache. The minimum cacheable prefix is 1,024 tokens on most models (some smaller models require more); marking a shorter prefix has no effect.
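
To check whether a prefix clears that minimum before relying on caching, the Python SDK's token-counting endpoint can be used. A minimal sketch (the prompt text is a stand-in):

import anthropic

client = anthropic.Anthropic()

long_system_prompt = "You are a support agent for ..."  # stand-in for the real prompt

# Count tokens without sending a full request. The count includes the short
# user message, so it slightly overstates the prefix length.
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system=[{"type": "text", "text": long_system_prompt}],
    messages=[{"role": "user", "content": "ping"}],
)
if count.input_tokens >= 1024:
    print("Prefix is long enough to cache.")
else:
    print(f"Only {count.input_tokens} tokens; a cache breakpoint would have no effect.")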

When it applies

Use prompt caching when:

  • The system prompt plus tool definitions exceeds the 1,024-token minimum and is reused across requests.
  • A document is attached to many user questions in the same session.
  • A multi-turn agent loop re-sends the same large context each turn.

Do not place dynamic content (timestamps, request IDs, user-specific data) before the breakpoint; any variation there breaks prefix matching and forces a fresh cache write on every request. Keep such content in the user turns, after the cached prefix.
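
A minimal sketch of that layout in a multi-turn agent loop: the long system prompt is static and cached, while a per-request timestamp stays in the user turn. The prompt text and questions here are stand-ins.

import datetime
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM = "You are a scheduling assistant. ..."  # stand-in; > 1,024 tokens in practice

messages = []
for question in ["What slots are free tomorrow?", "Book the earliest one."]:
    # Dynamic data goes in the user turn, after the cached prefix, so it never
    # invalidates the cache.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    messages.append({"role": "user", "content": f"[current time: {now}] {question}"})

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # breakpoint ends the static prefix
        }],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    u = response.usage
    print(f"read={u.cache_read_input_tokens} write={u.cache_creation_input_tokens}")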

Example

import anthropic
 
client = anthropic.Anthropic()
 
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "..." * 2000,  # placeholder; use real instructions totalling > 1,024 tokens
        # Breakpoint: everything up to and including this block is cacheable.
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Summarize section 3."}]
)

usage = response.usage
print(f"Cache read: {usage.cache_read_input_tokens}")
print(f"Cache write: {usage.cache_creation_input_tokens}")

Related terms

  • context-window - prompt caching is most impactful when the context window is large.
  • token - cache hits reduce the effective input token cost to 10%.
  • system-prompt - system prompts are the most common caching target.
  • few-shot-prompting - few-shot examples benefit from caching when reused across turns.
  • prompt-design - the prompting deep-dive including caching strategy.

Citing this term

See Prompt Cache (llmbestpractices.com/glossary/prompt-cache).