Overview

Prompt caching cuts latency by roughly 50 to 80 percent and cost by roughly 90 percent on the cached portion of the prompt. The mechanism is simple: the provider hashes a prefix of the prompt and reuses the computed KV cache when the same prefix appears on the next call. Maximizing the hit rate is a matter of prompt layout: stable content first, mutable content last, deterministic chunk order in between.

Keep the longest cacheable prefix stable

The cache key is the prefix. Anything you change in the first N tokens invalidates the cache for every subsequent call. Order the prompt so the most stable content sits at the top.

The standard layout for a cached prompt:

1. System message and policy            (stable, rarely changes)
2. Few-shot examples                    (stable across requests)
3. Tool definitions                     (stable across requests)
4. Retrieved context (RAG documents)    (changes per request, but reorderable)
5. Conversation history                 (grows, but only appends)
6. Current user query                   (changes every request)

The first three regions form the cacheable prefix. The provider caches them once and reuses the cached KV state on every subsequent call that starts with the same bytes.

Place mutable per-request content at the END of the prompt

A timestamp, a session ID, or “today is 2026-05-15” inserted at the top of the prompt destroys the cache. Move all per-request mutable content to the bottom of the prompt where it does not invalidate the prefix.

Bad:  System: You are a helpful assistant. Today is 2026-05-15.
      User: What is the weather?

Good: System: You are a helpful assistant.
      User: What is the weather? (Today is 2026-05-15.)

The “today is” string moves from the system message (cache-invalidating) to the user message (always uncached). The semantic information is preserved; the cache survives.

Sort retrieved chunks deterministically

In a RAG pipeline, the natural order to feed retrieved chunks to the model is by relevance score (most relevant first). For cache-hit purposes, that order is the worst possible choice: relevance order changes between similar queries and breaks the prefix on every call.

Sort retrieved chunks by a stable key (chunk ID, document timestamp, alphabetical) instead. The model’s ability to use the chunks does not depend on order; the cache hit does.

Bad:  sorted by relevance score (changes per query)
Good: sorted by chunk ID (stable across queries)

For the small set of queries where relevance order genuinely improves output quality (long-context degradation, lost-in-the-middle effects), make the call explicitly: trade the cache hit for the quality bump. Most RAG calls do not need it.

Pin the model version

Prompt caches are scoped per model version. claude-opus-4-7 and claude-opus-4-8 have separate caches. A model upgrade resets the hit rate to zero on the first call to the new version.

Pin the model version in the API call rather than using the floating alias (claude-opus without a version). When you upgrade:

  1. Run the eval set against the new model first.
  2. Cut over deliberately during a low-traffic window so the cache warms up before peak load.
  3. Track the cache hit rate before and after; expect a brief dip while the new cache fills.

Use the provider’s explicit cache control where available

Provider APIs increasingly expose explicit cache controls. Use them; do not rely on implicit prefix-matching alone.

  • Anthropic Claude: cache_control: { type: "ephemeral" } markers on message blocks. Mark the system message, the tool list, and the few-shot examples as cached. Five-minute default TTL.
  • OpenAI: automatic prefix caching for prompts over 1024 tokens. No explicit annotation; layout is the only lever.
  • Google Gemini: explicit cached-content objects you create with cachedContents.create and reference by name. Useful for very long stable contexts.

Explicit markers let you cache the right portion. Without them you depend on the provider to find the boundary, which is fragile.

Measure cache hit rate as a deployment metric

Cache hit rate is a first-class production metric. Track it per endpoint, per prompt version, per model version. A drop in hit rate is a regression, usually caused by:

  • A new field added at the top of the system message.
  • A non-deterministic ordering of retrieved chunks.
  • A model upgrade.
  • A logging or middleware change that inserts a request ID into the prompt.
endpoint=/classify  prompt=v3.b  model=claude-opus-4-7
hit_rate_24h: 0.84
hit_rate_1h:  0.62   (regression)

Alert on a sustained drop. The alert is usually faster than the cost dashboard.

Pitfalls

  • Putting the current timestamp or user ID at the top of the prompt. The prefix changes every call and the hit rate is zero.
  • Sorting RAG chunks by relevance score when the prompt would also be fine with deterministic order.
  • Using a floating model alias instead of a pinned version. Cache resets silently when the provider rolls forward.
  • Editing the system prompt in production without a version bump. Old cached prefixes go stale on the next call.
  • Caching the user query. The whole point is to cache the stable prefix; per-user content goes at the end.