Overview
This page is the atomic definition; broader caching and cost-reduction patterns are covered under embeddings.
Definition
A semantic cache stores language model responses keyed by a vector embedding of the prompt rather than by an exact string match. When a new prompt arrives, its embedding is computed and compared against the cached prompt embeddings using vector-similarity. If the closest match exceeds a similarity threshold (typically 0.90-0.95 cosine), the cached response is returned without making an LLM API call. This generalizes beyond exact prompt caching (which providers like Anthropic offer natively as prompt caching) to handle rephrased or semantically equivalent queries. Tools like GPTCache and Redis with vector search modules implement this pattern. Trade-offs:
- A false positive (a similar but not equivalent prompt) returns a wrong answer; set the threshold high enough that this is rare for your workload.
- Embedding computation adds 5-20 ms of latency to every request, hit or miss.
- Cached responses go stale if the underlying data changes.
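A minimal in-memory sketch of this lookup logic, assuming an embed_fn that maps a string to a 1-D numpy vector (the SemanticCache name, the linear scan, and the 0.92 threshold are illustrative, not a reference implementation):

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn   # assumed: maps str -> 1-D numpy array
        self.threshold = threshold # within the typical 0.90-0.95 range
        self.keys = []             # cached prompt embeddings
        self.values = []           # cached responses

    def _cosine(self, a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, prompt):
        """Return a cached response if the closest embedding clears the threshold."""
        if not self.keys:
            return None
        q = self.embed_fn(prompt)  # embedding cost is paid on every lookup
        sims = [self._cosine(q, k) for k in self.keys]
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.values[best]  # cache hit: no LLM call needed
        return None                   # cache miss: caller queries the LLM

    def put(self, prompt, response):
        self.keys.append(self.embed_fn(prompt))
        self.values.append(response)
```

The linear scan keeps the sketch short but makes lookup cost grow with cache size, which is why production implementations such as GPTCache or Redis vector search use an approximate-nearest-neighbor index instead.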
When it applies
Use semantic caching for FAQ-style queries, support chatbots, and documentation search, where users rephrase the same questions in different ways. Do not use it for queries that depend on user-specific state (account ID, session data) unless the cache key incorporates that state separately, for example by namespacing the cache per account as sketched below.
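One way to honor that rule, reusing the SemanticCache sketch above, is to keep a separate cache per account so a hit can never return another user's response (the cached_query helper and llm_call parameter are illustrative):

```python
caches = {}  # account_id -> SemanticCache, so state never crosses accounts

def cached_query(account_id, prompt, embed_fn, llm_call):
    cache = caches.setdefault(account_id, SemanticCache(embed_fn))
    hit = cache.get(prompt)
    if hit is not None:
        return hit                 # served from this account's cache
    response = llm_call(prompt)    # LLM call only on a miss
    cache.put(prompt, response)
    return response
```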
Example
Query 1: “How do I reset my password?” is cached with response R. Query 2: “I forgot my password, how can I reset it?” matches at 0.93 cosine similarity and returns R from the cache, avoiding a 200 ms LLM call.
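The same scenario, run through the sketch above with a small sentence-embedding model (sentence-transformers and the all-MiniLM-L6-v2 model are illustrative choices; the exact similarity score depends on the embedder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, low-latency embedder
cache = SemanticCache(embed_fn=model.encode)

cache.put("How do I reset my password?", "R")  # "R" stands in for a real response
print(cache.get("I forgot my password, how can I reset it?"))
# Prints "R" if the paraphrase clears the threshold; None on a miss.
```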
Related concepts
- embedding - the vector representation used as the cache key.
- vector-similarity - the metric that determines cache hit threshold.
- embeddings - embedding model selection for low-latency cache key generation.
- context-window - each cache hit eliminates an API call that would otherwise fill a context window.
- token - a cache hit avoids paying input and output token costs.
Citing this term
See Semantic cache (llmbestpractices.com/glossary/semantic-cache).