Overview
This page is the atomic definition; broader caching and cost-reduction patterns are covered under embeddings.
Definition
A semantic cache stores language model responses keyed by a vector embedding of the prompt rather than by an exact string match. When a new prompt arrives, its embedding is computed and compared against the cached prompt embeddings using vector-similarity. If the closest match exceeds a similarity threshold (typically 0.90-0.95 cosine), the cached response is returned without making an LLM API call. This generalizes beyond exact prompt caching (which providers like Anthropic offer natively as prompt caching) to handle rephrased or semantically equivalent queries. Tools like GPTCache and Redis with vector search modules implement this pattern. Trade-offs:
- A false positive (a similar but not equivalent prompt) returns a wrong answer; set the threshold high enough that this is rare for your workload.
- Embedding computation adds 5-20 ms of latency to every request, hit or miss.
- Cached responses go stale if the underlying data changes.
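A minimal in-memory sketch of this lookup logic, assuming an embed_fn that maps a string to a 1-D numpy vector (the SemanticCache name, the linear scan, and the 0.92 threshold are illustrative, not a reference implementation):

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn   # assumed: maps str -> 1-D numpy array
        self.threshold = threshold # within the typical 0.90-0.95 range
        self.keys = []             # cached prompt embeddings
        self.values = []           # cached responses

    def _cosine(self, a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, prompt):
        """Return a cached response if the closest embedding clears the threshold."""
        if not self.keys:
            return None
        q = self.embed_fn(prompt)  # embedding cost is paid on every lookup
        sims = [self._cosine(q, k) for k in self.keys]
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.values[best]  # cache hit: no LLM call needed
        return None                   # cache miss: caller queries the LLM

    def put(self, prompt, response):
        self.keys.append(self.embed_fn(prompt))
        self.values.append(response)
```

The linear scan keeps the sketch short but makes lookup cost grow with cache size, which is why production implementations such as GPTCache or Redis vector search use an approximate-nearest-neighbor index instead.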
When it applies
Use semantic caching for FAQ-style queries, support chatbots, and documentation search, where users rephrase the same questions in different ways. Do not use it for queries that depend on user-specific state (account ID, session data) unless the cache key incorporates that state separately, for example by namespacing the cache per account as sketched below.
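One way to honor that rule, reusing the SemanticCache sketch above, is to keep a separate cache per account so a hit can never return another user's response (the cached_query helper and llm_call parameter are illustrative):

```python
caches = {}  # account_id -> SemanticCache, so state never crosses accounts

def cached_query(account_id, prompt, embed_fn, llm_call):
    cache = caches.setdefault(account_id, SemanticCache(embed_fn))
    hit = cache.get(prompt)
    if hit is not None:
        return hit                 # served from this account's cache
    response = llm_call(prompt)    # LLM call only on a miss
    cache.put(prompt, response)
    return response
```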
Example
Query 1: “How do I reset my password?” is cached with response R. Query 2: “I forgot my password, how can I reset it?” matches at 0.93 cosine similarity and returns R from the cache, avoiding a 200 ms LLM call.
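The same scenario, run through the sketch above with a small sentence-embedding model (sentence-transformers and the all-MiniLM-L6-v2 model are illustrative choices; the exact similarity score depends on the embedder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, low-latency embedder
cache = SemanticCache(embed_fn=model.encode)

cache.put("How do I reset my password?", "R")  # "R" stands in for a real response
print(cache.get("I forgot my password, how can I reset it?"))
# Prints "R" if the paraphrase clears the threshold; None on a miss.
```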
Related concepts
- embedding - the vector representation used as the cache key.
- vector-similarity - the metric that determines cache hit threshold.
- embeddings - embedding model selection for low-latency cache key generation.
- context-window - each cache hit eliminates an API call that would otherwise fill a context window.
- token - a cache hit avoids paying input and output token costs.
Citing this term
See Semantic cache (llmbestpractices.com/glossary/semantic-cache).