Overview
This page gives the atomic definition. Context-management strategies are covered under prompt-design and rag.
Definition
The context window is the maximum number of tokens a language model can attend to in a single forward pass. It bounds input (the prompt, conversation history, retrieved documents) and output (the completion) together. As of 2025, window sizes range from 8k tokens (smaller open-source models) to 200k (Claude 3.7 Sonnet) and 1M+ (Gemini 1.5 Pro). Longer contexts cost more per call, both because providers price per input token and because self-attention compute grows quadratically with sequence length. Filling a long context with everything available produces mixed results: models show a “lost in the middle” effect, recalling information from the middle of a very long prompt less reliably than information near the start or end. Retrieval-augmented generation (retrieval-augmented-generation) addresses this by fetching only the most relevant chunks, keeping the effective context short.
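Because input and output share the window, a call must reserve completion tokens out of the same budget. A minimal sketch of that check, assuming the tiktoken library with its cl100k_base encoding and a hypothetical 200k limit (token counts are model-specific):

```python
# Sketch: budget a prompt against the context window.
# Assumptions: tiktoken with the cl100k_base encoding, and a
# hypothetical 200k-token limit; real counts vary by model/tokenizer.
import tiktoken

CONTEXT_WINDOW = 200_000  # hypothetical model limit, in tokens
MAX_OUTPUT = 4_096        # tokens reserved for the completion

def fits_in_window(prompt: str) -> bool:
    """True if the prompt plus the reserved completion fits the window."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + MAX_OUTPUT <= CONTEXT_WINDOW
```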
When it applies
Stay well under the context limit to avoid truncation errors and control cost. For documents longer than ~50 pages, use RAG instead of inserting the full text. For multi-turn chat, prune or summarize old turns before the prompt exceeds the window.
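A minimal sketch of the pruning strategy, assuming OpenAI-style role/content message dicts and a count_tokens helper (any model-specific tokenizer will do); both names are illustrative, not a real API:

```python
# Sketch: prune old chat turns to stay under a token budget.
# Assumptions: messages are {"role": ..., "content": ...} dicts and
# count_tokens is a stand-in for a model-specific tokenizer.
def prune_history(messages: list[dict], budget: int, count_tokens) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    # Walk from the newest turn backwards, so the oldest turns drop first.
    for m in reversed(turns):
        cost = count_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    # A production version should also keep user/assistant turns paired.
    return system + list(reversed(kept))
```

The summarization variant replaces the dropped turns with a single synthetic message instead of discarding them outright.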
Example
A Claude 3.7 Sonnet call with a 200k-token context, at $3 per million input tokens, costs $0.60 per call for the context alone. Retrieval that narrows the same information to 4k tokens costs $0.012.
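The same arithmetic as a quick check, using the $3-per-million-token rate above:

```python
# The example's arithmetic at $3 per million input tokens.
PRICE_PER_TOKEN = 3.00 / 1_000_000

full_context = 200_000 * PRICE_PER_TOKEN  # $0.60
rag_context = 4_000 * PRICE_PER_TOKEN     # $0.012
print(f"full context: ${full_context:.3f}, RAG: ${rag_context:.3f}")
```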
Related concepts
- token - the unit in which the context window is measured.
- rag (retrieval-augmented-generation) - the pattern that avoids filling the context window with raw documents.
- prompt-design - techniques for staying within budget while preserving information.
- semantic-cache - caching responses to repeated prompts avoids re-filling the context window.
Citing this term
See Context window (llmbestpractices.com/glossary/context-window).