Overview
This page is the atomic definition of top-p sampling. Guidance on sampling and inference configuration lives at prompt-design.
Definition
Top-p sampling (nucleus sampling) restricts the token vocabulary considered at each generation step to the smallest set of tokens whose cumulative probability mass meets or exceeds p. At top_p=0.9, the model considers only the tokens that together account for 90% of the probability mass, discarding the long tail of unlikely tokens. The probabilities of the remaining candidates are renormalized and one token is sampled from them.

Top-p and temperature interact: temperature reshapes the distribution before top-p clips the tail. Setting top_p=1.0 disables nucleus sampling (the full vocabulary is available); setting top_p=0.1 produces very conservative output that is close to greedy decoding. Most production configurations tune only one of temperature or top-p and leave the other at its maximum value; mixing both requires understanding their interaction. OpenAI, Anthropic, and Google all expose top_p as an inference parameter.
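The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration over a toy token-to-probability map, not any provider's actual implementation:

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    meets or exceeds p, then renormalize. `probs` maps token -> probability."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers p
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample(probs, p, rng=random.random):
    """Sample one token from the nucleus-filtered distribution."""
    filtered = top_p_filter(probs, p)
    r, acc = rng(), 0.0
    for token, prob in filtered.items():
        acc += prob
        if r <= acc:
            return token
    return token  # guard against floating-point rounding
```

Note that the renormalization divides by the nucleus total (e.g. 0.92), so the surviving tokens' probabilities sum to 1 again before sampling.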
When it applies
Set top_p below 1.0 to reduce incoherent or off-topic token selection in creative tasks. For deterministic pipelines, set temperature=0 and leave top_p=1.0; greedy decoding at temperature zero is simpler to reason about than nucleus sampling at low p.
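As a concrete illustration, the two configurations above might look like the following. Parameter names follow the common OpenAI-style convention; exact names and accepted ranges vary by provider SDK, so treat these dicts as a sketch rather than a specific API call:

```python
# Tune one knob and leave the other at its maximum (parameter names
# are assumptions based on the common OpenAI-style convention).
creative = {"temperature": 1.0, "top_p": 0.9}     # nucleus sampling on
deterministic = {"temperature": 0, "top_p": 1.0}  # greedy decoding
```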
Example
At top_p=0.9, if the top 3 tokens cover 92% of probability mass, only those 3 are considered. The rare fourth token (2% probability) is excluded even though it might occasionally be correct.
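Run numerically, that example looks like this. The token strings and individual probabilities are made up for illustration; only the "top 3 cover 92%" shape matches the text above:

```python
# Hypothetical ranked next-token distribution; the top 3 cover 92%.
ranked = [("the", 0.50), ("a", 0.30), ("an", 0.12), ("this", 0.02)]
p, nucleus, cum = 0.9, [], 0.0
for token, prob in ranked:
    nucleus.append(token)
    cum += prob
    if cum >= p:  # nucleus now meets the threshold
        break
print(nucleus)  # ['the', 'a', 'an']; 'this' (2%) is never considered
```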
Related concepts
- temperature - the companion sampling parameter that reshapes the distribution.
- token - top-p filters the set of next-token candidates.
- prompt-design - guidance on pairing temperature and top-p for different tasks.
- structured-output - structured outputs pair with low temperature; top-p is less critical.
Citing this term
See Top-p (nucleus sampling) (llmbestpractices.com/glossary/top-p).