Overview

This page is the atomic definition; guidance on sampling and inference configuration lives at prompt-design.

Definition

Temperature is a scalar applied to the logits (the model's raw, unnormalized scores for each candidate token) before sampling the next token. Dividing the logits by a temperature below 1.0 sharpens the resulting distribution, making the highest-probability token even more likely and reducing variance; dividing by a temperature above 1.0 flattens the distribution, giving lower-probability tokens more opportunity to be sampled and increasing variance. At temperature=0 the model always picks the highest-probability token (greedy decoding), producing deterministic output for the same prompt and model state. At temperature=1.0 the model samples from the unmodified distribution.

Typical recommendations: use 0 or 0.1 for structured extraction, code generation, and fact retrieval; use 0.7-1.0 for creative writing, brainstorming, and varied generation. Most providers cap temperature at 2.0, and values much above 1.0 produce incoherent output for most tasks.
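
As a concrete sketch of the scaling step (NumPy and a toy four-token vocabulary are assumed here; real decoders typically combine this with top-p and other filters):

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
        """Sample one token id from raw logits after temperature scaling.
        temperature=0 falls back to greedy decoding (argmax), since division
        by zero is undefined."""
        logits = np.asarray(logits, dtype=np.float64)
        if temperature == 0:
            return int(np.argmax(logits))        # greedy: always the top token
        scaled = logits / temperature            # <1.0 sharpens, >1.0 flattens
        probs = np.exp(scaled - scaled.max())    # softmax with an overflow guard
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    # Toy scores for a four-token vocabulary: watch the top token's share change.
    logits = [2.0, 1.0, 0.5, -1.0]
    for t in (0.1, 1.0, 2.0):
        scaled = np.array(logits) / t
        p = np.exp(scaled - scaled.max())
        p /= p.sum()
        print(f"temperature={t}: {np.round(p, 3)}")

Running the loop shows the top token absorbing nearly all probability at 0.1 and the distribution flattening toward uniform at 2.0.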

When it applies

Set temperature explicitly for any production use case. Default temperature varies by provider and model. Use 0 for deterministic pipelines (parsing, classification, extraction) where reproducibility matters. Use higher values for creative tasks where diversity is the goal.
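
One way to keep temperature explicit is a per-task lookup, sketched below; the task names and helper are hypothetical, and the values follow the recommendations in the Definition section.

    # Hypothetical task names and helper; substitute your provider's SDK call.
    TASK_TEMPERATURES = {
        "extraction": 0.0,        # deterministic pipelines: parsing, classification
        "code_generation": 0.1,
        "fact_retrieval": 0.0,
        "creative_writing": 0.9,  # diversity is the goal
        "brainstorming": 1.0,
    }

    def build_request(task_type: str, prompt: str) -> dict:
        # Always set temperature explicitly; provider defaults vary by model.
        return {"prompt": prompt, "temperature": TASK_TEMPERATURES[task_type]}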

Example

A data extraction prompt with temperature=0 returns the same JSON every run given the same input, enabling reliable testing. The same prompt at temperature=1.0 may vary field capitalization or phrasing across runs.
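
A minimal test sketch of that property, where complete() is a hypothetical wrapper around whatever client you use:

    import json

    def assert_reproducible(complete, prompt: str, runs: int = 3) -> None:
        # complete(prompt, temperature) is a hypothetical wrapper around your API client.
        # At temperature=0 the parsed JSON should be identical on every run,
        # which makes extraction prompts cheap to regression-test.
        outputs = [json.loads(complete(prompt, temperature=0)) for _ in range(runs)]
        assert all(o == outputs[0] for o in outputs), "extraction output drifted between runs"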

Related terms

  • top-p - the complementary sampling parameter that limits the candidate token set by cumulative probability.
  • token - temperature controls the distribution over the next token at each step.
  • prompt-design - when to set temperature for different task types.
  • structured-output - structured output schemas pair with low temperature for reliability.

Citing this term

See Temperature (llmbestpractices.com/glossary/temperature).