Overview
This page is the atomic definition. Evaluation infrastructure patterns live at evaluation-harness.
Definition
Ground truth is the verified correct answer or expected output for a given input, used as the reference standard when evaluating an LLM system. It is established by human experts, authoritative data sources, or unambiguous formal systems (e.g., unit test pass/fail, SQL query results). Evaluation metrics such as accuracy, F1, BLEU, ROUGE, and faithfulness all require comparing model outputs against ground truth, so the quality of an evaluation is bounded by the quality of its ground truth: noisy or incorrect labels make every downstream metric unreliable.
Collecting ground truth is expensive. Strategies to reduce cost include:
- sampling from existing labeled datasets,
- having SMEs annotate a representative subset,
- mining production logs to surface high-value edge cases,
- synthetic generation followed by manual verification.
For RAG systems, ground truth includes both the expected answer and the expected source document(s). Ground truth differs from golden-set: a golden set is a curated subset of high-confidence, high-signal examples, whereas ground truth labels can exist for any input, including noisy or borderline cases.
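In practice, a ground-truth record for a RAG system carries both the expected answer and the expected source document(s) described above. A minimal sketch of such a record, with illustrative (not standardized) field names:

```python
from dataclasses import dataclass, field

@dataclass
class GroundTruthRecord:
    """One verified reference example; field names are illustrative."""
    input: str                 # the query or prompt
    expected_output: str       # the verified correct answer
    # For RAG: IDs of the source documents that support the answer.
    expected_sources: list[str] = field(default_factory=list)

record = GroundTruthRecord(
    input="What is the return window for electronics?",
    expected_output="30 days from delivery.",
    expected_sources=["policy/returns.md"],
)
```

For non-RAG tasks, `expected_sources` is simply left empty; the same record shape then degenerates to a plain (input, expected output) pair.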
When it applies
Create ground truth for every task where you need to measure LLM performance objectively. Prioritize coverage of high-stakes or high-frequency inputs. Update ground truth when the domain changes (new products, new laws, new terminology) to avoid evaluating against stale references.
Example
A customer support classifier needs to route tickets to the correct team. 1,000 historical tickets are manually labeled by support managers. These labeled pairs (ticket text, correct team) become the ground truth. Model accuracy is measured as the fraction of predictions matching these labels.
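The accuracy measurement above can be sketched in a few lines. The ticket texts, team labels, and predictions here are hypothetical placeholders, not data from the source:

```python
# Ground truth: manually labeled (ticket text -> correct team) pairs.
ground_truth = {
    "Where is my refund?": "billing",
    "App crashes on login": "technical",
    "Cancel my subscription": "billing",
}

def accuracy(predictions: dict[str, str], labels: dict[str, str]) -> float:
    """Fraction of inputs whose prediction matches the ground-truth label."""
    correct = sum(predictions[text] == team for text, team in labels.items())
    return correct / len(labels)

# Model predictions for the same tickets; the last one is a routing error.
predictions = {
    "Where is my refund?": "billing",
    "App crashes on login": "technical",
    "Cancel my subscription": "sales",
}
print(accuracy(predictions, ground_truth))  # prints 0.6666666666666666
```

Scaling this to the 1,000 labeled tickets in the example changes only the size of the dictionaries, not the computation.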
Related concepts
- eval-set - the collection of input-output pairs whose expected outputs are ground truth.
- golden-set - a high-confidence subset of ground-truth examples used for regression testing.
- hallucination-rate - measured by comparing outputs against ground truth.
- evaluation-harness - the framework that runs evaluation against ground truth at scale.
- llm-as-judge - an automated alternative when human ground truth is unavailable.
Citing this term
See Ground Truth (llmbestpractices.com/glossary/ground-truth).