Overview
This page is the atomic definition. The deep-dive lives at evaluation.
Definition
A golden set is a hand-labeled evaluation set used as the ground truth for measuring model or retrieval quality. It pairs inputs (questions, documents, tasks) with the correct outputs or annotations (expected answer, gold chunks, expected labels). Every evaluation run scores the system’s outputs against the golden set. Size depends on stakes and budget: 50 to 200 items is common for a starter set; 1,000+ for production-critical systems. The set must be diverse enough to cover the input distribution and small enough that humans can re-grade it when the rubric changes.
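A minimal sketch of what one golden-set item might look like in Python; the `GoldenItem` fields (`id`, `input`, `expected`, `metadata`) are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenItem:
    # One hand-labeled item; field names are illustrative, not a standard schema.
    id: str              # stable ID so disagreements can be tracked across runs
    input: str           # the question, document, or task fed to the system
    expected: str        # the human-labeled correct output
    metadata: dict = field(default_factory=dict)  # e.g. category, difficulty

# A starter set is a list of such items, often stored as JSONL so humans
# can review and re-grade it when the rubric changes.
golden_set = [
    GoldenItem(
        id="ticket-0001",
        input="My invoice was charged twice this month.",
        expected="billing",
        metadata={"priority": "high"},
    ),
]
```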
When it applies
Build a golden set for any model-powered feature that needs regression protection. Required for RAG systems, classifiers, extractors, and any task where “is the new version better than the old one” is a question the team will ask repeatedly.
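A sketch of what that repeated "is the new version better" check might look like as a CI gate; the metric names and the `regression_gate` helper are hypothetical, and the tolerance is an assumed default:

```python
def regression_gate(candidate: dict, baseline: dict, tolerance: float = 0.01) -> bool:
    """Return False if any tracked metric drops more than `tolerance`
    below the baseline run. Metric names and tolerance are illustrative."""
    ok = True
    for metric, base in baseline.items():
        new = candidate.get(metric, 0.0)
        if new < base - tolerance:
            print(f"REGRESSION: {metric} fell from {base:.3f} to {new:.3f}")
            ok = False
    return ok

# In CI: block the change when the new version scores worse on the golden set.
baseline = {"accuracy": 0.91, "f1_billing": 0.88}
candidate = {"accuracy": 0.92, "f1_billing": 0.85}
if not regression_gate(candidate, baseline):
    raise SystemExit("Golden-set regression detected; change blocked.")
```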
Example
A support-ticket classifier has a 200-item golden set: each item pairs ticket text with its human-labeled priority and category. Every code change runs the classifier against the set; the run reports accuracy and per-category F1 and lists every disagreement for review.
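One way that scoring run might be wired up, assuming scikit-learn and the illustrative `GoldenItem` shape sketched above:

```python
from sklearn.metrics import accuracy_score, classification_report

def score_against_golden_set(classify, golden_set):
    # `classify` is any callable mapping ticket text -> predicted label.
    expected = [item.expected for item in golden_set]
    predicted = [classify(item.input) for item in golden_set]

    print(f"accuracy: {accuracy_score(expected, predicted):.3f}")
    # classification_report prints per-category precision, recall, and F1.
    print(classification_report(expected, predicted, zero_division=0))

    # List every disagreement for human review.
    for item, pred in zip(golden_set, predicted):
        if pred != item.expected:
            print(f"DISAGREEMENT {item.id}: expected {item.expected!r}, got {pred!r}")
```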
Related concepts
- evaluation - the deep-dive on building and maintaining a golden set.
- rag-eval - RAG golden sets pair questions with gold chunk IDs.
- llm-as-judge - the technique that scales scoring past the golden set.
- retrieval-augmented-generation - the system most commonly evaluated against a golden set.
- prompt-design - prompt changes are validated against the golden set.
Citing this term
See Golden set (llmbestpractices.com/glossary/golden-set).