Overview

This page is the atomic definition; the deep dive lives at evaluation.

Definition

A golden set is a hand-labeled evaluation set used as the ground truth for measuring model or retrieval quality. It pairs inputs (questions, documents, tasks) with the correct outputs or annotations (expected answer, gold chunks, expected labels). Every evaluation run scores the system’s outputs against the golden set. Size depends on stakes and budget: 50 to 200 items is common for a starter set; 1,000+ for production-critical systems. The set must be diverse enough to cover the input distribution and small enough that humans can re-grade it when the rubric changes.
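The pairing of inputs with expected outputs can be sketched as a simple record type. This is a hypothetical schema, not a standard format; the field names (`item_id`, `input_text`, `expected`, `gold_chunks`) are illustrative.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class GoldenItem:
    """One hand-labeled item: an input paired with its expected output."""
    item_id: str
    input_text: str   # question, document, or task description
    expected: str     # expected answer or label
    # For RAG: IDs of the chunks that support the expected answer.
    gold_chunks: list = field(default_factory=list)

# A two-item starter set (real sets run 50 to 200+ items).
golden_set = [
    GoldenItem("t-001", "My invoice was charged twice", expected="billing"),
    GoldenItem("t-002", "The app crashes on login", expected="bug"),
]
```

Freezing the dataclass makes items immutable, which keeps the ground truth stable between runs; changes go through deliberate re-labeling, not in-place mutation.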

When it applies

Build a golden set for any model-powered feature that needs regression protection. Required for RAG systems, classifiers, extractors, and any task where “is the new version better than the old one” is a question the team will ask repeatedly.

Example

A support-ticket classifier has a 200-question golden set: ticket text plus the human-labeled priority and category. Every code change triggers a run of the classifier against the set; the run reports accuracy and per-category F1, and lists every disagreement for review.
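A run like the one above can be sketched in a few dozen lines. This is a minimal illustration, assuming the classifier is any callable from text to label and the golden set is a list of (text, label) pairs; the toy keyword classifier exists only to make the sketch runnable.

```python
from collections import defaultdict

def evaluate(classifier, golden_set):
    """Score predictions against a golden set of (text, gold_label) pairs.

    Returns accuracy, per-category F1, and the list of disagreements
    for human review.
    """
    tp = defaultdict(int)  # true positives per category
    fp = defaultdict(int)  # false positives per category
    fn = defaultdict(int)  # false negatives per category
    disagreements = []
    correct = 0
    for text, gold in golden_set:
        pred = classifier(text)
        if pred == gold:
            correct += 1
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1
            disagreements.append((text, gold, pred))
    f1 = {}
    for cat in set(tp) | set(fp) | set(fn):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1[cat] = 2 * p * r / (p + r) if p + r else 0.0
    return correct / len(golden_set), f1, disagreements

# Trivial stand-in classifier, purely for illustration.
def keyword_classifier(text):
    return "billing" if "invoice" in text or "charge" in text else "bug"

gold = [
    ("My invoice was charged twice", "billing"),
    ("The app crashes on login", "bug"),
    ("Refund the duplicate charge", "billing"),
    ("Login button does nothing", "bug"),
]
acc, f1, diffs = evaluate(keyword_classifier, gold)
```

In CI, the run would fail (or flag) when `acc` drops below the previous baseline or `diffs` grows, turning the golden set into regression protection.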

Citing this term

See Golden set (llmbestpractices.com/glossary/golden-set).