Overview
This page is the atomic definition of the term; the infrastructure to run evaluations lives at evaluation-harness.
Definition
An eval set (evaluation set) is a fixed dataset of input prompts paired with expected outputs or evaluation criteria, used to measure and compare the performance of LLM systems. It serves the same function as a test suite in software engineering: a stable reference that lets you detect when a change (new model, new prompt, new retrieval strategy) has improved or degraded behavior.

A well-constructed eval set is:
- representative of the production input distribution;
- diverse across edge cases and difficulty levels;
- balanced across categories;
- large enough to produce statistically stable metrics (typically 200+ examples for single-metric evals).

Eval sets must be held out from model training and from few-shot example selection. They decay over time as the domain changes and must be refreshed. For rag systems, each entry typically includes a query, an expected answer, and the expected source document(s); one possible entry format is sketched below. Metrics computed over the eval set include accuracy, hallucination-rate, faithfulness, relevance, and latency.
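As a concrete illustration, here is one way an entry might be represented in code. The EvalExample class, its field names, and the JSONL loader are assumptions for this sketch, not a standard schema.

    # Minimal sketch of an eval set entry for a rag-style Q&A system.
    # Field names and the JSONL layout are illustrative assumptions.
    from dataclasses import dataclass, field
    import json

    @dataclass
    class EvalExample:
        query: str             # input prompt
        expected_answer: str   # verified ground-truth output
        expected_sources: list[str] = field(default_factory=list)  # doc IDs that support the answer
        category: str = "general"  # used to check balance across categories

    def load_eval_set(path: str) -> list[EvalExample]:
        """Load an eval set stored as JSONL, one example per line."""
        with open(path) as f:
            return [EvalExample(**json.loads(line)) for line in f]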
When it applies
Build an eval set before deploying any LLM feature to production. Run it on every model upgrade, prompt change, and retrieval configuration change. Treat a failing eval set as a blocking condition for deployment, the same as a failing test suite.
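A minimal sketch of such a blocking gate, assuming per-run scores in [0, 1] produced elsewhere (e.g. by the evaluation-harness); the function name and tolerance are illustrative:

    # Illustrative CI-style deployment gate: fail the job if the candidate
    # configuration regresses on the eval set. REGRESSION_TOLERANCE and the
    # score inputs are assumptions for this sketch.
    import sys

    REGRESSION_TOLERANCE = 0.01  # absorb up to one point of metric noise

    def gate(candidate_score: float, baseline_score: float) -> None:
        if candidate_score < baseline_score - REGRESSION_TOLERANCE:
            print(f"BLOCKED: {candidate_score:.3f} regressed from baseline {baseline_score:.3f}")
            sys.exit(1)  # nonzero exit blocks deployment, like a failing test suite
        print(f"PASSED: {candidate_score:.3f} vs baseline {baseline_score:.3f}")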
Example
A document Q&A system has an eval set of 400 questions. Each entry has a question, the document it references, and the expected answer. When a new embedding model is evaluated, it scores 91% faithfulness versus 88% for the old model. The new model ships. Three months later, 20 questions are added to cover newly observed failure modes.
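Before shipping on a 3-point gain, it is worth checking that the difference exceeds eval set noise. Below is a sketch assuming the harness produced per-question 0/1 faithfulness judgments for both models; the paired bootstrap shown is one common approach, not the only one.

    # Paired bootstrap over per-question faithfulness judgments (0/1).
    # `old` and `new` would come from running both models on the same
    # 400-question eval set.
    import random

    def bootstrap_win_rate(old: list[int], new: list[int], iters: int = 10_000) -> float:
        """Fraction of resamples in which the new model beats the old one."""
        n = len(old)
        wins = 0
        for _ in range(iters):
            idx = [random.randrange(n) for _ in range(n)]  # resample questions with replacement
            if sum(new[i] - old[i] for i in idx) > 0:
                wins += 1
        return wins / iters

    # A win rate above ~0.95 suggests the 91% vs 88% gap is not resampling noise.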
Related concepts
- ground-truth: the verified answers that make up the expected output side of the eval set.
- golden-set: a high-confidence subset of the eval set used for quick regression checks.
- evaluation-harness: the runner that executes the eval set and reports metrics.
- hallucination-rate: one of the primary metrics computed over an eval set.
- llm-as-judge: automated scoring that enables eval sets to scale beyond human annotation.
Citing this term
See Eval Set (llmbestpractices.com/glossary/eval-set).