Overview
This page is the atomic definition of the term; the infrastructure to run evaluations lives at evaluation-harness.
Definition
An eval set (evaluation set) is a fixed dataset of input prompts paired with expected outputs or evaluation criteria, used to measure and compare the performance of LLM systems. It serves the same function as a test suite in software engineering: a stable reference that lets you detect when a change (new model, new prompt, new retrieval strategy) has improved or degraded behavior.

A well-constructed eval set is:
- representative of the production input distribution;
- diverse across edge cases and difficulty levels;
- balanced across categories;
- large enough to produce statistically stable metrics (typically 200+ examples for single-metric evals).

Eval sets must be held out from model training and from few-shot example selection. They decay over time as the domain changes and must be refreshed. For rag systems, each entry typically includes a query, an expected answer, and the expected source document(s); one possible entry format is sketched below. Metrics computed over the eval set include accuracy, hallucination-rate, faithfulness, relevance, and latency.
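As a concrete illustration, here is one way an entry might be represented in code. The EvalExample class, its field names, and the JSONL loader are assumptions for this sketch, not a standard schema.

    # Minimal sketch of an eval set entry for a rag-style Q&A system.
    # Field names and the JSONL layout are illustrative assumptions.
    from dataclasses import dataclass, field
    import json

    @dataclass
    class EvalExample:
        query: str             # input prompt
        expected_answer: str   # verified ground-truth output
        expected_sources: list[str] = field(default_factory=list)  # doc IDs that support the answer
        category: str = "general"  # used to check balance across categories

    def load_eval_set(path: str) -> list[EvalExample]:
        """Load an eval set stored as JSONL, one example per line."""
        with open(path) as f:
            return [EvalExample(**json.loads(line)) for line in f]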
When it applies
Build an eval set before deploying any LLM feature to production. Run it on every model upgrade, prompt change, and retrieval configuration change. Treat a failing eval set as a blocking condition for deployment, the same as a failing test suite.
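A minimal sketch of such a blocking gate, assuming per-run scores in [0, 1] produced elsewhere (e.g. by the evaluation-harness); the function name and tolerance are illustrative:

    # Illustrative CI-style deployment gate: fail the job if the candidate
    # configuration regresses on the eval set. REGRESSION_TOLERANCE and the
    # score inputs are assumptions for this sketch.
    import sys

    REGRESSION_TOLERANCE = 0.01  # absorb up to one point of metric noise

    def gate(candidate_score: float, baseline_score: float) -> None:
        if candidate_score < baseline_score - REGRESSION_TOLERANCE:
            print(f"BLOCKED: {candidate_score:.3f} regressed from baseline {baseline_score:.3f}")
            sys.exit(1)  # nonzero exit blocks deployment, like a failing test suite
        print(f"PASSED: {candidate_score:.3f} vs baseline {baseline_score:.3f}")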
Example
A document Q&A system has an eval set of 400 questions. Each entry has a question, the document it references, and the expected answer. When a new embedding model is evaluated, it scores 91% faithfulness versus 88% for the old model. The new model ships. Three months later, 20 questions are added to cover newly observed failure modes.
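Before shipping on a 3-point gain, it is worth checking that the difference exceeds eval set noise. Below is a sketch assuming the harness produced per-question 0/1 faithfulness judgments for both models; the paired bootstrap shown is one common approach, not the only one.

    # Paired bootstrap over per-question faithfulness judgments (0/1).
    # `old` and `new` would come from running both models on the same
    # 400-question eval set.
    import random

    def bootstrap_win_rate(old: list[int], new: list[int], iters: int = 10_000) -> float:
        """Fraction of resamples in which the new model beats the old one."""
        n = len(old)
        wins = 0
        for _ in range(iters):
            idx = [random.randrange(n) for _ in range(n)]  # resample questions with replacement
            if sum(new[i] - old[i] for i in idx) > 0:
                wins += 1
        return wins / iters

    # A win rate above ~0.95 suggests the 91% vs 88% gap is not resampling noise.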
Related concepts
- ground-truth: the verified answers that make up the expected output side of the eval set.
- golden-set: a high-confidence subset of the eval set used for quick regression checks.
- evaluation-harness: the runner that executes the eval set and reports metrics.
- hallucination-rate: one of the primary metrics computed over an eval set.
- llm-as-judge: automated scoring that enables eval sets to scale beyond human annotation.
Citing this term
See Eval Set (llmbestpractices.com/glossary/eval-set).