Definition
An evaluation harness is a repeatable test framework for LLM systems. It takes a test set of inputs with expected outputs, runs each input through the system under test, computes metrics, and reports aggregate scores. The harness enables regression detection, A/B comparison of prompts, and model upgrade validation.
Components of a minimal harness:
- Dataset loader: loads test cases (prompt, expected output, optional metadata).
- Runner: calls the model or pipeline for each test case, with retries and rate limiting.
- Scorer: computes per-example scores (exact match, ROUGE, F1, LLM-as-judge).
- Aggregator: computes aggregate metrics (mean score, pass@k, category breakdowns).
- Reporter: outputs results to a file, database, or CI dashboard.
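The five components above can be wired together in a few lines. This is a minimal sketch with illustrative names (`TestCase`, `run_harness`, `runner`, `scorer` are not from any specific framework); the runner and scorer are injected so the same harness can test different models and metrics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    expected: str

def run_harness(
    cases: list[TestCase],
    runner: Callable[[str], str],          # calls the model or pipeline
    scorer: Callable[[str, str], float],   # per-example score
) -> dict:
    # Score each case, then aggregate; a reporter would persist this dict.
    scores = [scorer(runner(c.prompt), c.expected) for c in cases]
    return {"mean_score": mean(scores), "n": len(scores)}

# Usage with a stub runner (echoes the prompt) and an exact-match scorer:
cases = [TestCase("hi", "hi"), TestCase("yo", "no")]
result = run_harness(
    cases,
    runner=lambda p: p,
    scorer=lambda pred, exp: 1.0 if pred == exp else 0.0,
)
# result == {"mean_score": 0.5, "n": 2}
```

Swapping the stub runner for a real model call turns this into a working harness without touching the scoring or aggregation code.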
Harness frameworks and tools: OpenAI Evals, EleutherAI lm-evaluation-harness, and promptfoo (open source); LangSmith and Braintrust (commercial).
An eval harness should be version-controlled and run in CI on every prompt or model change. A failing eval should block deployment.
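A deployment gate can be a small script that exits non-zero when the aggregate score drops below a threshold, which most CI systems treat as a failed build. A minimal sketch, assuming a `result` dict shaped like the harness output (the threshold value and function name are illustrative):

```python
THRESHOLD = 0.90  # illustrative; tune per task

def gate(result: dict, threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 if the eval passes, 1 if it fails."""
    if result["accuracy"] < threshold:
        print(f"EVAL FAILED: accuracy {result['accuracy']:.2%} < {threshold:.2%}")
        return 1
    print(f"eval passed: accuracy {result['accuracy']:.2%}")
    return 0

# In CI: sys.exit(gate(run_eval(...)))  — a non-zero exit blocks deployment.
```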
When it applies
Build an evaluation harness before iterating on prompts or models. Without it, “improvements” are anecdotal. Define success metrics before collecting data; retroactively justifying metrics is a common bias trap.
Start with exact-match or regex scoring for classification and extraction tasks (cheap, fast, deterministic). Add LLM-as-judge scoring for open-ended generation tasks where human evaluation does not scale.
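Deterministic scorers of the kind described above are short enough to write inline. A sketch of the two common variants (function names are illustrative):

```python
import re

def exact_match(predicted: str, expected: str) -> float:
    # Normalize whitespace and case so trivial formatting differences don't fail.
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0

def regex_match(predicted: str, pattern: str) -> float:
    # Score 1.0 if the expected pattern appears anywhere in the output.
    return 1.0 if re.search(pattern, predicted) else 0.0

# exact_match("Positive ", "positive") -> 1.0
# regex_match("The answer is 42.", r"\b42\b") -> 1.0
```

Regex scoring is useful when the model wraps the answer in extra words ("The sentiment is positive") and you only need to detect the label.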
Example
import json
from statistics import mean

import anthropic

client = anthropic.Anthropic()

def run_eval(test_cases: list[dict], prompt_template: str) -> dict:
    scores = []
    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt_template.format(**case)}],
        )
        predicted = response.content[0].text.strip()
        # Exact-match scoring: 1.0 for a correct label, 0.0 otherwise.
        scores.append(1.0 if predicted == case["expected"] else 0.0)
    return {"accuracy": mean(scores), "n": len(scores)}

with open("test_set.json") as f:
    test_set = json.load(f)

result = run_eval(test_set, "Classify sentiment: {text}\nAnswer: positive or negative.")
print(result)
Related concepts
- golden-set - the test data the harness runs against; quality of the golden set limits eval validity.
- llm-as-judge - a scoring method used inside harnesses for open-ended generation quality.
- fine-tuning - harnesses validate whether fine-tuning improves over the base model.
- hallucination - hallucination rates can be measured as a metric in an eval harness.
- evaluation - the deep-dive on evaluation strategy and metric choice.
Citing this term
See Evaluation Harness (llmbestpractices.com/glossary/evaluation-harness).