Definition

An evaluation harness is a repeatable test framework for LLM systems. It takes a test set of inputs with expected outputs, runs each input through the system under test, computes metrics, and reports aggregate scores. The harness enables regression detection, A/B comparison of prompts, and model upgrade validation.

Components of a minimal harness:

  1. Dataset loader: loads test cases (prompt, expected output, optional metadata).
  2. Runner: calls the model or pipeline for each test case, with retries and rate limiting (sketched after this list).
  3. Scorer: computes per-example scores (exact match, ROUGE, F1, LLM-as-judge).
  4. Aggregator: computes aggregate metrics (mean score, pass@k, category breakdowns).
  5. Reporter: outputs results to a file, database, or CI dashboard.

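The runner is usually the only component that touches the network, so it owns retries and rate limiting. A minimal sketch, assuming the Anthropic Python SDK; the function name and backoff parameters are illustrative, not prescriptive:

import time
import anthropic

client = anthropic.Anthropic()

def call_with_retries(prompt: str, max_attempts: int = 3, base_delay: float = 2.0) -> str:
    """Runner sketch: call the model, backing off exponentially on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=256,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except anthropic.RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 2s, 4s, 8s, ...
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
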
Harness tooling: OpenAI Evals, EleutherAI lm-evaluation-harness, and promptfoo are open source; LangSmith and Braintrust are hosted platforms with built-in evaluation support.

An eval harness should be version-controlled and run in CI on every prompt or model change. A failing eval should block deployment.
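
In CI, the gate can be a short script that runs the eval and exits nonzero when a metric drops below a threshold. A minimal sketch, assuming the run_eval function from the example below lives in a hypothetical eval_harness module and using an illustrative 0.90 accuracy bar:

import json
import sys

from eval_harness import run_eval  # hypothetical module holding run_eval from the example below

ACCURACY_THRESHOLD = 0.90  # illustrative gate value

with open("test_set.json") as f:
    test_set = json.load(f)

result = run_eval(test_set, "Classify sentiment: {text}\nAnswer: positive or negative.")
print(json.dumps(result))

if result["accuracy"] < ACCURACY_THRESHOLD:
    sys.exit(1)  # nonzero exit fails the CI job and blocks deployment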

When it applies

Build an evaluation harness before iterating on prompts or models. Without it, “improvements” are anecdotal. Define success metrics before collecting data; retroactively justifying metrics is a common bias trap.

Start with exact-match or regex scoring for classification and extraction tasks (cheap, fast, deterministic). Add LLM-as-judge scoring for open-ended generation tasks where human evaluation does not scale.
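
Both styles of scorer fit the same per-example interface. A minimal sketch, assuming the Anthropic Python SDK for the judge; the 1-5 rubric, prompt wording, and parsing are illustrative assumptions:

import re
import anthropic

client = anthropic.Anthropic()

def regex_scorer(predicted: str, expected: str) -> float:
    """Deterministic scorer: 1.0 if the expected label appears as a whole word."""
    return 1.0 if re.search(rf"\b{re.escape(expected)}\b", predicted, re.IGNORECASE) else 0.0

def judge_scorer(question: str, predicted: str) -> float:
    """LLM-as-judge scorer: ask a model to rate the answer 1-5, normalized to 0-1."""
    judge_prompt = (
        f"Question: {question}\nAnswer: {predicted}\n"
        "Rate the answer's quality from 1 (poor) to 5 (excellent). Reply with only the number."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    match = re.search(r"[1-5]", response.content[0].text)
    return (int(match.group()) - 1) / 4 if match else 0.0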

Example

import json
from statistics import mean

import anthropic

client = anthropic.Anthropic()

def run_eval(test_cases: list[dict], prompt_template: str) -> dict:
    """Run each test case through the model and score it with exact match."""
    scores = []
    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt_template.format(**case)}],
        )
        predicted = response.content[0].text.strip()
        # Exact-match scoring: 1.0 only if the prediction equals the expected label.
        scores.append(1.0 if predicted == case["expected"] else 0.0)
    return {"accuracy": mean(scores), "n": len(scores)}

with open("test_set.json") as f:
    test_set = json.load(f)

result = run_eval(test_set, "Classify sentiment: {text}\nAnswer: positive or negative.")
print(result)

Related terms

  • golden-set - the test data the harness runs against; quality of the golden set limits eval validity.
  • llm-as-judge - a scoring method used inside harnesses for open-ended generation quality.
  • fine-tuning - harnesses validate whether fine-tuning improves over the base model.
  • hallucination - hallucination rates can be measured as a metric in an eval harness.
  • evaluation - the deep-dive on evaluation strategy and metric choice.

Citing this term

See Evaluation Harness (llmbestpractices.com/glossary/evaluation-harness).