Definition

An evaluation harness is a repeatable test framework for LLM systems. It takes a test set of inputs with expected outputs, runs each input through the system under test, computes metrics, and reports aggregate scores. The harness enables regression detection, A/B comparison of prompts, and model upgrade validation.

Components of a minimal harness:

  1. Dataset loader: loads test cases (prompt, expected output, optional metadata).
  2. Runner: calls the model or pipeline for each test case, with retries and rate limiting (sketched after this list).
  3. Scorer: computes per-example scores (exact match, ROUGE, F1, LLM-as-judge).
  4. Aggregator: computes aggregate metrics (mean score, pass@k, category breakdowns).
  5. Reporter: outputs results to a file, database, or CI dashboard.

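The runner is usually the only component that touches the network, so it owns retries and rate limiting. A minimal sketch, assuming the Anthropic Python SDK; the function name and backoff parameters are illustrative, not prescriptive:

import time
import anthropic

client = anthropic.Anthropic()

def call_with_retries(prompt: str, max_attempts: int = 3, base_delay: float = 2.0) -> str:
    """Runner sketch: call the model, backing off exponentially on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=256,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except anthropic.RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 2s, 4s, 8s, ...
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
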
Harness tooling: OpenAI Evals, EleutherAI lm-evaluation-harness, and promptfoo are open source; LangSmith and Braintrust are hosted platforms with built-in evaluation support.

An eval harness should be version-controlled and run in CI on every prompt or model change. A failing eval should block deployment.
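
In CI, the gate can be a short script that runs the eval and exits nonzero when a metric drops below a threshold. A minimal sketch, assuming the run_eval function from the example below lives in a hypothetical eval_harness module and using an illustrative 0.90 accuracy bar:

import json
import sys

from eval_harness import run_eval  # hypothetical module holding run_eval from the example below

ACCURACY_THRESHOLD = 0.90  # illustrative gate value

with open("test_set.json") as f:
    test_set = json.load(f)

result = run_eval(test_set, "Classify sentiment: {text}\nAnswer: positive or negative.")
print(json.dumps(result))

if result["accuracy"] < ACCURACY_THRESHOLD:
    sys.exit(1)  # nonzero exit fails the CI job and blocks deployment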

When it applies

Build an evaluation harness before iterating on prompts or models. Without it, “improvements” are anecdotal. Define success metrics before collecting data; retroactively justifying metrics is a common bias trap.

Start with exact-match or regex scoring for classification and extraction tasks (cheap, fast, deterministic). Add LLM-as-judge scoring for open-ended generation tasks where human evaluation does not scale.
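
Both styles of scorer fit the same per-example interface. A minimal sketch, assuming the Anthropic Python SDK for the judge; the 1-5 rubric, prompt wording, and parsing are illustrative assumptions:

import re
import anthropic

client = anthropic.Anthropic()

def regex_scorer(predicted: str, expected: str) -> float:
    """Deterministic scorer: 1.0 if the expected label appears as a whole word."""
    return 1.0 if re.search(rf"\b{re.escape(expected)}\b", predicted, re.IGNORECASE) else 0.0

def judge_scorer(question: str, predicted: str) -> float:
    """LLM-as-judge scorer: ask a model to rate the answer 1-5, normalized to 0-1."""
    judge_prompt = (
        f"Question: {question}\nAnswer: {predicted}\n"
        "Rate the answer's quality from 1 (poor) to 5 (excellent). Reply with only the number."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    match = re.search(r"[1-5]", response.content[0].text)
    return (int(match.group()) - 1) / 4 if match else 0.0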

Example

import json
from statistics import mean

import anthropic

client = anthropic.Anthropic()

def run_eval(test_cases: list[dict], prompt_template: str) -> dict:
    """Run each test case through the model and score it with exact match."""
    scores = []
    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt_template.format(**case)}],
        )
        predicted = response.content[0].text.strip()
        # Exact-match scoring: 1.0 only if the prediction equals the expected label.
        scores.append(1.0 if predicted == case["expected"] else 0.0)
    return {"accuracy": mean(scores), "n": len(scores)}

with open("test_set.json") as f:
    test_set = json.load(f)

result = run_eval(test_set, "Classify sentiment: {text}\nAnswer: positive or negative.")
print(result)

Related terms

  • golden-set - the test data the harness runs against; quality of the golden set limits eval validity.
  • llm-as-judge - a scoring method used inside harnesses for open-ended generation quality.
  • fine-tuning - harnesses validate whether fine-tuning improves over the base model.
  • hallucination - hallucination rates can be measured as a metric in an eval harness.
  • evaluation - the deep-dive on evaluation strategy and metric choice.

Citing this term

See Evaluation Harness (llmbestpractices.com/glossary/evaluation-harness).