Overview

A prompt is production code. It needs an eval set, a metric, a regression run, and a promotion gate, exactly the way a function needs unit tests. Without this loop, every prompt change is a guess and every model upgrade is a coin flip. Build the eval set first; iterate on the prompt second.

Build a fixed eval set of known-good outputs

Start with 50 to 200 examples that cover the input distribution the production prompt will actually see. Each example is a pair: an input the user might send, and the output a domain expert would accept.

prompts/
  evals/
    classifier.v3.cases.json

classifier.v3.cases.json
[
  {
    "id": "billing-001",
    "input": "I was charged twice for my October subscription.",
    "expected": {"label": "billing", "confidence_min": 0.7}
  },
  {
    "id": "technical-001",
    "input": "The dashboard shows a 502 error after I log in.",
    "expected": {"label": "technical", "confidence_min": 0.7}
  }
]

Source the cases from real traffic where possible. Synthetic cases are fine for bootstrapping, but supplement them with production samples as soon as you have any. See eval-set and golden-set for the distinction between a regression set and a held-out gold standard.
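
Before trusting a case file, it can help to check it mechanically. The following is a minimal Python sketch, assuming the case shape from the example file (id, input, expected); validate_cases and REQUIRED_KEYS are illustrative names, not part of any existing tool.

```python
import json

# Assumed case shape, taken from the example file: id, input, expected.
REQUIRED_KEYS = {"id", "input", "expected"}


def validate_cases(cases):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
            continue
        if case["id"] in seen_ids:
            problems.append(f"case {index}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        if not str(case["input"]).strip():
            problems.append(f"case {case['id']}: empty input")
    return problems


cases = json.loads("""
[
  {"id": "billing-001",
   "input": "I was charged twice for my October subscription.",
   "expected": {"label": "billing", "confidence_min": 0.7}},
  {"id": "technical-001",
   "input": "The dashboard shows a 502 error after I log in.",
   "expected": {"label": "technical", "confidence_min": 0.7}}
]
""")
print(validate_cases(cases))  # prints [] for the two cases above
```

Running a check like this in the same CI job as the eval keeps a malformed case from silently shrinking the set.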

Pick a metric that matches the task

The wrong metric makes a failing prompt look healthy. Pick the one that maps to the application’s actual success criterion.

  • Exact match. Extraction, classification, code that must parse. Strict; easy to gate on.
  • Structured-field match. JSON outputs where some fields are required exact and others are tolerant (e.g., label exact, rationale free-text).
  • BLEU, ROUGE, BERTScore. Translation and summarization. Noisy at small N; complement with a human spot-check.
  • LLM-as-judge. Open-ended outputs (chat replies, long-form writing). A separate strong model grades each output on a rubric. See llm-as-judge.
  • Pass-through to downstream. SQL generation, code synthesis. Run the generated artifact and grade on whether it executes correctly.
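
The pass-through metric can be sketched for SQL generation with Python's built-in sqlite3 module. This is an illustration under assumptions, not a prescribed harness: executes_correctly, the orders table, and the fixture rows are all hypothetical, and a real application would run against its own schema.

```python
import sqlite3


def executes_correctly(generated_sql, setup_sql, expected_rows):
    """Run model-generated SQL on a throwaway in-memory database and
    grade it on the rows it returns, ignoring row order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # build the fixture schema and data
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to run counts as a failed case
    finally:
        conn.close()
    return sorted(rows) == sorted(expected_rows)


# Hypothetical fixture for one eval case.
SETUP = """
CREATE TABLE orders (id INTEGER, total REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.5);
"""

print(executes_correctly("SELECT id FROM orders WHERE total > 20", SETUP, [(2,)]))
print(executes_correctly("SELEC id FROM orders", SETUP, [(2,)]))
```

Grading on returned rows rather than on the SQL text accepts any query that produces the right answer, which is usually the behavior the application cares about.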

A production prompt usually needs two metrics: one strict (does it parse?) and one semantic (is the answer right?). Track both.
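
As a rough sketch of tracking a strict and a semantic metric side by side, assume the classifier returns a JSON object with a label and a confidence field. The output field name "confidence" and the function names are assumptions made for illustration; only label and confidence_min appear in the case file.

```python
import json


def parse_output(raw_output):
    """Strict metric: return the parsed dict, or None if the output
    is not a JSON object."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def field_match(parsed, expected):
    """Semantic metric: the label must match exactly and the confidence
    must clear the floor set in the case."""
    if parsed is None:
        return False
    if parsed.get("label") != expected["label"]:
        return False
    confidence = parsed.get("confidence", 0.0)
    floor = expected.get("confidence_min", 0.0)
    return isinstance(confidence, (int, float)) and confidence >= floor


def score(outputs, cases):
    """Return (parse_rate, match_rate) for outputs paired with cases."""
    parsed = [parse_output(o) for o in outputs]
    parse_rate = sum(p is not None for p in parsed) / len(cases)
    match_rate = sum(
        field_match(p, c["expected"]) for p, c in zip(parsed, cases)
    ) / len(cases)
    return parse_rate, match_rate
```

Reporting the two rates separately shows whether a regression comes from broken formatting or from wrong answers.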

Run the eval set against every prompt change before merging

Wire the eval run into CI. A prompt change is a code change and gets the same gate as any other code change.

# .github/workflows/prompt-eval.yml
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: |
          npm run eval -- \
            --prompt prompts/classifier.v3.prompt.md \
            --cases prompts/evals/classifier.v3.cases.json

The CI run posts the score on the PR. A drop in score is a regression and blocks merge.

For LLM-as-judge metrics, pin the judge model and the judge prompt so the score is reproducible across runs. A drifting judge is a useless judge.
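
One way to make judge pinning checkable is to store a fingerprint of the judge configuration with every score and refuse to compare scores whose fingerprints differ. The sketch below assumes a hypothetical JudgeConfig record and a placeholder model id; neither comes from a real library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeConfig:
    model: str                # exact, versioned model id, not a floating alias
    prompt: str               # full rubric text given to the judge
    temperature: float = 0.0  # assumed setting to reduce run-to-run variation

    def fingerprint(self):
        """Short stable hash of everything that can change the judge's scores."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(run_a, run_b):
    """Two scores can be compared only if the same judge produced them."""
    return run_a["judge"] == run_b["judge"]


# "judge-model-2026-01-15" is a placeholder id, not a real model name.
judge = JudgeConfig(model="judge-model-2026-01-15",
                    prompt="Score the reply from 1 to 5 for accuracy.")
run = {"score": 0.87, "judge": judge.fingerprint()}
```

Any edit to the rubric or the model id changes the fingerprint, so an accidental judge change shows up as an incomparable run instead of a silent score shift.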

Track the eval score over time

Save the score, the prompt hash, the model version, and the timestamp on every run. A dashboard of score-over-time tells you whether the prompt is improving, regressing, or holding steady when the model changes underneath it.

classifier.v3.runs.json
[
  { "ts": "2026-05-15T10:00:00Z", "prompt": "v3.a", "model": "claude-opus-4-7", "score": 0.87 },
  { "ts": "2026-05-15T14:00:00Z", "prompt": "v3.b", "model": "claude-opus-4-7", "score": 0.91 },
  { "ts": "2026-05-16T09:00:00Z", "prompt": "v3.b", "model": "claude-opus-4-8", "score": 0.88 }
]

The last row is the entire reason this tracking exists: the prompt did not change, the model did, and the score fell from 0.91 to 0.88. Without history, that drop looks like noise.
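
A minimal sketch of the record-keeping, assuming the record fields from the example rows plus a prompt_hash field derived from the prompt text. make_record, model_regressions, and the min_drop threshold are illustrative choices, not an established convention.

```python
import hashlib
from datetime import datetime, timezone


def make_record(prompt_text, prompt_label, model, score, now=None):
    """Build one history row; the hash ties the score to the exact prompt text."""
    now = now or datetime.now(timezone.utc)
    return {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "prompt": prompt_label,
        "prompt_hash": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12],
        "model": model,
        "score": score,
    }


def model_regressions(history, min_drop=0.01):
    """Return (previous, current) pairs where the prompt stayed the same,
    the model changed, and the score fell by at least min_drop."""
    flagged = []
    for prev, cur in zip(history, history[1:]):
        same_prompt = prev["prompt"] == cur["prompt"]
        new_model = prev["model"] != cur["model"]
        if same_prompt and new_model and prev["score"] - cur["score"] >= min_drop:
            flagged.append((prev, cur))
    return flagged


history = [
    {"ts": "2026-05-15T10:00:00Z", "prompt": "v3.a", "model": "claude-opus-4-7", "score": 0.87},
    {"ts": "2026-05-15T14:00:00Z", "prompt": "v3.b", "model": "claude-opus-4-7", "score": 0.91},
    {"ts": "2026-05-16T09:00:00Z", "prompt": "v3.b", "model": "claude-opus-4-8", "score": 0.88},
]
for prev, cur in model_regressions(history):
    print(f"{prev['model']} -> {cur['model']}: {prev['score']} -> {cur['score']}")
```

On the three example rows this flags only the final transition, where the model changed under an unchanged prompt.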

Promote a prompt only when it beats production at a confidence interval

A single eval-set score is a point estimate. Two prompts that score 0.87 and 0.89 on a 100-case set are usually not statistically distinguishable: the gap is two cases, well inside the sampling error of either score.

Use a confidence interval or a sign test:

  • Run both prompts on the same cases.
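  • Compute the per-case win rate of the new prompt against the production prompt, counting only the cases where the two differ. Ties carry no information.
  • Promote only when the new prompt wins a clear majority: a Wilson interval on the win rate whose lower bound is above 50 percent. On 100 decided cases at 95 percent confidence that takes about 60 wins; 55 wins is not enough.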

When the metric is cheap to compute, run the eval several times and average the scores to smooth out sampling noise. LLM judges are noisy on a single run; treat the average of three runs as the floor.
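
The promotion rule can be sketched with the standard library alone. The code below is a minimal illustration, assuming per-case scores for both prompts on the same cases in the same order; should_promote and its 50 percent threshold are one reasonable reading of the rule, not a fixed recipe.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95 percent."""
    if n <= 0:
        raise ValueError("need at least one decided case")
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test against a 50/50 null hypothesis.
    The caller is expected to have dropped ties already."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def should_promote(new_scores, prod_scores):
    """Promote only if the Wilson lower bound on the win rate exceeds 0.5."""
    wins = sum(a > b for a, b in zip(new_scores, prod_scores))
    losses = sum(a < b for a, b in zip(new_scores, prod_scores))
    if wins + losses == 0:
        return False  # identical on every case: no evidence either way
    lower, _ = wilson_interval(wins, wins + losses)
    return lower > 0.5
```

Ties are dropped before the win rate is computed, which is the usual sign-test convention. The p-value is there as a cross-check on the interval, not as a second gate.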

Store prompts and evals next to each other

prompts/
  classifier.v3.prompt.md
  classifier.v3.examples.json
  evals/
    classifier.v3.cases.json
    classifier.v3.runs.json

The prompt file and the eval set move together as a unit. When the prompt schema changes, the eval cases get updated in the same commit. The history of the prompt and the history of its eval cases live in the same git log. See prompt-templates for the storage convention.
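
A small check can enforce the pairing in CI. This sketch assumes the layout shown in the tree (prompts/NAME.prompt.md paired with prompts/evals/NAME.cases.json) and takes a plain list of tracked paths; the function names are hypothetical.

```python
from pathlib import PurePosixPath

PROMPT_SUFFIX = ".prompt.md"


def expected_cases_path(prompt_path):
    """Map prompts/NAME.prompt.md to prompts/evals/NAME.cases.json."""
    path = PurePosixPath(prompt_path)
    stem = path.name[: -len(PROMPT_SUFFIX)]
    return str(path.parent / "evals" / f"{stem}.cases.json")


def prompts_without_cases(tracked_files):
    """Return prompt files that have no eval case file next to them."""
    tracked = set(tracked_files)
    return sorted(
        f for f in tracked
        if f.endswith(PROMPT_SUFFIX) and expected_cases_path(f) not in tracked
    )


files = [
    "prompts/classifier.v3.prompt.md",
    "prompts/evals/classifier.v3.cases.json",
    "prompts/router.v1.prompt.md",  # hypothetical prompt with no eval set
]
print(prompts_without_cases(files))
```

Feeding it the repository's tracked file list and failing the build on a non-empty result would keep a prompt from landing without its eval cases.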

Pitfalls

  • Skipping the eval set because “we can tell if the output looks right.” Subjective evaluation is not reproducible and does not catch model-upgrade regressions.
  • Evaluating only the new prompt without re-evaluating the current production prompt on the same cases. You need both scores; a single score with nothing to compare against is meaningless.
  • Using a tiny eval set (10 cases). The variance swamps the signal. 50 is the floor; 200 is healthy.
  • Letting the judge model drift. Pin the judge model version and the judge prompt; a new judge invalidates the score history.
  • Mixing the regression set with the gold set. The regression set teaches the prompt; the gold set audits it. Never tune against the gold set.