Overview

This page is the atomic definition. The deep-dive lives at evaluation.

Definition

LLM-as-judge is the practice of using a language model to score the outputs of another model against a rubric. The judge model reads the task, the candidate response, and (optionally) a reference answer, then returns a score plus a rationale. It is used to scale evaluation beyond hand-grading. Judge prompts must be calibrated: known failure modes include positional bias (preferring the first option presented), verbosity bias (preferring longer answers), and self-preference (a model rating its own outputs higher). Pair LLM-as-judge with a hand-graded golden set to verify that the judge agrees with human raters at an acceptable correlation (often Pearson 0.6+).
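
As a rough sketch of the mechanics (the call_model helper, the prompt wording, and the JSON output shape are illustrative assumptions, not a fixed API), a single judge call plus the golden-set calibration check might look like this:

    import json
    import statistics

    JUDGE_PROMPT = """\
    You are grading a candidate answer against a rubric.

    Task: {task}
    Candidate answer: {candidate}
    Reference answer (may be empty): {reference}

    Score the candidate from 1 (unusable) to 5 (fully correct).
    Return JSON: {{"score": <1-5>, "rationale": "<one short paragraph>"}}
    """

    def judge_one(call_model, task, candidate, reference=""):
        # call_model is a stand-in for whatever client calls the judge model:
        # it takes a prompt string and returns the completion text.
        raw = call_model(JUDGE_PROMPT.format(
            task=task, candidate=candidate, reference=reference))
        result = json.loads(raw)
        return int(result["score"]), result["rationale"]

    def calibrate(judge_scores, human_scores, threshold=0.6):
        # Agreement check against a hand-graded golden set: Pearson r
        # between judge scores and human scores (statistics.correlation
        # requires Python 3.10+).
        r = statistics.correlation(judge_scores, human_scores)
        return r, r >= threshold

Only a judge that passes the calibration check should be trusted for unattended scoring; otherwise, revise the rubric or prompt and re-check.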

When it applies

Use LLM-as-judge for large-scale eval runs (hundreds to thousands of items) where hand-grading is too expensive. Skip it for high-stakes evals where the cost of a misgraded item outweighs the cost of human review.

Example

A RAG eval runs 500 test questions through the system. A judge model rates each answer 1-5 for groundedness against the retrieved chunks. The aggregated score is the run’s groundedness metric; the rationale lets engineers debug regressions.
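
A minimal sketch of the aggregation step, assuming each judged item is a dict holding the 1-5 groundedness score and the judge's rationale (field names are illustrative):

    from statistics import mean

    def groundedness_report(results, worst_n=10):
        # results: list of dicts like
        #   {"question": ..., "answer": ..., "score": 1-5, "rationale": "..."}
        # The run-level metric is the mean judge score; the lowest-scoring
        # items (with rationales) are the starting point for debugging.
        run_score = mean(item["score"] for item in results)
        worst = sorted(results, key=lambda item: item["score"])[:worst_n]
        return run_score, worst

A drop in the run-level score between two runs flags a regression; the worst-scoring items and their rationales show where to look first.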

Related terms

  • evaluation - the deep-dive with rubric and calibration patterns.
  • golden-set - the hand-labeled ground truth that calibrates the judge.
  • rag-eval - RAG-specific eval patterns where LLM-as-judge is common.
  • hallucination - the most common failure mode the judge looks for.
  • prompt-design - the discipline that produces a calibrated judge prompt.

Citing this term

See LLM-as-judge (llmbestpractices.com/glossary/llm-as-judge).