LLM as Judge
Using a separate LLM to evaluate the outputs of another LLM — approximating human judgment at scale. This is the most powerful eval technique for free-form generation tasks, but also the noisiest and most bias-prone.
Prerequisites
- Prompt Evaluation — the broader eval landscape this technique fits into
- Prompt Engineering — the judge itself needs a well-crafted prompt
How it works
A “judge” model receives:
- The original input (what was asked)
- The model output (what was produced)
- A rubric (what “good” means)
The judge returns a score, a pass/fail decision, and optionally a reasoning chain explaining the judgment. The judge prompt typically includes four sections: a system instruction (“you are an evaluator”), the original input, the model’s output, and the evaluation criterion (rubric). The judge produces structured output: pass/fail, a numeric score, and a reason.
Known biases
The seminal paper — Zheng et al. (2023), “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” — introduced two benchmarks: MT-Bench (a curated set of multi-turn questions designed to test judge quality) and Chatbot Arena (a crowdsourced platform where humans compare model outputs head-to-head). Key finding: GPT-4 as judge agreed with human preferences > 80% of the time, comparable to human-human agreement. It also identified three systematic failure modes.
Position bias
LLM judges favor responses presented first (or last, depending on model) in pairwise comparisons (= showing the judge two responses side by side and asking “which is better?”). Not random — correlated with perceived quality gap.
Mitigation: Run the same comparison twice with swapped ordering and average/reconcile.
Source
“Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs” (2024). Found nearly 50% of pairwise judgments were sensitive to response ordering.
Verbosity bias
Judges prefer longer, more detailed answers regardless of accuracy. A padded, verbose response scores higher than a concise correct one.
Mitigation: Add explicit rubric instructions: “Do not reward length for its own sake. A concise, accurate answer is better than a verbose inaccurate one.”
Source
“Mitigating the Bias of Large Language Model Evaluation” (HuggingFace, 2024)
Self-preference bias
A model scoring its own outputs vs others’ gives itself systematically higher scores. GPT-4 exhibits measurable self-preference.
Mitigation: Use a different model family as judge than the one being evaluated. If evaluating GPT-4o outputs, judge with Claude or Gemini.
Source
“Self-Preference Bias in LLM-as-a-Judge” (2024), arxiv.org/abs/2410.21819
Reliability research (2024-2025)
The picture is sobering:
- Self-inconsistency: “Rating Roulette” (2025) found LLM judges exhibited low intra-rater reliability (= the same judge scoring the same output multiple times gives different scores each time) across runs — scores for the same response varied significantly, making ratings “almost arbitrary” in some conditions.
- Domain-specific weakness: Stanford (2025) found “consistently low and non-significant agreement between human raters and LLMs” in automated essay scoring — cautioning against blind deployment in specialized domains.
- Design factors that help: “An Empirical Study of LLM-as-a-Judge” (2025) found that evaluation criteria quality, Chain-of-Thought reasoning, and non-deterministic sampling (counterintuitively) improved alignment with human judgments.
Rubric design
The rubric is the single most important factor in judge reliability. Vague rubrics produce noisy scores.
Bad rubric
“Is this a good response?”
This gives the judge no criteria to anchor on. Different runs will weight different dimensions randomly.
Good rubric
“Does the response answer the user’s question without including information they didn’t ask for, in 2-3 sentences, using a professional tone? Score 0 if it hallucinates facts not present in the input.”
Atomic vs composite rubrics
Atomic (one criterion per judge call):
- More reliable with smaller judge models
- Failure output tells you exactly what failed
- Costs more API calls
Composite (multi-dimensional scoring in one call):
- Cheaper (one judge call)
- Requires a strong judge model (GPT-4o+)
- Failure diagnosis is harder
Recommendation: Start with atomic rubrics. Consolidate into composite only after validating that the judge handles each criterion correctly in isolation.
Chain-of-Thought in judge prompts
Adding “First explain your reasoning, then give your score” to the judge prompt consistently improves alignment with human judgment. The reasoning forces the model to consider the rubric dimensions before committing to a score.
This is the principle behind G-Eval (Liu et al., 2023) — a framework that forces the judge to think systematically instead of jumping to a score. See G-Eval structured judging with reasoning steps for the full explanation and how promptfoo implements it.
Judge model selection
| Judge model | Strengths | Cost | When to use |
|---|---|---|---|
| GPT-4o | Best general-purpose judge | $2.50/$10 per M tokens | Quality judgment rubrics |
| Claude Sonnet | Strong reasoning, less self-preference for non-Anthropic outputs | $3/$15 per M tokens | Cross-family judging |
| GPT-4o-mini | Good for binary/simple rubrics | $0.15/$0.60 per M tokens | Cheap atomic checks |
| Gemini 2.0 Flash | Cheapest capable cloud judge | $0.10/$0.40 per M tokens | Budget-constrained CI |
| Local (Ollama) | Free, private | Compute only | Simple binary rubrics only |
Small local models as judges
Models under ~30B parameters struggle with nuanced judgment rubrics. They tend to say “yes” to everything. If using Ollama for judging, keep rubrics binary and atomic (“Does the output mention X? YES/NO”), and validate the judge against known good/bad examples before trusting it.
Validating your judge
Before relying on an automated judge, verify it on cases where you already know the right answer:
- Create 5-10 pairs of
(output, expected_judgment)— cases that are obviously good and obviously bad - Run the judge on all of them
- If agreement < 80%, the rubric is too complex for that model — simplify
This calibration step costs pennies and prevents building an entire eval suite on a broken judge.
Practical patterns
Tiered judging
Use different judge models for different assertion costs:
Simple binary check ("mentions error handling?")
→ cheap model (GPT-4o-mini, Gemini Flash)
Quality judgment ("is the strategy specific and measurable?")
→ strong model (GPT-4o, Claude Sonnet)
Never judge your own outputs
If your generation model is Claude Sonnet, don’t use Claude Sonnet as judge. Self-preference bias is real and measurable. Cross-family judging (generate with OpenAI, judge with Anthropic, or vice versa) is the safest pattern.
Scoring vs pass/fail
- Pass/fail is easier to aggregate and act on in CI (quality gate: “all evals pass”)
- Scoring (0-1 or 1-5) captures nuance but requires you to set thresholds, which creates tuning overhead
Start with pass/fail. Move to scoring only when you need to detect gradual quality drift rather than binary regressions.
See also
- Prompt Evaluation — the broader eval framework (taxonomy, design patterns, dataset construction)
- Promptfoo — implements LLM-as-judge via
llm-rubricandg-evalassertion types - Prompt Engineering — rubric design is itself a prompt engineering problem