LLM as Judge

Using a separate LLM to evaluate the outputs of another LLM — approximating human judgment at scale. This is the most powerful eval technique for free-form generation tasks, but also the noisiest and most bias-prone.

Prerequisites

How it works

A “judge” model receives:

  1. The original input (what was asked)
  2. The model output (what was produced)
  3. A rubric (what “good” means)

The judge returns a score, a pass/fail decision, and optionally a reasoning chain explaining the judgment. The judge prompt typically includes four sections: a system instruction (“you are an evaluator”), the original input, the model’s output, and the evaluation criterion (rubric). The judge produces structured output: pass/fail, a numeric score, and a reason.

Known biases

The seminal paper — Zheng et al. (2023), “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” — introduced two benchmarks: MT-Bench (a curated set of multi-turn questions designed to test judge quality) and Chatbot Arena (a crowdsourced platform where humans compare model outputs head-to-head). Key finding: GPT-4 as judge agreed with human preferences > 80% of the time, comparable to human-human agreement. It also identified three systematic failure modes.

Position bias

LLM judges favor responses presented first (or last, depending on model) in pairwise comparisons (= showing the judge two responses side by side and asking “which is better?”). Not random — correlated with perceived quality gap.

Mitigation: Run the same comparison twice with swapped ordering and average/reconcile.

Source

“Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs” (2024). Found nearly 50% of pairwise judgments were sensitive to response ordering.

Verbosity bias

Judges prefer longer, more detailed answers regardless of accuracy. A padded, verbose response scores higher than a concise correct one.

Mitigation: Add explicit rubric instructions: “Do not reward length for its own sake. A concise, accurate answer is better than a verbose inaccurate one.”

Source

“Mitigating the Bias of Large Language Model Evaluation” (HuggingFace, 2024)

Self-preference bias

A model scoring its own outputs vs others’ gives itself systematically higher scores. GPT-4 exhibits measurable self-preference.

Mitigation: Use a different model family as judge than the one being evaluated. If evaluating GPT-4o outputs, judge with Claude or Gemini.

Source

“Self-Preference Bias in LLM-as-a-Judge” (2024), arxiv.org/abs/2410.21819

Reliability research (2024-2025)

The picture is sobering:

  • Self-inconsistency: “Rating Roulette” (2025) found LLM judges exhibited low intra-rater reliability (= the same judge scoring the same output multiple times gives different scores each time) across runs — scores for the same response varied significantly, making ratings “almost arbitrary” in some conditions.
  • Domain-specific weakness: Stanford (2025) found “consistently low and non-significant agreement between human raters and LLMs” in automated essay scoring — cautioning against blind deployment in specialized domains.
  • Design factors that help: “An Empirical Study of LLM-as-a-Judge” (2025) found that evaluation criteria quality, Chain-of-Thought reasoning, and non-deterministic sampling (counterintuitively) improved alignment with human judgments.

Rubric design

The rubric is the single most important factor in judge reliability. Vague rubrics produce noisy scores.

Bad rubric

“Is this a good response?”

This gives the judge no criteria to anchor on. Different runs will weight different dimensions randomly.

Good rubric

“Does the response answer the user’s question without including information they didn’t ask for, in 2-3 sentences, using a professional tone? Score 0 if it hallucinates facts not present in the input.”

Atomic vs composite rubrics

Atomic (one criterion per judge call):

  • More reliable with smaller judge models
  • Failure output tells you exactly what failed
  • Costs more API calls

Composite (multi-dimensional scoring in one call):

  • Cheaper (one judge call)
  • Requires a strong judge model (GPT-4o+)
  • Failure diagnosis is harder

Recommendation: Start with atomic rubrics. Consolidate into composite only after validating that the judge handles each criterion correctly in isolation.

Chain-of-Thought in judge prompts

Adding “First explain your reasoning, then give your score” to the judge prompt consistently improves alignment with human judgment. The reasoning forces the model to consider the rubric dimensions before committing to a score.

This is the principle behind G-Eval (Liu et al., 2023) — a framework that forces the judge to think systematically instead of jumping to a score. See G-Eval structured judging with reasoning steps for the full explanation and how promptfoo implements it.

Judge model selection

Judge modelStrengthsCostWhen to use
GPT-4oBest general-purpose judge$2.50/$10 per M tokensQuality judgment rubrics
Claude SonnetStrong reasoning, less self-preference for non-Anthropic outputs$3/$15 per M tokensCross-family judging
GPT-4o-miniGood for binary/simple rubrics$0.15/$0.60 per M tokensCheap atomic checks
Gemini 2.0 FlashCheapest capable cloud judge$0.10/$0.40 per M tokensBudget-constrained CI
Local (Ollama)Free, privateCompute onlySimple binary rubrics only

Small local models as judges

Models under ~30B parameters struggle with nuanced judgment rubrics. They tend to say “yes” to everything. If using Ollama for judging, keep rubrics binary and atomic (“Does the output mention X? YES/NO”), and validate the judge against known good/bad examples before trusting it.

Validating your judge

Before relying on an automated judge, verify it on cases where you already know the right answer:

  1. Create 5-10 pairs of (output, expected_judgment) — cases that are obviously good and obviously bad
  2. Run the judge on all of them
  3. If agreement < 80%, the rubric is too complex for that model — simplify

This calibration step costs pennies and prevents building an entire eval suite on a broken judge.

Practical patterns

Tiered judging

Use different judge models for different assertion costs:

Simple binary check ("mentions error handling?")
  → cheap model (GPT-4o-mini, Gemini Flash)

Quality judgment ("is the strategy specific and measurable?")
  → strong model (GPT-4o, Claude Sonnet)

Never judge your own outputs

If your generation model is Claude Sonnet, don’t use Claude Sonnet as judge. Self-preference bias is real and measurable. Cross-family judging (generate with OpenAI, judge with Anthropic, or vice versa) is the safest pattern.

Scoring vs pass/fail

  • Pass/fail is easier to aggregate and act on in CI (quality gate: “all evals pass”)
  • Scoring (0-1 or 1-5) captures nuance but requires you to set thresholds, which creates tuning overhead

Start with pass/fail. Move to scoring only when you need to detect gradual quality drift rather than binary regressions.

See also

  • Prompt Evaluation — the broader eval framework (taxonomy, design patterns, dataset construction)
  • Promptfoo — implements LLM-as-judge via llm-rubric and g-eval assertion types
  • Prompt Engineering — rubric design is itself a prompt engineering problem