Inter-Rater Reliability
Inter-rater reliability measures how much agreement exists between independent raters (human or automated) beyond what random chance would produce. It answers: “if two people rate the same thing, do they agree — and is that agreement meaningful or just luck?”
Prerequisites
- Basic probability (what “expected by chance” means)
Why it matters
When multiple judges rate the same output (e.g., two humans scoring LLM responses, or a human and an LLM judge), raw percent agreement is misleading. With only two categories, two raters guessing at random already agree about 50% of the time, and if both lean toward “good,” they agree even more often by accident. Inter-rater reliability corrects for this baseline.
Cohen’s Kappa ($\kappa$)
Designed for two raters on categorical labels (e.g., pass/fail, positive/negative/neutral).
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Notation
- $p_o$: observed agreement (fraction of cases where both raters gave the same label)
- $p_e$: expected agreement by chance (if each rater assigned labels randomly at their observed rates)
Interpretation:
| $\kappa$ | Meaning |
|---|---|
| < 0 | Worse than chance (systematic disagreement) |
| 0.0 | Agreement equals chance |
| 0.01 - 0.20 | Slight agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Almost perfect agreement |
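How to compute $p_e$ (expected agreement by chance)
$p_e$ asks: “if both raters were flipping weighted coins (using their individual label frequencies), how often would they land on the same label by accident?”
For each label $k$, compute the probability that both raters independently pick that label, then sum over all labels:
$$p_e = \sum_{k} p_{A,k} \, p_{B,k}$$
Where:
- $p_{A,k}$: fraction of items that rater A assigned label $k$
- $p_{B,k}$: fraction of items that rater B assigned label $k$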
In other words: look at how often each rater uses each label overall, then compute the probability they’d both land on the same label if they were rating independently (without even looking at the item).
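The chance-agreement calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `expected_agreement` and the sample label lists are made up for this example, and it assumes both raters labeled the same items in the same order.

```python
from collections import Counter


def expected_agreement(labels_a, labels_b):
    """Chance agreement p_e from two raters' labels (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    freq_a = Counter(labels_a)  # how often rater A used each label
    freq_b = Counter(labels_b)  # how often rater B used each label
    # Sum over every label either rater used; Counter returns 0 for unseen labels.
    return sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )


# Illustrative data: A says "good" 75% of the time, B says "good" 25% of the time.
rater_a = ["good", "good", "good", "bad"]
rater_b = ["good", "bad", "bad", "bad"]
p_e = expected_agreement(rater_a, rater_b)
```

Note that the function never compares the two raters item by item: only the per-rater label frequencies enter the result, which is exactly what "rating without looking at the item" means.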
Worked example
Two raters classify 100 LLM outputs as “good” or “bad”:
| | Rater B: Good | Rater B: Bad | Total |
|---|---|---|---|
| Rater A: Good | 40 | 10 | 50 |
| Rater A: Bad | 5 | 45 | 50 |
| Total | 45 | 55 | 100 |
Step 1: Observed agreement ($p_o$)
Count the cases where both raters gave the same label (the diagonal cells): $40 + 45 = 85$ agreements out of 100.
$$p_o = \frac{40 + 45}{100} = 0.85$$
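Step 2: Expected agreement by chance ($p_e$)
Look at each rater’s marginal frequencies (the row/column totals):
- Rater A said “Good” 50 times out of 100 → $p_{A,\text{good}} = 0.50$
- Rater A said “Bad” 50 times out of 100 → $p_{A,\text{bad}} = 0.50$
- Rater B said “Good” 45 times out of 100 → $p_{B,\text{good}} = 0.45$
- Rater B said “Bad” 55 times out of 100 → $p_{B,\text{bad}} = 0.55$
If they were rating independently (random coins with these weights):
$$p_e = (0.50 \times 0.45) + (0.50 \times 0.55) = 0.225 + 0.275 = 0.50$$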
So by pure chance, we’d expect 50% agreement.
Step 3: Kappa
$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.85 - 0.50}{1 - 0.50} = 0.70$$
Interpretation: They agree 85% of the time, but chance alone would give 50%. Kappa = 0.70 means they captured 70% of the agreement that was available beyond chance — substantial agreement.
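The three steps of the worked example can be reproduced with a short Python sketch. The function name `cohens_kappa` and the nested-list table layout (rows for Rater A, columns for Rater B) are choices made for this illustration only.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of items rater A labeled i and rater B labeled j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Step 1: observed agreement is the diagonal mass.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Step 2: chance agreement from each rater's marginal frequencies.
    row_totals = [sum(row) for row in table]                              # rater A
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]   # rater B
    p_e = sum((row_totals[i] / n) * (col_totals[i] / n) for i in range(k))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    # Step 3: share of the beyond-chance agreement that was actually achieved.
    return (p_o - p_e) / (1 - p_e)


# The 100 LLM outputs from the table above.
worked_example = [
    [40, 10],  # Rater A: Good
    [5, 45],   # Rater A: Bad
]
print(round(cohens_kappa(worked_example), 2))  # 0.7
```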
Limitations
- Only works for exactly two raters
- Assumes categorical (nominal) labels — doesn’t account for “how wrong” a disagreement is (rating 1 vs 5 counts the same as 4 vs 5)
- Can be paradoxically low when prevalence is very skewed (if 95% of cases are “good,” even high raw agreement can produce a low $\kappa$)
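
To make the prevalence effect in the last limitation concrete, here is a worked calculation on made-up numbers (not data from any study). Suppose two raters label 100 items: both say “good” on 90, each says “good” alone on 5, and neither ever agrees on “bad.” Each rater therefore says “good” 95 times.

$$p_o = \frac{90 + 0}{100} = 0.90$$

$$p_e = (0.95 \times 0.95) + (0.05 \times 0.05) = 0.9025 + 0.0025 = 0.905$$

$$\kappa = \frac{0.90 - 0.905}{1 - 0.905} = \frac{-0.005}{0.095} \approx -0.05$$

The raters agree on 90% of items, yet kappa is slightly below zero, because two raters who almost always say “good” would agree about that often without looking at the items.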
Krippendorff’s Alpha ($\alpha$)
A more general metric that handles:
- Any number of raters (not just two)
- Ordinal and continuous scales (not just categories) — a disagreement of 1 vs 5 counts more than 4 vs 5
- Missing data (raters don’t need to rate every item)
$$\alpha = 1 - \frac{D_o}{D_e}$$
Notation
- $D_o$: observed disagreement (average distance between ratings on the same item)
- $D_e$: expected disagreement by chance (average distance if ratings were randomly assigned)
Interpretation: Same scale as $\kappa$ (0 is chance, 1 is perfect). Krippendorff himself recommends $\alpha \geq 0.667$ as the minimum for tentative conclusions and $\alpha \geq 0.80$ for reliable data.
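The following is a minimal pure-Python sketch of Krippendorff’s alpha, written to show how the observed and expected disagreement terms are built and how missing ratings are handled. It is an illustration under stated assumptions, not an optimized or reference implementation: the names `krippendorff_alpha`, `nominal`, and `interval` are invented here, missing ratings are represented as `None`, and the expected-disagreement loop is quadratic in the number of ratings.

```python
from itertools import permutations


def nominal(a, b):
    """Distance for categorical labels: any mismatch counts the same."""
    return 0.0 if a == b else 1.0


def interval(a, b):
    """Squared difference, so 1 vs 5 counts far more than 4 vs 5."""
    return float((a - b) ** 2)


def krippendorff_alpha(units, distance=nominal):
    """units: one list of ratings per item; None marks a missing rating."""
    # An item needs at least two ratings to contribute any pair.
    pairable = []
    for unit in units:
        present = [v for v in unit if v is not None]
        if len(present) >= 2:
            pairable.append(present)
    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")
    # Observed disagreement: pairs of ratings within the same item.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n
    # Expected disagreement: pairs of ratings drawn from the whole pool.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Three raters, four items, with gaps; the last item has a single rating and is skipped.
ratings = [
    [1, 1, None],
    [2, None, 2],
    [1, 2, None],
    [None, 3, None],
]
alpha_nominal = krippendorff_alpha(ratings, distance=nominal)
```

Swapping `distance=interval` is what lets the same function handle numeric scales: the structure of the calculation stays the same and only the notion of “how far apart” two ratings are changes.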
When to use which
| Situation | Use |
|---|---|
| Two raters, pass/fail labels | Cohen’s $\kappa$ |
| Two raters, 1-5 scale | Weighted $\kappa$ (accounts for distance between ratings) |
| 3+ raters, any scale | Krippendorff’s $\alpha$ |
| Missing data (not all raters scored all items) | Krippendorff’s $\alpha$ |
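Weighted kappa is mentioned in the table but not defined in this note, so here is a hedged sketch of one common form: disagreements are weighted by how far apart the two ratings are, and kappa becomes one minus the ratio of observed to chance-expected weighted disagreement. The function name `weighted_kappa`, the default linear weight `abs(i - j)`, and both sample tables are assumptions made for illustration.

```python
def weighted_kappa(table, weight=lambda i, j: abs(i - j)):
    """Weighted kappa for two raters on an ordered scale.

    table[i][j] = number of items rater A put in category i and rater B in j,
    with categories listed in scale order. Linear weights are the default here;
    pass weight=lambda i, j: (i - j) ** 2 for quadratic weights.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Weighted disagreement actually observed.
    observed = sum(
        weight(i, j) * table[i][j] / n for i in range(k) for j in range(k)
    )
    # Weighted disagreement expected if the raters were independent.
    expected = sum(
        weight(i, j) * (row_totals[i] / n) * (col_totals[j] / n)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        raise ValueError("weighted kappa is undefined when no disagreement is expected")
    return 1.0 - observed / expected


# Two hypothetical 3-point tables with the same raw agreement (30 of 40 items).
near_misses = [[10, 5, 0], [5, 10, 0], [0, 0, 10]]  # disagreements are one step apart
far_misses = [[10, 0, 5], [0, 10, 0], [5, 0, 10]]   # disagreements are two steps apart
kappa_near = weighted_kappa(near_misses)
kappa_far = weighted_kappa(far_misses)
```

In these two tables the raw agreement and the chance agreement are identical, so unweighted kappa cannot tell them apart, while the weighted version scores the near-miss table higher. With only two categories, linear weights reduce to ordinary Cohen’s kappa.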
Application to LLM evaluation
In Prompt Evaluation, inter-rater reliability shows up in two ways:
- Human eval quality: If your human raters disagree (low $\kappa$ or $\alpha$ on the scales above), the rubric is underspecified. Fix the rubric before collecting more labels.
- Judge validation: Compare LLM judge scores against human scores. If agreement between the judge and humans is low on the same scales, the judge isn’t reliable enough for that rubric; simplify the rubric or use a stronger judge model.
See also
- Prompt Evaluation — where inter-rater reliability is used in practice
- LLM as Judge — validating automated judges against human agreement