Inter-Rater Reliability

Inter-rater reliability measures how much agreement exists between independent raters (human or automated) beyond what random chance would produce. It answers: “if two people rate the same thing, do they agree — and is that agreement meaningful or just luck?”

Prerequisites

Basic probability (what “expected by chance” means)

Why it matters

When multiple judges rate the same output (e.g., two humans scoring LLM responses, or a human and an LLM judge), raw percent agreement is misleading. With only two categories, two raters labeling at random already agree about 50% of the time, and more often than that if both lean toward “good.” Inter-rater reliability corrects for this chance baseline.

Cohen’s Kappa ( $κ$ )

Designed for two raters on categorical labels (e.g., pass/fail, positive/negative/neutral).

$κ = \frac{p_{o} - p_{e}}{1 - p_{e}}$

Notation

$p_{o}$ — observed agreement (fraction of cases where both raters gave the same label)

$p_{e}$ — expected agreement by chance (if each rater assigned labels randomly at their observed rates)

Interpretation:

$κ$	Meaning
< 0	Worse than chance (systematic disagreement)
0.0	Agreement equals chance
0.01 - 0.20	Slight agreement
0.21 - 0.40	Fair agreement
0.41 - 0.60	Moderate agreement
0.61 - 0.80	Substantial agreement
0.81 - 1.00	Almost perfect agreement

How to compute $p_{e}$ (expected agreement by chance)

$p_{e}$ asks: “if both raters were flipping weighted coins (using their individual label frequencies), how often would they land on the same label by accident?”

For each label $k$ , compute the probability that both raters independently pick that label:

$p_{e} = \sum_{k} P(\text{Rater A picks } k) \times P(\text{Rater B picks } k)$

Where:

$P(\text{Rater A picks } k) = \frac{\text{number of times A used label } k}{\text{total items}}$
$P(\text{Rater B picks } k) = \frac{\text{number of times B used label } k}{\text{total items}}$

In other words: look at how often each rater uses each label overall, then compute the probability they’d both land on the same label if they were rating independently (without even looking at the item).
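The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cohen_kappa` and the sample labels are made up for this note, and the function assumes both raters labeled exactly the same items in the same order.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # p_o: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Each rater's overall label frequencies (the "weighted coins").
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)

    # p_e: sum over labels k of P(A picks k) * P(B picks k).
    all_labels = freq_a.keys() | freq_b.keys()
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in all_labels)

    if p_e == 1.0:
        # Both raters used one identical label for everything: 0/0.
        raise ValueError("kappa is undefined when p_e equals 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for four items.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
print(cohen_kappa(rater_a, rater_b))  # p_o = 0.75, p_e = 0.50, so kappa = 0.5
```

The explicit check for $p_{e} = 1$ is a design choice in this sketch: the formula divides by $1 - p_{e}$, so that case has no defined value.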

Worked example

Two raters classify 100 LLM outputs as “good” or “bad”:

	Rater B: Good	Rater B: Bad	Total
Rater A: Good	40	10	50
Rater A: Bad	5	45	50
Total	45	55	100

Step 1 — Observed agreement ( $p_{o}$ ):

Count the cases where both raters gave the same label (the diagonal cells): $40 + 45 = 85$ agreements out of 100.

$p_{o} = \frac{85}{100} = 0.85$

Step 2 — Expected agreement by chance ( $p_{e}$ ):

Look at each rater’s marginal frequencies (the row/column totals):

Rater A said “Good” 50 times out of 100 → $P(A = \text{Good}) = 0.50$
Rater A said “Bad” 50 times out of 100 → $P(A = \text{Bad}) = 0.50$
Rater B said “Good” 45 times out of 100 → $P(B = \text{Good}) = 0.45$
Rater B said “Bad” 55 times out of 100 → $P(B = \text{Bad}) = 0.55$

If they were rating independently (random coins with these weights):

$p_{e} = P(A = \text{Good}) \times P(B = \text{Good}) + P(A = \text{Bad}) \times P(B = \text{Bad})$

$p_{e} = (0.50 \times 0.45) + (0.50 \times 0.55) = 0.225 + 0.275 = 0.50$

So by pure chance, we’d expect 50% agreement.

Step 3 — Kappa:

$κ = \frac{p_{o} - p_{e}}{1 - p_{e}} = \frac{0.85 - 0.50}{1 - 0.50} = \frac{0.35}{0.50} = 0.70$

Interpretation: They agree 85% of the time, but chance alone would give 50%. Kappa = 0.70 means they captured 70% of the agreement that was available beyond chance — substantial agreement.
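The three steps can be checked directly from the contingency table. The sketch below assumes a square table with rater A on the rows and rater B on the columns; `kappa_from_table` is a hypothetical helper name, not a library function.

```python
def kappa_from_table(table):
    """Return (p_o, p_e, kappa) for a square contingency table.

    Rows are rater A's labels, columns are rater B's labels.
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Step 1: observed agreement is the diagonal over the total.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Step 2: expected agreement from the row and column totals (marginals).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_e = sum((row_totals[i] / n) * (col_totals[i] / n) for i in range(k))

    # Step 3: kappa.
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# The table from the worked example: rows are A (Good, Bad), columns are B (Good, Bad).
p_o, p_e, kappa = kappa_from_table([[40, 10], [5, 45]])
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))  # 0.85 0.5 0.7
```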

Limitations

Only works for exactly two raters
Assumes categorical (nominal) labels — doesn’t account for “how wrong” a disagreement is (rating 1 vs 5 counts the same as 4 vs 5)
Can be paradoxically low when prevalence is very skewed (if 95% of cases are “good,” even high raw agreement produces low $κ$ )
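The prevalence problem is easiest to see with numbers. The table below is invented for illustration (it is not data from any study): both raters call 95 of 100 items “good” and agree on 90 of them, yet $κ$ comes out slightly below zero because chance agreement is already about 90%.

```python
# Hypothetical skewed table: rows are rater A, columns are rater B.
table = [
    [90, 5],  # A: good -> B: good, B: bad
    [5, 0],   # A: bad  -> B: good, B: bad
]
n = sum(sum(row) for row in table)

p_o = (table[0][0] + table[1][1]) / n        # raw agreement: 0.90
a_good = sum(table[0]) / n                   # P(A = good) = 0.95
b_good = (table[0][0] + table[1][0]) / n     # P(B = good) = 0.95

# Chance agreement is dominated by the "good"/"good" term (0.95 * 0.95).
p_e = a_good * b_good + (1 - a_good) * (1 - b_good)
kappa = (p_o - p_e) / (1 - p_e)

print(f"p_o={p_o:.3f}  p_e={p_e:.3f}  kappa={kappa:.3f}")
# p_o=0.900  p_e=0.905  kappa=-0.053
```

When labels are this skewed, it helps to report raw agreement and the label distribution next to $κ$ rather than $κ$ alone.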

Krippendorff’s Alpha ( $α$ )

A more general metric that handles:

Any number of raters (not just two)
Ordinal and continuous scales (not just categories) — a disagreement of 1 vs 5 counts more than 4 vs 5
Missing data (raters don’t need to rate every item)

$α = 1 - \frac{D_{o}}{D_{e}}$

Notation

$D_{o}$ — observed disagreement (average distance between ratings on the same item)

$D_{e}$ — expected disagreement by chance (average distance if ratings were randomly assigned)

Interpretation: Same scale as $κ$ — 0 is chance, 1 is perfect. Krippendorff himself recommends $α \geq 0.667$ as the minimum for tentative conclusions and $α \geq 0.8$ for reliable data.
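A minimal sketch of how $D_{o}$ and $D_{e}$ can be computed, assuming the standard pairwise formulation: each item contributes the distances between all pairs of its ratings, weighted by $1/(m - 1)$ for an item with $m$ ratings, and $D_{e}$ averages the distance over all pairs of ratings pooled across items. The names `krippendorff_alpha`, `nominal`, and `interval` are illustrative, the ordinal metric is left out, and the quadratic loop over all pairs is only practical for small datasets.

```python
from itertools import permutations

def nominal(a, b):
    """Distance for categorical labels: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def interval(a, b):
    """Distance for numeric scales: squared difference."""
    return float((a - b) ** 2)

def krippendorff_alpha(units, distance=nominal):
    """Krippendorff's alpha.

    units: one list per item, holding the ratings that item received.
           Missing ratings are simply left out of the list.
    """
    # Items with fewer than two ratings cannot be paired, so drop them.
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # D_o: average within-item distance over ordered pairs of ratings.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # D_e: average distance over ordered pairs of all ratings, ignoring items.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - d_o / d_e

# Hypothetical 1-5 scores from up to three raters; the second item has a missing rating.
scores = [[1, 1, 1], [2, 2], [3, 3, 4], [5, 5, 5]]
print(krippendorff_alpha(scores, distance=interval))
```

Passing `distance=nominal` treats every disagreement as equal, while `distance=interval` makes 1 vs 5 count more than 4 vs 5.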

When to use which

Situation	Use
Two raters, pass/fail labels	Cohen’s $κ$
Two raters, 1-5 scale	Weighted $κ$ (accounts for distance between ratings)
3+ raters, any scale	Krippendorff’s $α$
Missing data (not all raters scored all items)	Krippendorff’s $α$
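For the two-raters-on-a-scale row, one common form is linearly weighted $κ$, where a disagreement between scores $i$ and $j$ costs $|i - j|$. The sketch below assumes that weighting and numeric scores; `weighted_kappa` is an illustrative name, and other weights (for example quadratic) can be passed in.

```python
from collections import Counter

def weighted_kappa(ratings_a, ratings_b, weight=lambda i, j: abs(i - j)):
    """Weighted kappa: 1 - (observed weighted disagreement / expected by chance)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(ratings_a)

    # Observed: average disagreement weight over the items actually rated.
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n

    # Expected: average weight if scores were drawn independently
    # from each rater's own score distribution.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        weight(i, j) * (freq_a[i] / n) * (freq_b[j] / n)
        for i in freq_a
        for j in freq_b
    )
    if expected == 0:
        raise ValueError("weighted kappa is undefined when both raters give one identical score")
    return 1 - observed / expected

# Hypothetical 1-5 scores: a near miss (5 vs 4) and a far miss (5 vs 1) on the last item.
human = [1, 2, 3, 4, 5]
print(weighted_kappa(human, [1, 2, 3, 4, 4]))  # near miss: higher score
print(weighted_kappa(human, [1, 2, 3, 4, 1]))  # far miss: lower score
```

With a 0/1 weight (0 when the scores match, 1 otherwise) this reduces to plain Cohen’s $κ$.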

Application to LLM evaluation

In Prompt Evaluation, inter-rater reliability shows up in two ways:

Human eval quality: If your human raters disagree ( $κ < 0.4$ ), the rubric is underspecified. Fix the rubric before collecting more labels.
Judge validation: Compare LLM judge scores against human scores. If $κ < 0.6$ between the judge and humans, the judge isn’t reliable enough for that rubric — simplify the rubric or use a stronger judge model.

Edmondo's Vault
