Prompt Evaluation

Prompt evaluation (“evals”) is the discipline of systematically testing whether LLM prompts produce correct, consistent, and useful outputs. It addresses a fundamental gap: traditional software testing assumes deterministic functions whose outputs are enumerable and objectively checkable, where the same input always yields the same output. LLMs violate each of those assumptions.

Prerequisites

Prompt Engineering — techniques for crafting prompts

Basic understanding of Sampling techniques (temperature, top-p)

Why LLM evaluation is hard

| Property | Deterministic software | LLMs |
| --- | --- | --- |
| Same input → same output | Always | Probabilistic (temperature > 0) |
| Output space | Finite, enumerable | Effectively unbounded natural language |
| “Correct” answer | Objectively defined | Often subjective, context-dependent |
| Test oracle (= how you know the right answer) | Exact comparison | Requires semantic judgment |
| Failure signal | Stack trace / exception | Plausible-sounding wrong answer |

The most dangerous failure mode is the confident, fluent hallucination — the system produces an answer that looks correct and would pass a naive string check, but is factually wrong or task-inappropriate. Traditional unit tests have no mechanism to catch this.

Taxonomy of eval approaches

Deterministic checks

Work when the output space is finite and enumerable: classification labels, extracted entities, structured JSON, single numeric answers.

Exact match / equals — brittle for free-text; a model saying “Positive” vs “positive” vs “The sentiment is positive” may all be correct
Substring / regex — fast, free, but can’t reason about semantics (“contains the word ‘safe’” does not mean the response conveys safety)
JSON schema validation — gates more specific field-level checks
Cost / latency guards — budget enforcement in CI

Pattern: Use deterministic checks as gates before more expensive semantic checks. If it’s not valid JSON, skip the field-level and semantic assertions.
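To make the exact-match brittleness concrete, here is a minimal sketch of a "parse before asserting" check for a classification label. The label set and the function names `extract_label` and `label_matches` are hypothetical, chosen only for illustration.

```python
import re
from typing import Optional

LABELS = ("positive", "negative", "neutral")  # hypothetical label set

def extract_label(output: str) -> Optional[str]:
    """Return the one label mentioned in the output, or None if absent or ambiguous."""
    found = {
        label for label in LABELS
        if re.search(rf"\b{label}\b", output, flags=re.IGNORECASE)
    }
    # Zero labels or several labels both count as "could not parse".
    return found.pop() if len(found) == 1 else None

def label_matches(output: str, expected: str) -> bool:
    """Deterministic check that tolerates casing and surrounding words."""
    return extract_label(output) == expected.lower()
```

With this, "Positive", "positive", and "The sentiment is positive" all pass against the expected label `positive`. The check is still a regex, so it inherits the limitation noted above: "not positive" would also pass, which is why such checks work best as cheap gates rather than as the final word.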

Model-graded evals (LLM-as-judge)

A separate LLM (the “judge”) is prompted with [input, output, rubric] and asked to score or pass/fail the response. A rubric is a scoring guide — a written description of what “good” looks like, broken into specific criteria the judge checks against (e.g., “the answer must be factually correct, under 3 sentences, and not include information the user didn’t ask for”). This approximates human judgment at scale.
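As a rough sketch of the [input, output, rubric] flow, the code below builds a judge prompt and parses a pass/fail verdict. The prompt wording, the function names, and the `call_judge` callable (any function that sends a prompt to a judge model and returns its text reply) are assumptions for illustration, not a specific provider's API.

```python
from typing import Callable

# Hypothetical judge prompt; real rubrics are usually longer and more specific.
JUDGE_TEMPLATE = """You are grading a model response against a rubric.

Rubric:
{rubric}

User input:
{user_input}

Model response:
{response}

Reply with exactly one word: PASS or FAIL."""

def build_judge_prompt(user_input: str, response: str, rubric: str) -> str:
    """Fill the template with the three pieces the judge needs."""
    return JUDGE_TEMPLATE.format(rubric=rubric, user_input=user_input, response=response)

def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's reply to a boolean; refuse to guess on anything else."""
    verdict = judge_reply.strip().upper()
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"unparseable judge reply: {judge_reply!r}")
    return verdict == "PASS"

def grade(user_input: str, response: str, rubric: str,
          call_judge: Callable[[str], str]) -> bool:
    """Run one model-graded check using the injected judge function."""
    return parse_verdict(call_judge(build_judge_prompt(user_input, response, rubric)))
```

Injecting `call_judge` keeps the sketch testable with a stub and independent of any particular client library. Raising on an unparseable reply, rather than defaulting to pass or fail, keeps judge formatting errors visible as their own failure type.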

See LLM as Judge for the full treatment: known biases, rubric design, judge model selection, and the seminal research.

Embedding similarity

Embed both the model output and a reference response, then compute Cosine Similarity between the two vectors. BERTScore is a variant that does this comparison token-by-token and computes precision/recall/F1 across all tokens.

When useful: Summarization where wording doesn’t matter but meaning must be preserved; RAG faithfulness checks.

When misleading: Classification/extraction (semantic similarity to the gold answer doesn’t tell you if the model picked the right label); short outputs (1-3 words where embedding similarity is noisy); numerical divergence (“$100” vs “$10,000” can score highly similar).
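For reference, the cosine similarity step itself is small. The sketch below assumes the two vectors have already been produced by some embedding model (not shown); the function name is illustrative.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The score depends only on direction, not magnitude, so `[1, 2]` and `[2, 4]` score as identical. Any pass threshold applied to the score is a tuning choice that should be validated against labeled examples for the task at hand.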

Human evals

When necessary:

Novel task types where you don’t yet know what “correct” looks like
High-stakes decisions (medical, legal, financial)
Calibrating/auditing your automated judge
When LLM-judge agreement with humans is unknown for your domain

Measure Inter-Rater Reliability to verify that raters actually agree beyond chance. Use Cohen’s Kappa ($κ$) for two raters on categorical labels, or Krippendorff’s Alpha ($α$) for any number of raters and for ordinal or continuous scales. Both equal 1 for perfect agreement and 0 for agreement no better than chance; negative values mean the raters agree less often than chance would predict. $κ > 0.6$ is conventionally “substantial agreement.” If $κ < 0.4$, the rubric is probably underspecified and the human evals are close to noise.
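Cohen’s Kappa is defined as $κ = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement rate and $p_e$ is the agreement expected by chance from each rater’s label frequencies. A minimal sketch for two raters follows; the function name and the example labels are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of items where the labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)
```

Two raters who split their labels evenly and agree on exactly half the items get $κ = 0$, even though raw agreement is 50%. That gap between raw agreement and chance-corrected agreement is the reason to report kappa rather than a simple match rate.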

Custom programmatic assertions (composite checks)

The most underused but most powerful pattern. Combine multiple assertion types into a single logical check:

format-gate → field-existence → field-value → semantic/domain logic

A model might return valid JSON (passes format gate) but with a hallucinated field name or out-of-range value. Layering catches each failure mode independently, and the failure output tells you which layer failed.
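The layering can be sketched as a single function that returns the first layer that failed. The schema (a `sentiment` and a `confidence` field), the allowed values, and the optional `domain_check` hook are hypothetical examples, not a prescribed format.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_FIELDS = {"sentiment", "confidence"}             # hypothetical schema
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}  # hypothetical values

def check_output(raw: str,
                 domain_check: Optional[Callable[[dict], bool]] = None) -> Tuple[bool, str]:
    """Return (passed, layer): the first failing layer, or "ok" if all pass."""
    # Layer 1: format gate. Anything that is not a JSON object stops here.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "format"
    if not isinstance(data, dict):
        return False, "format"
    # Layer 2: field existence. Catches missing and hallucinated field names.
    if set(data) != REQUIRED_FIELDS:
        return False, "field-existence"
    # Layer 3: field values. Types and ranges.
    sentiment, confidence = data["sentiment"], data["confidence"]
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return False, "field-value"
    if (isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
            or not 0 <= confidence <= 1):
        return False, "field-value"
    # Layer 4: semantic/domain logic, supplied by the caller if needed.
    if domain_check is not None and not domain_check(data):
        return False, "domain"
    return True, "ok"
```

Because each layer returns early, the expensive checks never run on output that already failed a cheap one, and the returned layer name can be aggregated across a test run to show where failures cluster.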

Eval design patterns

Handling non-determinism

| Strategy | Tradeoff |
| --- | --- |
| `temperature=0` + fixed seed | Near-deterministic, but not all providers support it; doesn’t reflect production behavior |
| Run N times, check pass rate | Reflects real variance; multiplies cost |
| Statistical significance testing | Gold standard for comparing prompt versions; requires enough samples |
| Threshold-based pass rate | “Passes if it passes $\geq$ 80% of N runs”; practical middle ground |
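A threshold-based pass rate can be sketched as follows. `run_once` stands in for "call the model once and apply the assertion"; it is a hypothetical callable, as are the default values for `n_runs` and `threshold`.

```python
from typing import Callable, Tuple

def passes_threshold(run_once: Callable[[], bool],
                     n_runs: int = 10,
                     threshold: float = 0.8) -> Tuple[bool, float]:
    """Run a non-deterministic check n_runs times; pass if the pass rate meets the threshold."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if run_once())
    pass_rate = passes / n_runs
    return pass_rate >= threshold, pass_rate
```

Returning the pass rate alongside the verdict makes it possible to track drift over time, for example a case that slides from 100% to 80% while still technically passing.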

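For comparing two prompt versions statistically: N=100 per condition is far too few to detect a real difference of 5 percentage points. Detecting it at $p < 0.05$ with 80% power takes on the order of ~800 examples per condition (about 700 to 900 when the baseline pass rate is 80-85%, by the standard two-proportion sample-size formula). Don’t draw strong conclusions from N < 50 for close comparisons.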
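The required sample size for such a comparison can be estimated with the standard normal-approximation formula for two independent proportions, $n = (z_{1-\alpha/2} + z_{power})^{2} \times (p_1(1-p_1) + p_2(1-p_2)) / (p_1 - p_2)^{2}$ per condition. The sketch below is one common approximation, not the only valid power calculation; the function name is illustrative.

```python
import math
from statistics import NormalDist

def samples_per_condition(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate cases needed per prompt version to detect p1 vs p2 (two-sided test)."""
    if p1 == p2:
        raise ValueError("pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)
```

For an 80% versus 85% pass rate this comes out at roughly 900 cases per condition, and it drops as the baseline approaches 100% because the variance term shrinks. The required size grows with the inverse square of the difference, so halving the detectable gap roughly quadruples the cost.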

Building a good eval dataset

| Category | Purpose | Proportion |
| --- | --- | --- |
| Happy path / typical | Representative of real production traffic | ~50% |
| Edge cases | Near-boundary inputs, unusual formats, empty/null | ~20% |
| Adversarial | Inputs designed to trigger known failure modes | ~15% |
| Regression anchors | Cases that previously failed and were fixed | ~15% |

Where eval data should come from (ranked by quality):

Real production logs (sanitized) — highest ecological validity
Human-written examples by domain experts
LLM-generated + human-reviewed
LLM-generated only — lowest quality, high risk of systematic gaps

Minimum sample size: How many test cases do you need before you can trust your pass rate isn’t just noise? This depends on how precise you want the measurement to be.

The formula comes from the standard sample-size calculation for estimating a proportion (here, the pass rate) with a given confidence and margin of error:

$n = \frac{z^{2} \times p \times (1 - p)}{E^{2}}$

Notation

$n$ — required number of test cases

$z$ — z-score for desired confidence level (1.96 for 95% confidence)

$p$ — estimated pass rate (use 0.5 = worst case, maximizes the required $n$ )

$E$ — acceptable margin of error (how close to the “true” pass rate you want to be)

With 95% confidence and $\pm$ 5% margin of error:

$n = \frac{1.96^{2} \times 0.5 \times 0.5}{0.05^{2}} \approx 385$

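In plain terms: if your eval suite has 385 cases, the measured pass rate is within $\pm$ 5 points of the true pass rate with 95% confidence, even in the worst case ( $p = 0.5$ ). At a measured 90% pass rate the margin tightens to about $\pm$ 3 points, because $p(1 - p)$ is smaller. With only 50 cases, the worst-case margin is $\pm$ 14 points (about $\pm$ 8 at a 90% pass rate), which is too noisy to detect a 5-point regression.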
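The sample-size formula and its inverse (margin of error for a given suite size) translate directly into code. This is a small sketch using the same normal approximation as the formula; the function names are illustrative.

```python
import math

def required_cases(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Test cases needed to estimate a pass rate within +/- margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a pass rate p measured on n test cases."""
    return z * math.sqrt(p * (1 - p) / n)
```

Calling `required_cases(0.05)` reproduces the 385 figure, and `margin_of_error(50)` gives roughly 0.14. Note that the normal approximation becomes less reliable for small suites and for pass rates very close to 0% or 100%.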

Versioning eval sets — avoiding “teaching to the test”

The anti-pattern: iteratively improving a prompt against the same 50 test cases. By iteration 10, the prompt is excellent at those 50 cases and may have degraded on everything else. This is Goodhart’s Law applied to evals.

Disciplines:

Frozen test split — a portion of the eval set is never used for prompt iteration. Run it only at milestone checkpoints.
Eval dataset versioning — store eval sets in git alongside prompt configs. Treat eval changes as code changes requiring review.
Regular refresh — every few weeks, add new examples from production traffic. Retire stale examples if the use case has changed.
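One way to keep a frozen split stable as the dataset grows is to assign each case to "dev" or "test" from a hash of its ID, so the assignment never changes when cases are added or reordered. The sketch below assumes each case has a stable string ID; the function name and the 20% default are illustrative.

```python
import hashlib

def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a case to the frozen "test" split or the "dev" split."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "test" if bucket < test_fraction else "dev"
```

Because the split depends only on the ID, it needs no stored state and survives dataset refreshes. Changing `test_fraction` later will move some cases across the boundary, so the fraction itself should be treated as part of the versioned eval config.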

Red-teaming and adversarial evals

Systematically trying to break the prompt/system, not just validating the happy path:

| Type | Description |
| --- | --- |
| Prompt injection | User input attempting to override system prompt |
| Jailbreaks | Social engineering to bypass safety constraints |
| Robustness to paraphrase | Same question phrased 5 different ways: are the answers consistent? |
| Out-of-distribution | Languages, formats, domains the model wasn’t designed for |
| Hallucination stress tests | Questions designed to produce plausible-sounding wrong answers |

Run adversarial evals separately from regression evals — they’re a different signal (security/safety vs. accuracy/helpfulness).
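As one example from the table, robustness to paraphrase can be scored as the share of paraphrases that yield the most common answer. In this sketch `ask` is a hypothetical callable that sends a question to the system under test, and the default normalization (trim and lowercase) only suits short answers; longer answers would need a semantic comparison instead.

```python
from collections import Counter
from typing import Callable, Sequence

def _default_normalize(answer: str) -> str:
    return answer.strip().lower()

def paraphrase_consistency(paraphrases: Sequence[str],
                           ask: Callable[[str], str],
                           normalize: Callable[[str], str] = _default_normalize) -> float:
    """Fraction of paraphrases whose normalized answer equals the most common answer."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    answers = [normalize(ask(question)) for question in paraphrases]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```

A score of 1.0 means every phrasing produced the same answer. Consistency is not correctness, so this check complements an accuracy assertion rather than replacing it.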

Known failure modes of eval systems

| Failure mode | Description | Mitigation |
| --- | --- | --- |
| Goodhart’s Law | Metric becomes the target; model optimizes the eval, not the task | Rotate eval sets; hold out a never-seen test split |
| Overfitting to eval set | Repeated tuning against same N examples inflates scores | Maintain frozen “test” set; use separate “dev” set for iteration |
| Score inflation from verbosity | LLM judges prefer longer answers | Scoring rubrics with explicit length-penalty criteria |
| Coverage illusion | 95% pass rate but all failures in one critical category | Stratify by task type; track per-category pass rates |
| Missing the regression | Prompt change improves one dimension, silently degrades another | Multi-dimensional evals with composite scoring |
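The coverage illusion is easy to demonstrate with per-category pass rates. The sketch below assumes each eval result is a `(category, passed)` pair; the category names in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def pass_rates_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (category, passed) results and compute the pass rate per category."""
    grouped: Dict[str, List[bool]] = defaultdict(list)
    for category, passed in results:
        grouped[category].append(passed)
    return {category: sum(flags) / len(flags) for category, flags in grouped.items()}
```

With 19 passing "general" cases and 1 failing "refunds" case, the overall pass rate is 95% while the "refunds" category sits at 0%. Reporting the per-category numbers next to the aggregate makes that kind of concentrated failure visible.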

Assertion selection by task type

| Task type | Primary assertion | Secondary | Notes |
| --- | --- | --- | --- |
| Classification | Exact match on label | LLM-rubric for borderline cases | Parse before asserting if output is verbose |
| Extraction | JSON schema + field checks | Factuality | Never use equals on whole JSON |
| Summarization | LLM-rubric with explicit rubric | Embedding similarity | Hard to test deterministically |
| Generation (freeform) | LLM-rubric (tone, style, completeness) | Contains-all for required elements | Most expensive to eval |
| Structured output | JSON → schema → field logic | Latency / cost | Composite assertion pattern |
| RAG answers | Context-faithfulness | Answer-relevance | |
| Code generation | Execute + check output | Syntax markers | Running the code is the best test |
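For the code-generation row, "execute + check output" can be sketched with a subprocess and a timeout. This is a minimal illustration that assumes the generated code is Python and prints its result to stdout; it provides no sandboxing, so untrusted model output should only be run inside an isolated environment such as a container.

```python
import subprocess
import sys

def run_and_check(code: str, expected_stdout: str, timeout_s: float = 5.0) -> bool:
    """Run generated Python code in a child process and compare its stdout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Infinite loops and hangs count as failures.
        return False
    # A non-zero exit code (exception, sys.exit(1)) is a failure regardless of output.
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
```

Comparing stripped stdout keeps the check tolerant of trailing newlines while still being a deterministic assertion. For functions rather than scripts, a common variation is to append hidden test calls to the generated code before running it.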

“Good enough” vs over-engineering

Good enough for current stage:

Detects the most common regression types (wrong label, invalid format, missing required content)
Has $\geq 1$ test case per meaningful input category
Runs in < 5 minutes in CI
When you make a prompt change, the evals tell you something you didn’t already know
At least one automated assertion has been validated against human judgment

Over-engineering signals:

Spending more time maintaining evals than improving the product
Every test case requires a model-graded assertion (cost explosion)
500 eval cases before 100 real users
Evals take > 20 minutes; people start skipping them

Heuristic: Evals should cost ~10-15% of LLM development time, not 50%.
