Prompt Evaluation

Prompt evaluation (“evals”) is the discipline of systematically testing whether LLM prompts produce correct, consistent, and useful outputs. It addresses a fundamental gap: traditional software testing assumes deterministic functions where the same input always yields the same output, and LLMs violate that assumption.

Prerequisites

Why LLM evaluation is hard

| Property | Deterministic software | LLMs |
| --- | --- | --- |
| Same input → same output | Always | Probabilistic (temperature > 0) |
| Output space | Finite, enumerable | Effectively unbounded natural language |
| “Correct” answer | Objectively defined | Often subjective, context-dependent |
| Test oracle (= how you know the right answer) | Exact comparison | Requires semantic judgment |
| Failure signal | Stack trace / exception | Plausible-sounding wrong answer |

The most dangerous failure mode is the confident, fluent hallucination — the system produces an answer that looks correct and would pass a naive string check, but is factually wrong or task-inappropriate. Traditional unit tests have no mechanism to catch this.

Taxonomy of eval approaches

Deterministic checks

Work when the output space is finite and enumerable: classification labels, extracted entities, structured JSON, single numeric answers.

  • Exact match / equals: brittle for free text, since “Positive”, “positive”, and “The sentiment is positive” may all be correct
  • Substring / regex — fast, free, but can’t reason about semantics (“contains the word ‘safe’” does not mean the response conveys safety)
  • JSON schema validation — gates more specific field-level checks
  • Cost / latency guards — budget enforcement in CI

Pattern: Use deterministic checks as gates before more expensive semantic checks. If it’s not valid JSON, skip the field-level and semantic assertions.
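The brittleness of exact match can often be reduced by normalizing the output before comparing. A minimal sketch, assuming a hypothetical three-label sentiment task (the label set and function name are illustrative, not from any framework):

```python
import re
from typing import Optional

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical label set


def extract_label(output: str) -> Optional[str]:
    """Return the single allowed label mentioned in the output, or None."""
    words = set(re.findall(r"[a-z]+", output.lower()))
    found = words & ALLOWED_LABELS
    # Zero labels or several labels: refuse to guess, let the check fail.
    return next(iter(found)) if len(found) == 1 else None


# All three phrasings from the bullet above normalize to the same label.
for text in ["Positive", "positive", "The sentiment is positive"]:
    print(repr(text), "->", extract_label(text))
```

This stays a deterministic check, so it inherits the limits of substring matching: “not positive” would still be read as the label positive. Cases like that are where a semantic check earns its cost.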

Model-graded evals (LLM-as-judge)

A separate LLM (the “judge”) is prompted with [input, output, rubric] and asked to score or pass/fail the response. A rubric is a scoring guide — a written description of what “good” looks like, broken into specific criteria the judge checks against (e.g., “the answer must be factually correct, under 3 sentences, and not include information the user didn’t ask for”). This approximates human judgment at scale.
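The [input, output, rubric] structure can be sketched as a prompt template plus a strict verdict parser. This is an illustrative outline, not a reference implementation: the template wording is made up, and `call_judge` is a hypothetical callable standing in for whatever client sends a prompt to the judge model and returns its text.

```python
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading a model response against a rubric.

[Input]
{input}

[Response]
{output}

[Rubric]
{rubric}

Answer on the first line with exactly one word: PASS or FAIL."""


def build_judge_prompt(input_text: str, output_text: str, rubric: str) -> str:
    """Fill the [input, output, rubric] slots of the judge prompt."""
    return JUDGE_TEMPLATE.format(input=input_text, output=output_text, rubric=rubric)


def parse_verdict(judge_reply: str) -> Optional[bool]:
    """Map the judge's first line to True/False, or None if unparseable."""
    lines = judge_reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first == "PASS":
        return True
    if first == "FAIL":
        return False
    return None


def grade(input_text: str, output_text: str, rubric: str,
          call_judge: Callable[[str], str]) -> Optional[bool]:
    """call_judge is a hypothetical function: prompt text in, judge reply out."""
    return parse_verdict(call_judge(build_judge_prompt(input_text, output_text, rubric)))
```

Returning None for an unparseable reply, rather than treating it as a fail, keeps judge malfunctions visible as their own category.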

See LLM as Judge for the full treatment: known biases, rubric design, judge model selection, and the seminal research.

Embedding similarity

Embed both the model output and a reference response, then compute Cosine Similarity between the two vectors. BERTScore is a variant that does this comparison token-by-token and computes precision/recall/F1 across all tokens.

When useful: Summarization where wording doesn’t matter but meaning must be preserved; RAG faithfulness checks.

When misleading: Classification/extraction (semantic similarity to the gold answer doesn’t tell you if the model picked the right label); short outputs (1-3 words where embedding similarity is noisy); numerical divergence (“$100” vs “$10,000” can score highly similar).
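For reference, cosine similarity itself is a few lines. The sketch below uses toy vectors as stand-ins; a real check would obtain the vectors from an embedding model, which is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors: same direction, orthogonal, and opposite.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

The score only measures direction in embedding space, which is why two outputs that differ in one critical number can still look nearly identical.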

Human evals

When necessary:

  • Novel task types where you don’t yet know what “correct” looks like
  • High-stakes decisions (medical, legal, financial)
  • Calibrating/auditing your automated judge
  • When LLM-judge agreement with humans is unknown for your domain

Measure Inter-Rater Reliability to verify raters actually agree beyond chance. Use Cohen’s Kappa ($\kappa$) for two raters on categorical labels, or Krippendorff’s Alpha ($\alpha$) for multiple raters on ordinal/continuous scales. Both equal 0 when agreement is no better than chance and 1 for perfect agreement; negative values mean agreement worse than chance. $\kappa$ above roughly 0.6 is conventionally “substantial agreement.” If agreement falls well below that level, the rubric is underspecified and the human evals are noise.
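Cohen’s Kappa is simple enough to compute by hand: observed agreement, corrected by the agreement expected if both raters labeled at random with their own label frequencies. A minimal sketch with made-up ratings:

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Fraction of items where the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings: raters agree on 3 of 4 items.
a = ["pass", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(a, b))  # 0.5: 75% raw agreement, 50% expected by chance
```

The example shows why raw agreement overstates reliability: 75% agreement shrinks to 0.5 once chance is removed.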

Custom programmatic assertions (composite checks)

The most underused but most powerful pattern. Combine multiple assertion types into a single logical check:

format-gate → field-existence → field-value → semantic/domain logic

A model might return valid JSON (passes format gate) but with a hallucinated field name or out-of-range value. Layering catches each failure mode independently, and the failure output tells you which layer failed.
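The layering can be sketched as a function that returns the name of the first layer that fails. The invoice schema, allowed currencies, and the “total equals sum of line items” rule are hypothetical, chosen only to give each layer something to check.

```python
import json
from typing import Optional

EXPECTED_FIELDS = {"currency", "line_items", "total"}  # hypothetical schema
ALLOWED_CURRENCIES = {"USD", "EUR"}                    # hypothetical domain rule


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def first_failing_layer(output: str) -> Optional[str]:
    """Return the first failing layer's name, or None if every layer passes."""
    # Layer 1, format gate: must parse as a JSON object.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return "format-gate"
    if not isinstance(data, dict):
        return "format-gate"

    # Layer 2, field existence: no missing and no unexpected field names.
    if set(data) != EXPECTED_FIELDS:
        return "field-existence"

    # Layer 3, field values: types and ranges.
    items = data["line_items"]
    values_ok = (
        isinstance(data["currency"], str)
        and data["currency"] in ALLOWED_CURRENCIES
        and _is_number(data["total"]) and data["total"] >= 0
        and isinstance(items, list) and all(_is_number(x) for x in items)
    )
    if not values_ok:
        return "field-value"

    # Layer 4, domain logic: the total must match the line items.
    if abs(sum(items) - data["total"]) > 0.01:
        return "domain-logic"
    return None
```

Because each layer returns early, later layers can assume the earlier ones held, and the returned name is the diagnostic.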

Eval design patterns

Handling non-determinism

| Strategy | Tradeoff |
| --- | --- |
| temperature=0 + fixed seed | Near-deterministic, but not all providers support it; doesn’t reflect production behavior |
| Run N times, check pass rate | Reflects real variance; multiplies cost |
| Statistical significance testing | Gold standard for comparing prompt versions; requires enough samples |
| Threshold-based pass rate | “Passes if it passes 80% of N runs”; practical middle ground |
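The “run N times, check pass rate” and threshold strategies combine naturally. A minimal sketch, where `run_once` is a hypothetical zero-argument callable that invokes the model and returns its output, and `check` is any assertion on that output:

```python
from typing import Callable


def pass_rate(run_once: Callable[[], str],
              check: Callable[[str], bool],
              n_runs: int) -> float:
    """Call the model n_runs times and return the fraction of passing outputs."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if check(run_once()))
    return passes / n_runs


def passes_threshold(rate: float, threshold: float = 0.8) -> bool:
    """Threshold-based verdict: pass if at least `threshold` of runs passed."""
    return rate >= threshold


# Stand-in for a non-deterministic model: 8 good outputs, 2 bad ones.
canned = iter(["ok"] * 8 + ["bad"] * 2)
rate = pass_rate(lambda: next(canned), lambda out: out == "ok", n_runs=10)
print(rate, passes_threshold(rate))
```

The 0.8 default mirrors the 80% figure in the table; the right threshold depends on how costly a single bad output is.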

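For comparing two prompt versions statistically, N=100 per condition is not enough: to detect a real difference of 5 percentage points at $\alpha = 0.05$ with 80% power you need on the order of 800 examples per condition (roughly 700 to 900 when pass rates are in the 80-90% range, and more near 50%). Don’t draw strong conclusions from N < 50 for close comparisons.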
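The order of magnitude can be checked with the usual normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions (two-sided test, equal group sizes, unpooled variance); the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def samples_per_condition(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate cases needed per prompt version to detect p1 vs p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# A 5-point difference at different baseline pass rates.
print(samples_per_condition(0.85, 0.90))  # roughly 680
print(samples_per_condition(0.80, 0.85))  # roughly 900
print(samples_per_condition(0.50, 0.55))  # roughly 1,560
```

The required size grows as pass rates approach 50%, where per-case variance is highest, so a single headline number only holds for a given baseline.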

Building a good eval dataset

| Category | Purpose | Proportion |
| --- | --- | --- |
| Happy path / typical | Representative of real production traffic | ~50% |
| Edge cases | Near-boundary inputs, unusual formats, empty/null | ~20% |
| Adversarial | Inputs designed to trigger known failure modes | ~15% |
| Regression anchors | Cases that previously failed and were fixed | ~15% |

Where eval data should come from (ranked by quality):

  1. Real production logs (sanitized) — highest ecological validity
  2. Human-written examples by domain experts
  3. LLM-generated + human-reviewed
  4. LLM-generated only — lowest quality, high risk of systematic gaps

Minimum sample size: How many test cases do you need before you can trust your pass rate isn’t just noise? This depends on how precise you want the measurement to be.

The formula comes from the standard sample-size calculation for estimating a proportion (here, the pass rate) with a given confidence and margin of error:

$$n = \frac{z^2 \, p(1-p)}{e^2}$$

Notation

  • $n$: required number of test cases
  • $z$: z-score for desired confidence level (1.96 for 95% confidence)
  • $p$: estimated pass rate (use 0.5 = worst case, maximizes the required $n$)
  • $e$: acceptable margin of error (how close to the “true” pass rate you want to be)

With 95% confidence and 5% margin of error:

$$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} = \frac{0.9604}{0.0025} \approx 384.2, \text{ rounded up to } 385$$
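In plain terms: if your eval suite has 385 cases and you measure a 90% pass rate, the true pass rate is between 85% and 95% with 95% confidence. That ±5 points is the worst-case bound (computed at a 50% pass rate); at a measured 90% the margin is closer to ±3 points. With only 50 cases, the worst-case margin is about ±14 points (about ±8 at a 90% pass rate), which is too noisy to detect a 5-point regression.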
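The same calculation in code, as a small sketch using only the standard library. It uses the normal approximation for a proportion; function names and defaults are illustrative.

```python
import math
from statistics import NormalDist


def _z(confidence: float) -> float:
    """Two-sided z-score for a confidence level (about 1.96 for 0.95)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def required_cases(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Test cases needed to estimate a pass rate within +/- margin."""
    z = _z(confidence)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


def margin_of_error(n: int, confidence: float = 0.95, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a pass rate from n cases."""
    return _z(confidence) * math.sqrt(p * (1 - p) / n)


print(required_cases(0.05))             # 385
print(margin_of_error(50))              # about 0.14 (worst case, p = 0.5)
print(margin_of_error(385, p=0.9))      # about 0.03
```

Leaving `p` at 0.5 gives the conservative answer; passing the measured pass rate gives the tighter interval for that specific result.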

Versioning eval sets — avoiding “teaching to the test”

The anti-pattern: iteratively improving a prompt against the same 50 test cases. By iteration 10, the prompt is excellent at those 50 cases and may have degraded on everything else. This is Goodhart’s Law applied to evals.

Disciplines:

  • Frozen test split — a portion of the eval set is never used for prompt iteration. Run it only at milestone checkpoints.
  • Eval dataset versioning — store eval sets in git alongside prompt configs. Treat eval changes as code changes requiring review.
  • Regular refresh — every few weeks, add new examples from production traffic. Retire stale examples if the use case has changed.
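One way to keep a frozen split stable as the dataset grows is to assign each case by hashing its ID, so a case never migrates between splits when new examples are added. A minimal sketch; the split names, the ID format, and the 20% default are assumptions for illustration.

```python
import hashlib


def assign_split(case_id: str, frozen_fraction: float = 0.2) -> str:
    """Deterministically assign an eval case to 'frozen_test' or 'dev'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the hash to a bucket in [0, 10000); the same ID gives the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "frozen_test" if bucket < round(frozen_fraction * 10_000) else "dev"


# Hypothetical case IDs; the assignment does not depend on list order or size.
for case_id in ["case-001", "case-002", "case-003"]:
    print(case_id, assign_split(case_id))
```

Iterate on prompts against the `dev` cases only, and run the `frozen_test` cases at milestone checkpoints.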

Red-teaming and adversarial evals

Systematically trying to break the prompt/system, not just validating the happy path:

| Type | Description |
| --- | --- |
| Prompt injection | User input attempting to override system prompt |
| Jailbreaks | Social engineering to bypass safety constraints |
| Robustness to paraphrase | Same question phrased 5 different ways: are the answers consistent? |
| Out-of-distribution | Languages, formats, domains the model wasn’t designed for |
| Hallucination stress tests | Questions designed to produce plausible-sounding wrong answers |

Run adversarial evals separately from regression evals — they’re a different signal (security/safety vs. accuracy/helpfulness).
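Of the types above, robustness to paraphrase is the easiest to score mechanically when answers are short. A rough sketch, assuming the answers to each paraphrase have already been collected and that light normalization (case, whitespace) is enough to compare them:

```python
from collections import Counter
from typing import Sequence


def paraphrase_consistency(answers: Sequence[str]) -> float:
    """Fraction of answers that match the most common (normalized) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Made-up answers to five phrasings of the same question.
answers = ["Paris", "paris ", " PARIS", "Lyon", "paris"]
print(paraphrase_consistency(answers))  # 0.8
```

For free-text answers, string normalization is too crude and the comparison step would need a semantic check instead.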

Known failure modes of eval systems

| Failure mode | Description | Mitigation |
| --- | --- | --- |
| Goodhart’s Law | Metric becomes the target; model optimizes the eval, not the task | Rotate eval sets; hold out a never-seen test split |
| Overfitting to eval set | Repeated tuning against same N examples inflates scores | Maintain frozen “test” set; use separate “dev” set for iteration |
| Score inflation from verbosity | LLM judges prefer longer answers | Scoring rubrics with explicit length-penalty criteria |
| Coverage illusion | 95% pass rate but all failures in one critical category | Stratify by task type; track per-category pass rates |
| Missing the regression | Prompt change improves one dimension, silently degrades another | Multi-dimensional evals with composite scoring |
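The coverage illusion is easy to demonstrate: the overall number looks healthy while one category fails completely. A minimal sketch with made-up results and hypothetical category names:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per category from (category, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Made-up results: 95 passing general cases, 5 failing refund cases.
results = [("general", True)] * 95 + [("refunds", False)] * 5
overall = sum(passed for _, passed in results) / len(results)
print(overall)                          # 0.95
print(pass_rates_by_category(results))  # {'general': 1.0, 'refunds': 0.0}
```

Reporting the per-category breakdown next to the overall rate is usually enough to make this failure mode visible.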

Assertion selection by task type

| Task type | Primary assertion | Secondary | Notes |
| --- | --- | --- | --- |
| Classification | Exact match on label | LLM-rubric for borderline cases | Parse before asserting if output is verbose |
| Extraction | JSON schema + field checks | Factuality | Never use equals on whole JSON |
| Summarization | LLM-rubric with explicit rubric | Embedding similarity | Hard to test deterministically |
| Generation (freeform) | LLM-rubric (tone, style, completeness) | Contains-all for required elements | Most expensive to eval |
| Structured output | JSON → schema → field logic | Latency / cost | Composite assertion pattern |
| RAG answers | Context-faithfulness | Answer-relevance | |
| Code generation | Execute + check output | Syntax markers | Running the code is the best test |
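For the code generation row, “execute + check output” can be sketched with a subprocess and a timeout. This assumes the generated code is a Python script that reads stdin and writes stdout; the function names are illustrative. A subprocess is not a sandbox, so real use on untrusted model output needs proper isolation (container, restricted user, no network).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, stdin_text: str = "",
                       timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Write the generated code to a temp file and run it in a child process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        return subprocess.run(
            [sys.executable, str(script)],
            input=stdin_text, capture_output=True, text=True, timeout=timeout_s,
        )


def passes(code: str, stdin_text: str, expected_stdout: str) -> bool:
    """True if the code exits cleanly and prints the expected output."""
    try:
        result = run_generated_code(code, stdin_text)
    except subprocess.TimeoutExpired:
        return False  # infinite loops and hangs count as failures
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# A made-up model output for the task "double the input number".
print(passes("print(int(input()) * 2)", "21\n", "42"))
```

Comparing stripped stdout is the simplest oracle; a stricter variant would run several input/output pairs per task.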

“Good enough” vs over-engineering

Good enough for current stage:

  • Detects the most common regression types (wrong label, invalid format, missing required content)
  • Has at least one test case per meaningful input category
  • Runs in < 5 minutes in CI
  • When you make a prompt change, the evals tell you something you didn’t already know
  • At least one automated assertion has been validated against human judgment

Over-engineering signals:

  • Spending more time maintaining evals than improving the product
  • Every test case requires a model-graded assertion (cost explosion)
  • 500 eval cases before 100 real users
  • Evals take > 20 minutes; people start skipping them

Heuristic: Evals should cost ~10-15% of LLM development time, not 50%.

See also

  • LLM as Judge — deep dive on model-graded evaluation, biases, rubric design
  • Promptfoo — open-source eval framework (CLI-first, YAML-driven)
  • Prompt Engineering — techniques for writing the prompts being evaluated
  • Sampling techniques — temperature and its impact on output determinism