Prompt Evaluation
Prompt evaluation (“evals”) is the discipline of systematically testing whether LLM prompts produce correct, consistent, and useful outputs. It addresses a fundamental gap: traditional software testing assumes deterministic functions, where the same input always yields the same output and correctness can be checked exactly. LLMs violate both assumptions.
Prerequisites
- Prompt Engineering — techniques for crafting prompts
- Basic understanding of Sampling techniques (temperature, top-p)
Why LLM evaluation is hard
| Property | Deterministic software | LLMs |
|---|---|---|
| Same input → same output | Always | Probabilistic (temperature > 0) |
| Output space | Finite, enumerable | Effectively unbounded natural language |
| “Correct” answer | Objectively defined | Often subjective, context-dependent |
| Test oracle (= how you know the right answer) | Exact comparison | Requires semantic judgment |
| Failure signal | Stack trace / exception | Plausible-sounding wrong answer |
The most dangerous failure mode is the confident, fluent hallucination — the system produces an answer that looks correct and would pass a naive string check, but is factually wrong or task-inappropriate. Traditional unit tests have no mechanism to catch this.
Taxonomy of eval approaches
Deterministic checks
Work when the output space is finite and enumerable: classification labels, extracted entities, structured JSON, single numeric answers.
- Exact match / equals — brittle for free-text; a model saying “Positive” vs “positive” vs “The sentiment is positive” may all be correct
- Substring / regex — fast, free, but can’t reason about semantics (“contains the word ‘safe’” does not mean the response conveys safety)
- JSON schema validation — gates more specific field-level checks
- Cost / latency guards — budget enforcement in CI
Pattern: Use deterministic checks as gates before more expensive semantic checks. If it’s not valid JSON, skip the field-level and semantic assertions.
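The brittleness of exact match noted above is often handled by parsing the label out of the raw text before comparing. A minimal sketch, assuming a three-way sentiment task; the label set and the function names `extract_label` and `check_label` are hypothetical, not part of any framework:

```python
import re
from typing import Optional

# Hypothetical label set for a sentiment classifier.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def extract_label(raw_output: str) -> Optional[str]:
    """Return the single allowed label mentioned in the output, or None."""
    words = set(re.findall(r"[a-z]+", raw_output.lower()))
    found = words & ALLOWED_LABELS
    # Zero labels or several labels both count as a parse failure.
    return next(iter(found)) if len(found) == 1 else None


def check_label(raw_output: str, expected: str) -> bool:
    """Deterministic check: parse first, then compare exactly."""
    return extract_label(raw_output) == expected
```

With this, "Positive", "positive", and "The sentiment is positive" all pass against the expected label `positive`. A purely lexical parse like this would still accept "not positive", which is the semantic blind spot of substring and regex checks described above.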
Model-graded evals (LLM-as-judge)
A separate LLM (the “judge”) is prompted with [input, output, rubric] and asked to score or pass/fail the response. A rubric is a scoring guide — a written description of what “good” looks like, broken into specific criteria the judge checks against (e.g., “the answer must be factually correct, under 3 sentences, and not include information the user didn’t ask for”). This approximates human judgment at scale.
See LLM as Judge for the full treatment: known biases, rubric design, judge model selection, and the seminal research.
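The [input, output, rubric] structure can be sketched as a prompt template plus a strict verdict parser. This is an illustration under stated assumptions: `call_judge` is a hypothetical callable that sends a prompt to whatever judge model you use and returns its text reply, and the PASS/FAIL reply format is a convention chosen here, not a standard.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading a model response against a rubric.

Rubric:
{rubric}

User input:
{user_input}

Model response:
{model_output}

Reply with PASS or FAIL on the first line, then one sentence of reasoning."""


def judge(user_input: str, model_output: str, rubric: str,
          call_judge: Callable[[str], str]) -> bool:
    """Ask the judge model for a verdict and parse it strictly."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, user_input=user_input, model_output=model_output
    )
    reply = call_judge(prompt)  # hypothetical: wraps your provider's API
    lines = reply.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    if first_line.startswith("PASS"):
        return True
    if first_line.startswith("FAIL"):
        return False
    # An unparseable verdict is an eval error, not a silent fail.
    raise ValueError(f"Unparseable judge reply: {reply!r}")
```

Passing the judge in as a callable keeps the sketch provider-agnostic and lets a stub stand in for the model when testing the eval harness itself.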
Embedding similarity
Embed both the model output and a reference response, then compute Cosine Similarity between the two vectors. BERTScore is a variant that does this comparison token-by-token and computes precision/recall/F1 across all tokens.
When useful: Summarization where wording doesn’t matter but meaning must be preserved; RAG faithfulness checks.
When misleading: Classification/extraction (semantic similarity to the gold answer doesn’t tell you if the model picked the right label); short outputs (1-3 words, where embedding similarity is noisy); numerical divergence (“$100” vs “$10,000” can score as highly similar).
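The comparison step itself is simple once the two embeddings exist. A minimal sketch, assuming the vectors have already been produced by some embedding model (not shown); the function names and the 0.8 threshold are illustrative and would need tuning per task:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_similarity(output_vec: Sequence[float],
                      reference_vec: Sequence[float],
                      threshold: float = 0.8) -> bool:
    """Pass if the output embedding is close enough to the reference."""
    return cosine_similarity(output_vec, reference_vec) >= threshold
```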
Human evals
When necessary:
- Novel task types where you don’t yet know what “correct” looks like
- High-stakes decisions (medical, legal, financial)
- Calibrating/auditing your automated judge
- When LLM-judge agreement with humans is unknown for your domain
Measure Inter-Rater Reliability to verify raters actually agree beyond chance. Use Cohen’s Kappa ($\kappa$) for two raters on categorical labels, or Krippendorff’s Alpha ($\alpha$) for multiple raters on ordinal/continuous scales. Both equal 0 when agreement is no better than chance and 1 at perfect agreement; negative values mean agreement is worse than chance. $\kappa > 0.6$ is conventionally “substantial agreement.” If agreement falls well below that, the rubric is underspecified and the human evals are noise.
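Cohen's kappa is small enough to compute directly: it is $(p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each rater's label frequencies. A minimal sketch for two raters; the function name and the pass/fail labels are illustrative:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


a = ["pass", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(a, b)  # observed 0.75, chance 0.5, so kappa is 0.5
```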
Custom programmatic assertions (composite checks)
The most underused but most powerful pattern. Combine multiple assertion types into a single logical check:
format-gate → field-existence → field-value → semantic/domain logic
A model might return valid JSON (passes format gate) but with a hallucinated field name or out-of-range value. Layering catches each failure mode independently, and the failure output tells you which layer failed.
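The layering can be sketched as a function that stops at the first failing layer and reports which one it was. This assumes a hypothetical extraction schema with exactly two fields, `sentiment` and `confidence`; the schema and the function name `check_output` are invented for illustration.

```python
import json
from typing import Tuple

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = {"sentiment", "confidence"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_output(raw: str) -> Tuple[bool, str]:
    """Run layered checks; return (passed, name of the failing layer or 'ok')."""
    # Layer 1: format gate.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "format: not valid JSON"
    if not isinstance(data, dict):
        return False, "format: top-level value is not an object"

    # Layer 2: field existence (also rejects hallucinated field names).
    missing = REQUIRED_FIELDS - set(data)
    unexpected = set(data) - REQUIRED_FIELDS
    if missing or unexpected:
        return False, f"fields: missing={sorted(missing)} unexpected={sorted(unexpected)}"

    # Layer 3: field values.
    sentiment = data["sentiment"]
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return False, "value: sentiment not in allowed set"
    confidence = data["confidence"]
    if (isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
            or not 0 <= confidence <= 1):
        return False, "value: confidence outside [0, 1]"

    # Layer 4: semantic/domain checks (e.g. a model-graded assertion) would
    # run here, only after every cheaper layer has passed.
    return True, "ok"
```

Returning the layer name alongside the boolean is what makes a failure report actionable: a spike in `format` failures points at the output instructions, while `value` failures point at the task itself.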
Eval design patterns
Handling non-determinism
| Strategy | Tradeoff |
|---|---|
| `temperature=0` + fixed seed | Near-deterministic, but not all providers support it; doesn’t reflect production behavior |
| Run N times, check pass rate | Reflects real variance; multiplies cost |
| Statistical significance testing | Gold standard for comparing prompt versions; requires enough samples |
| Threshold-based pass rate | “Passes if it passes 80% of N runs”; practical middle ground |
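The threshold-based strategy from the table can be sketched in a few lines. Here `run_once` is a hypothetical callable that executes the prompt once and returns whether that single run passed its assertions; the defaults of 10 runs and an 80% threshold are illustrative.

```python
from typing import Callable


def passes_threshold(run_once: Callable[[], bool],
                     n_runs: int = 10,
                     min_pass_rate: float = 0.8) -> bool:
    """Run a non-deterministic check n_runs times and compare the pass rate."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if run_once())
    return passes / n_runs >= min_pass_rate
```

The cost tradeoff is visible in the signature: every test case now costs `n_runs` model calls instead of one.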
For comparing two prompt versions statistically: N=100 per condition is not enough to reliably detect a real difference of 5 percentage points. You need roughly 800 examples per condition (the exact figure depends on the baseline pass rate) to detect it at $\alpha = 0.05$ with 80% power. Don’t draw strong conclusions from N < 50 for close comparisons.
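The required sample size for such a comparison can be estimated with the standard normal-approximation formula for a two-sided test of two proportions. A minimal sketch; the function name is illustrative, and the result depends on the baseline pass rates you plug in:

```python
import math
from statistics import NormalDist


def samples_per_condition(p_baseline: float, p_new: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate cases needed per prompt version to detect p_baseline vs p_new."""
    if p_baseline == p_new:
        raise ValueError("the two pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_baseline - p_new) ** 2
    return math.ceil(n)


# A 5-point lift from an 80% baseline needs roughly 900 cases per condition.
n_needed = samples_per_condition(0.80, 0.85)
```

Because the variance term shrinks as pass rates approach 0% or 100%, the same 5-point difference needs fewer cases at a 90% baseline than at a 50% baseline.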
Building a good eval dataset
| Category | Purpose | Proportion |
|---|---|---|
| Happy path / typical | Representative of real production traffic | ~50% |
| Edge cases | Near-boundary inputs, unusual formats, empty/null | ~20% |
| Adversarial | Inputs designed to trigger known failure modes | ~15% |
| Regression anchors | Cases that previously failed and were fixed | ~15% |
Where eval data should come from (ranked by quality):
- Real production logs (sanitized) — highest ecological validity
- Human-written examples by domain experts
- LLM-generated + human-reviewed
- LLM-generated only — lowest quality, high risk of systematic gaps
Minimum sample size: How many test cases do you need before you can trust your pass rate isn’t just noise? This depends on how precise you want the measurement to be.
The formula comes from the standard sample-size calculation for estimating a proportion (here, the pass rate) with a given confidence and margin of error:
$$n = \frac{z^2 \, p(1 - p)}{e^2}$$
Notation
- $n$ — required number of test cases
- $z$ — z-score for desired confidence level (1.96 for 95% confidence)
- $p$ — estimated pass rate (use 0.5 = worst case, maximizes the required $n$)
- $e$ — acceptable margin of error (how close to the “true” pass rate you want to be)
With 95% confidence and 5% margin of error:
$$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} = \frac{0.9604}{0.0025} = 384.16 \;\Rightarrow\; n = 385$$
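In plain terms: if your eval suite has 385 cases, a measured pass rate is within 5 percentage points of the true pass rate with 95% confidence, even in the worst case of a 50% pass rate. At a measured 90% pass rate the interval is tighter, roughly 87% to 93%. With only 50 cases, the worst-case margin grows to about 14 points (about 8 points at a 90% pass rate), which is too noisy to detect a 5-point regression.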
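The arithmetic is easy to reproduce. A minimal sketch, assuming the usual normal-approximation formula for a proportion, where the required size is z² · p(1 − p) / e² and the margin for a given size is z · sqrt(p(1 − p) / n); the function names are illustrative:

```python
import math


def required_cases(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Test cases needed to estimate a pass rate within +/- margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a pass rate measured on n cases."""
    return z * math.sqrt(p * (1 - p) / n)


cases_for_5_points = required_cases(0.05)  # 385 at 95% confidence, worst case p = 0.5
worst_case_margin_50 = margin_of_error(50)  # about 0.14
```

Passing the measured pass rate as `p` instead of the default 0.5 gives the tighter, non-worst-case interval.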
Versioning eval sets — avoiding “teaching to the test”
The anti-pattern: iteratively improving a prompt against the same 50 test cases. By iteration 10, the prompt is excellent at those 50 cases and may have degraded on everything else. This is Goodhart’s Law applied to evals.
Disciplines:
- Frozen test split — a portion of the eval set is never used for prompt iteration. Run it only at milestone checkpoints.
- Eval dataset versioning — store eval sets in git alongside prompt configs. Treat eval changes as code changes requiring review.
- Regular refresh — every few weeks, add new examples from production traffic. Retire stale examples if the use case has changed.
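One way to keep the frozen split stable as the dataset grows is to assign each case by hashing its ID, so a case never migrates between splits when new examples are added. This is a sketch of one possible approach, not a prescribed method; the function name, the split names, and the 20% fraction are assumptions.

```python
import hashlib


def assign_split(case_id: str, frozen_fraction: float = 0.2) -> str:
    """Deterministically assign an eval case to 'frozen_test' or 'dev'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "frozen_test" if bucket < frozen_fraction else "dev"
```

Because the assignment depends only on the case ID, it can be recomputed anywhere (CI, notebooks, review tooling) without storing a separate split file.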
Red-teaming and adversarial evals
Systematically trying to break the prompt/system, not just validating the happy path:
| Type | Description |
|---|---|
| Prompt injection | User input attempting to override system prompt |
| Jailbreaks | Social engineering to bypass safety constraints |
| Robustness to paraphrase | Same question phrased 5 different ways — consistent answers? |
| Out-of-distribution | Languages, formats, domains the model wasn’t designed for |
| Hallucination stress tests | Questions designed to produce plausible-sounding wrong answers |
Run adversarial evals separately from regression evals — they’re a different signal (security/safety vs. accuracy/helpfulness).
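The paraphrase-robustness row lends itself to a simple score: ask the same question several ways and measure how often the answers agree. A minimal sketch for short, canonical answers (longer free-text answers would need a semantic comparison instead); the function name is illustrative:

```python
from collections import Counter
from typing import Sequence


def paraphrase_consistency(answers: Sequence[str]) -> float:
    """Fraction of answers that match the most common normalized answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower() for answer in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Answers collected from five paraphrases of the same question.
score = paraphrase_consistency(["Paris", "paris ", " PARIS", "Lyon", "Paris"])
# Four of the five answers agree, so the score is 0.8.
```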
Known failure modes of eval systems
| Failure mode | Description | Mitigation |
|---|---|---|
| Goodhart’s Law | Metric becomes the target; model optimizes the eval, not the task | Rotate eval sets; hold out a never-seen test split |
| Overfitting to eval set | Repeated tuning against same N examples inflates scores | Maintain frozen “test” set; use separate “dev” set for iteration |
| Score inflation from verbosity | LLM judges prefer longer answers | Scoring rubrics with explicit length-penalty criteria |
| Coverage illusion | 95% pass rate but all failures in one critical category | Stratify by task type; track per-category pass rates |
| Missing the regression | Prompt change improves one dimension, silently degrades another | Multi-dimensional evals with composite scoring |
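The coverage illusion in the table is straightforward to guard against by reporting pass rates per category rather than one aggregate. A minimal sketch, assuming each result is a `(category, passed)` pair; the category names and the 0.9 threshold are invented for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rates_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate separately for each category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def failing_categories(results: Iterable[Tuple[str, bool]],
                       min_rate: float = 0.9) -> List[str]:
    """Categories whose pass rate falls below min_rate, sorted by name."""
    rates = pass_rates_by_category(results)
    return sorted(category for category, rate in rates.items() if rate < min_rate)


# 19 of 20 cases pass overall (95%), but every failure is in "refund".
results = [("faq", True)] * 18 + [("refund", True), ("refund", False)]
flagged = failing_categories(results)
```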
Assertion selection by task type
| Task type | Primary assertion | Secondary | Notes |
|---|---|---|---|
| Classification | Exact match on label | LLM-rubric for borderline cases | Parse before asserting if output is verbose |
| Extraction | JSON schema + field checks | Factuality | Never use equals on whole JSON |
| Summarization | LLM-rubric with explicit criteria | Embedding similarity | Hard to test deterministically |
| Generation (freeform) | LLM-rubric (tone, style, completeness) | Contains-all for required elements | Most expensive to eval |
| Structured output | JSON → schema → field logic | Latency / cost | Composite assertion pattern |
| RAG answers | Context-faithfulness | Answer-relevance | |
| Code generation | Execute + check output | Syntax markers | Running the code is the best test |
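For the code-generation row, "execute + check output" can be sketched with a subprocess and a timeout. This is an illustration only: it assumes the generated code is a standalone Python script that prints its answer, and it provides no sandboxing, so model-generated code should only be run this way inside an isolated environment. The function name is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_and_check(code: str, expected_stdout: str, timeout_s: float = 10.0) -> bool:
    """Run generated Python code in a subprocess and compare its stdout."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures.
            return False
    # A non-zero exit code (crash, uncaught exception) also fails the check.
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()
```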
“Good enough” vs over-engineering
Good enough for current stage:
- Detects the most common regression types (wrong label, invalid format, missing required content)
- Has at least one test case per meaningful input category
- Runs in < 5 minutes in CI
- When you make a prompt change, the evals tell you something you didn’t already know
- At least one automated assertion has been validated against human judgment
Over-engineering signals:
- Spending more time maintaining evals than improving the product
- Every test case requires a model-graded assertion (cost explosion)
- 500 eval cases before 100 real users
- Evals take > 20 minutes; people start skipping them
Heuristic: Evals should cost ~10-15% of LLM development time, not 50%.
See also
- LLM as Judge — deep dive on model-graded evaluation, biases, rubric design
- Promptfoo — open-source eval framework (CLI-first, YAML-driven)
- Prompt Engineering — techniques for writing the prompts being evaluated
- Sampling techniques — temperature and its impact on output determinism