Promptfoo
Promptfoo is an open-source, CLI-first, YAML-driven eval framework for LLM prompts. It runs a matrix of (prompt × provider × test case), executes assertions on each output, and produces a pass/fail report.
Prerequisites
- Prompt Evaluation — the concepts and patterns promptfoo implements
- LLM as Judge — how model-graded assertions work under the hood
Architecture
The execution model is a three-dimensional matrix: prompts × providers × test cases. Define one prompt, three providers (e.g., GPT-4o-mini, Claude Haiku, Ollama llama3), and 50 test cases — promptfoo runs all 150 combinations, executes assertions on each output, and produces a pass/fail report (JSON for CI gates, HTML for human review).
The config file (promptfooconfig.yaml) declares four top-level sections:
- prompts[]: template strings or file paths (with {{variable}} interpolation)
- providers[]: which models to call (OpenAI, Anthropic, Ollama, Bedrock, etc.)
- tests[]: each test case provides vars (template inputs) and assert[] (assertions to check)
- defaultTest: shared assertions applied to every test case (e.g., latency guards)
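The matrix expansion described above can be sketched in a few lines of Python. This is only an illustration of the counting, not promptfoo's implementation: the provider names, the `render` helper, and the `expand_matrix` function are hypothetical, and real template rendering is richer than the plain string replacement used here.

```python
from itertools import product

# Hypothetical stand-ins for the prompts[], providers[] and tests[] sections.
prompts = ["Classify the following as positive/negative/neutral: {{text}}"]
providers = ["provider-a", "provider-b", "provider-c"]
tests = [{"vars": {"text": f"sample input {i}"}} for i in range(50)]


def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders using plain string replacement."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


def expand_matrix(prompts, providers, tests):
    """Return one (rendered_prompt, provider, test) cell per combination."""
    return [
        (render(prompt, test["vars"]), provider, test)
        for prompt, provider, test in product(prompts, providers, tests)
    ]


cells = expand_matrix(prompts, providers, tests)
print(len(cells))  # 1 prompt * 3 providers * 50 tests = 150
```

Each cell is then sent to its provider and the resulting output is checked against that test's assertions.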
Assertion types
Deterministic (no LLM call, free)
| Type | Description |
|---|---|
| equals | Exact string match |
| contains / icontains | Substring (case-sensitive / insensitive) |
| contains-any / contains-all | At least one / all of a list |
| regex | Pattern match |
| starts-with | Prefix check |
| is-json / contains-json | Valid JSON validation |
| latency | Response time < threshold (ms) |
| cost | Inference cost < threshold ($) |
| javascript | Inline JS function returning bool/score |
| python | Inline or file-based Python returning bool/score |
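To make the table concrete, here is a minimal sketch of what a few of these deterministic checks amount to. The function names are hypothetical and the semantics are simplified; promptfoo's own implementations may handle options and edge cases differently.

```python
import json
import re


def check_icontains(output: str, value: str) -> bool:
    """Case-insensitive substring check."""
    return value.lower() in output.lower()


def check_contains_any(output: str, values: list) -> bool:
    """True if at least one candidate string occurs in the output."""
    return any(v in output for v in values)


def check_contains_all(output: str, values: list) -> bool:
    """True only if every candidate string occurs in the output."""
    return all(v in output for v in values)


def check_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_is_json(output: str) -> bool:
    """True if the whole output parses as JSON."""
    try:
        json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True
```

None of these call a model, which is why they are free and fully repeatable.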
Model-graded (LLM call, costs tokens)
| Type | Description |
|---|---|
| llm-rubric | Freeform rubric evaluated by judge LLM |
| factuality | Checks factual agreement with reference |
| g-eval | G-Eval (see below): forces the judge to reason step-by-step before scoring |
| answer-relevance | Is the answer relevant to the question? |
| similar | Embedding cosine similarity threshold |
| context-faithfulness | (RAG) Answer is grounded in provided context |
| context-relevance | (RAG) Retrieved context is relevant to query |
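The similar row relies on cosine similarity between two embedding vectors. A minimal sketch of that comparison follows; the function names and the default threshold of 0.8 are assumptions for illustration, and the embeddings themselves would come from whichever embedding model is configured.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes_similar(output_emb: list, reference_emb: list, threshold: float = 0.8) -> bool:
    """Pass when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(output_emb, reference_emb) >= threshold
```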
How LLM-as-judge works internally
For llm-rubric, promptfoo constructs a judge prompt along these lines (paraphrased):
You are an expert evaluator. Given the following:
- Original prompt: {prompt}
- Model output: {output}
- Evaluation criterion: {rubric}
Does the output meet the criterion?
Respond with JSON: {"pass": true/false, "score": 0.0-1.0, "reason": "..."}
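The round trip implied by this template (fill in the slots, send to a judge, parse the JSON verdict) can be sketched as follows. This is not promptfoo's code: `build_judge_prompt` and `parse_verdict` are hypothetical helpers, the template is the paraphrase shown above, and treating an unparseable reply as a failure is one possible design choice among several.

```python
import json

# Paraphrased template from the text above; doubled braces are literal braces.
JUDGE_TEMPLATE = """You are an expert evaluator. Given the following:
- Original prompt: {prompt}
- Model output: {output}
- Evaluation criterion: {rubric}
Does the output meet the criterion?
Respond with JSON: {{"pass": true/false, "score": 0.0-1.0, "reason": "..."}}"""

FAILED = {"pass": False, "score": 0.0, "reason": "judge reply could not be parsed"}


def build_judge_prompt(prompt: str, output: str, rubric: str) -> str:
    """Fill the three slots of the judge template."""
    return JUDGE_TEMPLATE.format(prompt=prompt, output=output, rubric=rubric)


def parse_verdict(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' in the judge reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(FAILED)
    try:
        data = json.loads(raw[start:end + 1])
        passed = bool(data.get("pass", False))
        score = float(data.get("score", 1.0 if passed else 0.0))
    except (ValueError, TypeError):
        return dict(FAILED)
    return {"pass": passed, "score": score, "reason": str(data.get("reason", ""))}
```

Judges often wrap their JSON in extra prose, which is why the parser looks for the outermost braces instead of parsing the whole reply.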
Key design choices:
- The judge model is configurable via the provider field on the assertion
- You can override the judge prompt entirely via rubricPrompt
- g-eval uses a different internal prompt (see G-Eval section below)
- By default, promptfoo uses your configured defaultProvider as judge; override explicitly to avoid self-preference bias
assert:
  - type: llm-rubric
    value: "Response is professional and contains no profanity"
    provider: anthropic:claude-3-5-sonnet  # explicit judge model
Config example
# promptfooconfig.yaml
prompts:
  - "Classify the following as positive/negative/neutral: {{text}}"
providers:
  - openai:gpt-4o-mini
  - ollama:llama3.1:8b
defaultTest:
  assert:
    - type: latency
      threshold: 3000
tests:
  - vars:
      text: "I love this product!"
    assert:
      - type: icontains
        value: "positive"
  - vars:
      text: "This is the worst experience I've had"
    assert:
      - type: icontains
        value: "negative"
Composite assertion pattern
The most powerful pattern for structured output tasks:
assert:
  - type: is-json  # gate: must be valid JSON
  - type: javascript
    value: |
      const parsed = JSON.parse(output);
      return (
        parsed.hasOwnProperty('status') &&
        ['approved', 'rejected', 'pending'].includes(parsed.status) &&
        typeof parsed.confidence === 'number' &&
        parsed.confidence >= 0 && parsed.confidence <= 1
      );
Pattern: format-gate → field-existence → field-value → semantic/domain logic. Each layer catches a different failure mode, and the output tells you which layer failed.
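The same layering can be written as a file-based Python assertion that names the failing layer in its reason. This sketch assumes promptfoo's convention of a `get_assert(output, context)` function returning a dict with `pass`, `score`, and `reason` keys; check that convention against the promptfoo documentation before relying on it. The `_fail` helper and the field names are illustrative.

```python
import json

ALLOWED_STATUSES = ("approved", "rejected", "pending")


def _fail(layer: str, detail: str) -> dict:
    """Build a failing result whose reason starts with the layer name."""
    return {"pass": False, "score": 0.0, "reason": f"{layer}: {detail}"}


def get_assert(output, context):
    # Layer 1: format gate
    try:
        parsed = json.loads(output)
    except ValueError:
        return _fail("format", "output is not valid JSON")
    if not isinstance(parsed, dict):
        return _fail("format", "top-level JSON value is not an object")

    # Layer 2: field existence
    missing = [key for key in ("status", "confidence") if key not in parsed]
    if missing:
        return _fail("field-existence", f"missing {missing}")

    # Layer 3: field values
    if parsed["status"] not in ALLOWED_STATUSES:
        return _fail("field-value", "status is not an allowed value")
    confidence = parsed["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        return _fail("field-value", "confidence must be a number in [0, 1]")

    return {"pass": True, "score": 1.0, "reason": "all layers passed"}
```

Returning early at the first failing layer keeps the report readable: a malformed response is reported as a format problem, not as a cascade of missing-field errors.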
G-Eval: structured judging with reasoning steps
G-Eval (Liu et al., 2023) is a technique for making LLM-as-judge more reliable. The problem it solves: when you ask a judge model “rate this output 1-5,” the score is noisy — the model jumps to a number without systematic thinking.
G-Eval fixes this by structuring the judge prompt into explicit steps:
- State the criteria: what dimensions are being evaluated (e.g., fluency, relevance, coherence)
- List evaluation steps: force the judge to check each criterion one by one (“Step 1: check if the answer addresses the question. Step 2: check if it introduces unsupported claims…”)
- Generate reasoning: the judge writes out its thinking before scoring
- Produce final score: only after reasoning is complete
This is essentially Chain-of-Thought prompting (asking the model to show its work before giving a final answer) applied to evaluation. The reasoning step reduces score variance across runs because the model anchors on its own explicit analysis rather than producing a gut-feeling number.
In promptfoo, the g-eval assertion type handles this automatically — you provide the criteria, and it constructs the multi-step judge prompt internally.
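As a rough illustration of the four-part structure, the sketch below assembles a G-Eval style judge prompt and extracts the final score. It is not promptfoo's internal prompt: the wording, the hand-written steps, the 1-5 scale, and the `Score:` line convention are all assumptions made for this example.

```python
import re


def build_g_eval_prompt(criteria: str, steps: list, source: str, output: str,
                        scale: tuple = (1, 5)) -> str:
    """Assemble criteria, numbered steps, and a reason-then-score instruction."""
    low, high = scale
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Input:\n{source}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        "Work through each step and write out your reasoning first.\n"
        f"Then end with a final line of the form 'Score: <integer from {low} to {high}>'."
    )


def parse_score(reply: str, scale: tuple = (1, 5)):
    """Return the last 'Score: N' value in the reply, or None if absent or out of range."""
    matches = re.findall(r"Score:\s*(\d+)", reply)
    if not matches:
        return None
    score = int(matches[-1])
    low, high = scale
    return score if low <= score <= high else None
```

Taking the last `Score:` match matters because the reasoning text may mention intermediate scores before the final one.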
Repeated runs for non-determinism
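The idea behind repeated runs is a pass-rate rule: run the same test case several times and accept it when the fraction of successful runs reaches a threshold. A minimal sketch of that rule, with a hypothetical function name and no claim about how promptfoo aggregates repetitions internally:

```python
def passes_with_repeats(outcomes: list, pass_threshold: float) -> bool:
    """True when the share of successful runs is at least pass_threshold."""
    if not outcomes:
        raise ValueError("at least one run is required")
    pass_rate = sum(1 for ok in outcomes if ok) / len(outcomes)
    return pass_rate >= pass_threshold


# Two successes out of three runs: 2/3 is about 0.667, which clears 0.66.
print(passes_with_repeats([True, False, True], 0.66))
```

A threshold written as 0.66 instead of 0.67 is deliberate here: 2/3 rounds to 0.667, so 0.67 would reject a two-out-of-three result.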
# Run each test case 3 times, pass if >= 2/3 succeed
defaultTest:
  options:
    numRepetitions: 3
    passThreshold: 0.66
Red-teaming
Promptfoo has first-class red-teaming support — 142+ plugins covering prompt injection, jailbreaks, PII leakage, toxicity, and the OWASP LLM Top 10. This is one of its strongest differentiators over other eval tools.
CI/CD integration
Promptfoo is designed to run in CI pipelines as a quality gate:
# GitLab CI example
prompt-eval:
  stage: test
  rules:
    - changes:
        - prompts/**
        - promptfooconfig.yaml
  script:
    - npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
    - |
      FAILURES=$(jq '.results.stats.failures' results.json)
      # if/fi instead of `[ ... ] && exit 1`, which itself exits non-zero when FAILURES is 0
      if [ "$FAILURES" -gt 0 ]; then exit 1; fi
  artifacts:
    paths: [results.json]
Best practices:
- Only trigger on paths that change prompts or config
- Persist promptfoo’s cache between pipeline runs to avoid redundant API calls
- Export JSON (machine-readable for quality gate) and HTML (human-readable artifact)
- Set --max-failures N for fast-fail on catastrophic regressions
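Where jq is not available on the runner, the quality gate can be expressed as a small Python helper. This sketch assumes the exported JSON has the `results.stats.failures` path used in the pipeline above; the function names and the `max_failures` parameter are hypothetical.

```python
import json


def load_report(path: str) -> dict:
    """Read the JSON report written by the eval step."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def gate_exit_code(report: dict, max_failures: int = 0) -> int:
    """Return 0 if failures stay within max_failures, otherwise 1."""
    failures = int(report["results"]["stats"]["failures"])
    return 0 if failures <= max_failures else 1
```

In a pipeline step this would be called as `sys.exit(gate_exit_code(load_report("results.json")))`, so the job status follows the failure count.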
Comparison with alternatives
| Dimension | promptfoo | LangSmith | Braintrust | DeepEval | Ragas |
|---|---|---|---|---|---|
| Primary abstraction | YAML eval config | Trace + dataset | Experiment (code-centric) | pytest suite | RAG metrics library |
| Interface | CLI + web UI | Web UI + SDK | Web UI + SDK | Python API | Python API |
| Tracing/observability | No (eval only) | Core feature | Core feature | No | No |
| Open source | Full | Partial | Partial | Full | Full |
| Red-teaming | First-class (142+ plugins) | No | No | Limited | No |
| RAG-specific metrics | Partial | Partial | Partial | Good | Best-in-class |
| CI/CD native | First-class | Possible | Good | Via pytest | Possible |
| Multi-model comparison | Built-in (matrix evals) | Manual | Manual | Manual | No |
When to choose promptfoo: Prompt iteration, security red-teaming, multi-model comparison, CI/CD quality gates. Best for teams that want eval-as-code without a vendor platform.
When to choose something else:
- Deep in the LangChain ecosystem → LangSmith (native tracing + eval)
- Need production observability + eval in one tool → Braintrust
- Python/pytest shop wanting minimal config → DeepEval
- Primary concern is RAG pipeline quality → Ragas
See also
- Prompt Evaluation — the concepts and design patterns
- LLM as Judge — how model-graded assertions work, biases to watch for
- Prompt Engineering — writing the prompts that promptfoo evaluates