Promptfoo

Promptfoo is an open-source, CLI-first, YAML-driven eval framework for LLM prompts. It runs a matrix of (prompt × provider × test case), executes assertions on each output, and produces a pass/fail report.

Prerequisites

Prompt Evaluation — the concepts and patterns promptfoo implements

LLM as Judge — how model-graded assertions work under the hood

Architecture

The execution model is a three-dimensional matrix: prompts × providers × test cases. With one prompt, three providers (e.g., GPT-4o-mini, Claude Haiku, Ollama llama3), and 50 test cases, promptfoo runs all 1 × 3 × 50 = 150 combinations and checks the assertions on each output. The resulting pass/fail report can be exported as JSON (for CI gates) or HTML (for human review).
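As a rough illustration of the matrix arithmetic, the sketch below enumerates every cell by hand. It is not promptfoo's scheduler; `buildMatrix` and the provider ids are hypothetical names used only for this example.

```javascript
// Enumerate every (prompt, provider, test case) cell of the evaluation matrix.
// buildMatrix is a hypothetical helper, not a promptfoo API.
function buildMatrix(prompts, providers, tests) {
  const cells = [];
  for (const prompt of prompts) {
    for (const provider of providers) {
      for (const test of tests) {
        cells.push({ prompt, provider, vars: test.vars });
      }
    }
  }
  return cells;
}

const prompts = ["Classify the following as positive/negative/neutral: {{text}}"];
const providers = ["provider-a", "provider-b", "provider-c"]; // placeholder ids
const tests = Array.from({ length: 50 }, (_, i) => ({ vars: { text: `sample ${i}` } }));

const cells = buildMatrix(prompts, providers, tests);
console.log(cells.length); // 1 * 3 * 50 = 150
```

Every added provider or prompt multiplies the number of model calls, which is worth keeping in mind when estimating eval cost.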

The config file (promptfooconfig.yaml) declares four top-level sections:

prompts[] — template strings or file paths (with {{variable}} interpolation)
providers[] — which models to call (OpenAI, Anthropic, Ollama, Bedrock, etc.)
tests[] — each test case provides vars (template inputs) and assert[] (assertions to check)
defaultTest — shared assertions applied to every test case (e.g., latency guards)
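To make the `{{variable}}` interpolation concrete, here is a deliberately simplified renderer. Promptfoo itself uses Nunjucks templating, which supports much more (filters, conditionals, loops); `renderTemplate` below is a hypothetical stand-in that handles plain placeholders only.

```javascript
// Replace each {{name}} placeholder with the matching entry in vars.
// Unknown placeholders are left untouched so missing inputs stay visible.
// Simplified sketch: real Nunjucks templates support far more syntax.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const rendered = renderTemplate(
  "Classify the following as positive/negative/neutral: {{text}}",
  { text: "I love this product!" }
);
console.log(rendered);
// Classify the following as positive/negative/neutral: I love this product!
```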

Assertion types

Deterministic (no LLM call, free)

Type	Description
`equals`	Exact string match
`contains` / `icontains`	Substring (case-sensitive / insensitive)
`contains-any` / `contains-all`	At least one / all of a list
`regex`	Pattern match
`starts-with`	Prefix check
`is-json` / `contains-json`	Valid JSON validation
`latency`	Response time < threshold (ms)
`cost`	Inference cost < threshold ($)
`javascript`	Inline JS function returning bool/score
`python`	Inline or file-based Python returning bool/score
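The deterministic checks in the table are cheap because each one reduces to a plain string or parsing operation. The sketch below shows what a few of them amount to; the `checks` map and `runAssertion` are illustrative names, not promptfoo internals, and edge-case behavior in the real tool may differ.

```javascript
// Minimal stand-ins for a few deterministic assertion types.
const checks = {
  equals: (output, value) => output === value,
  contains: (output, value) => output.includes(value),
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
  "contains-any": (output, values) => values.some((v) => output.includes(v)),
  "contains-all": (output, values) => values.every((v) => output.includes(v)),
  regex: (output, pattern) => new RegExp(pattern).test(output),
  "starts-with": (output, prefix) => output.startsWith(prefix),
  "is-json": (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  },
};

// Dispatch on the assertion type, mirroring the { type, value } shape used in the config.
function runAssertion(assertion, output) {
  const check = checks[assertion.type];
  if (!check) throw new Error(`Unsupported assertion type: ${assertion.type}`);
  return check(output, assertion.value);
}

console.log(runAssertion({ type: "icontains", value: "positive" }, "Positive")); // true
console.log(runAssertion({ type: "is-json" }, "{not json}")); // false
```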

Model-graded (LLM call, costs tokens)

Type	Description
`llm-rubric`	Freeform rubric evaluated by judge LLM
`factuality`	Checks factual agreement with reference
`g-eval`	G-Eval (see below) — forces the judge to reason step-by-step before scoring
`answer-relevance`	Is the answer relevant to the question?
`similar`	Embedding cosine similarity $\geq$ threshold (an embedding call, not a judge LLM call)
`context-faithfulness`	(RAG) Answer is grounded in provided context
`context-relevance`	(RAG) Retrieved context is relevant to query
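For the `similar` row, the comparison itself is ordinary cosine similarity between two embedding vectors. The sketch below assumes the embeddings have already been fetched; `cosineSimilarity` and `similarPasses` are hypothetical helpers and the toy vectors are made up for illustration.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The assertion passes when the similarity reaches the threshold.
function similarPasses(a, b, threshold) {
  return cosineSimilarity(a, b) >= threshold;
}

console.log(cosineSimilarity([1, 1], [1, 0]).toFixed(4)); // 0.7071
console.log(similarPasses([1, 1], [1, 0], 0.8)); // false
```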

How LLM-as-judge works internally

For llm-rubric, promptfoo constructs a judge prompt along these lines (paraphrased, not the verbatim default):

You are an expert evaluator. Given the following:
  - Original prompt: {prompt}
  - Model output: {output}
  - Evaluation criterion: {rubric}

Does the output meet the criterion?
Respond with JSON: {"pass": true/false, "score": 0.0-1.0, "reason": "..."}

Key design choices:

The judge model is configurable via the provider field on the assertion
You can override the judge prompt entirely via rubricPrompt
g-eval uses a different internal prompt (see G-Eval section below)
If no judge is set, promptfoo falls back to a default grading provider (an OpenAI model when OPENAI_API_KEY is available). Set the judge explicitly, ideally to a different model family than the one under test, to avoid self-preference bias

assert:
  - type: llm-rubric
    value: "Response is professional and contains no profanity"
    provider: anthropic:messages:claude-3-5-sonnet-20241022  # explicit judge model
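The judge round trip can be sketched in two steps: fill the template, then parse the JSON verdict defensively, since judge models sometimes wrap the JSON in extra prose. The template text below follows the paraphrased prompt in this section; `buildJudgePrompt` and `parseVerdict` are hypothetical helpers, not promptfoo functions.

```javascript
const JUDGE_TEMPLATE = [
  "You are an expert evaluator. Given the following:",
  "  - Original prompt: {prompt}",
  "  - Model output: {output}",
  "  - Evaluation criterion: {rubric}",
  "",
  "Does the output meet the criterion?",
  'Respond with JSON: {"pass": true/false, "score": 0.0-1.0, "reason": "..."}',
].join("\n");

// Fill all three slots in a single pass so text inside one field
// cannot be mistaken for another placeholder.
function buildJudgePrompt(fields) {
  return JUDGE_TEMPLATE.replace(/\{(prompt|output|rubric)\}/g, (_, key) => String(fields[key]));
}

// Extract the outermost {...} span and validate it; fail closed on anything unparseable.
function parseVerdict(raw) {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { pass: false, score: 0, reason: "Judge reply contained no JSON object" };
  }
  try {
    const parsed = JSON.parse(raw.slice(start, end + 1));
    const pass = parsed.pass === true;
    const score =
      typeof parsed.score === "number" ? Math.min(1, Math.max(0, parsed.score)) : pass ? 1 : 0;
    return { pass, score, reason: String(parsed.reason ?? "") };
  } catch (err) {
    return { pass: false, score: 0, reason: "Judge reply was not valid JSON" };
  }
}

const judgePrompt = buildJudgePrompt({
  prompt: "Reply to the customer complaint.",
  output: "We are sorry for the trouble and will refund you today.",
  rubric: "Response is professional and contains no profanity",
});
const verdict = parseVerdict('Here you go: {"pass": true, "score": 0.9, "reason": "Polite tone"}');
console.log(verdict.pass, verdict.score); // true 0.9
```

Failing closed on an unparseable reply is a design choice for this sketch: a broken judge response then shows up as a test failure rather than a silent pass.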

Config example

# promptfooconfig.yaml
prompts:
  - "Classify the following as positive/negative/neutral: {{text}}"
 
providers:
  - openai:gpt-4o-mini
  - ollama:llama3.1:8b
 
defaultTest:
  assert:
    - type: latency
      threshold: 3000
 
tests:
  - vars:
      text: "I love this product!"
    assert:
      - type: icontains
        value: "positive"
  - vars:
      text: "This is the worst experience I've had"
    assert:
      - type: icontains
        value: "negative"
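To see how much work this config implies, the sketch below mirrors it as a plain object and counts outputs and assertion checks. It assumes `defaultTest` assertions are simply combined with each test's own assertions; the exact merge order inside promptfoo is an assumption here, and `effectiveAssertions` is a hypothetical helper.

```javascript
// Plain-object mirror of the YAML config above.
const config = {
  prompts: ["Classify the following as positive/negative/neutral: {{text}}"],
  providers: ["openai:gpt-4o-mini", "ollama:llama3.1:8b"],
  defaultTest: { assert: [{ type: "latency", threshold: 3000 }] },
  tests: [
    { vars: { text: "I love this product!" }, assert: [{ type: "icontains", value: "positive" }] },
    {
      vars: { text: "This is the worst experience I've had" },
      assert: [{ type: "icontains", value: "negative" }],
    },
  ],
};

// Shared assertions first, then the test's own (ordering is an assumption).
function effectiveAssertions(defaultTest, test) {
  return [...(defaultTest.assert ?? []), ...(test.assert ?? [])];
}

const outputCount = config.prompts.length * config.providers.length * config.tests.length;
const checksPerPass = config.tests.reduce(
  (sum, test) => sum + effectiveAssertions(config.defaultTest, test).length,
  0
);
const totalChecks = checksPerPass * config.prompts.length * config.providers.length;

console.log(outputCount); // 1 * 2 * 2 = 4 model outputs
console.log(totalChecks); // 4 outputs, 2 assertions each = 8 checks
```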

Composite assertion pattern

A reliable pattern for structured-output tasks is to layer assertions from cheapest to most specific:

assert:
  - type: is-json                          # gate: must be valid JSON
  - type: javascript
    value: |
      const parsed = JSON.parse(output);
      return (
        parsed.hasOwnProperty('status') &&
        ['approved', 'rejected', 'pending'].includes(parsed.status) &&
        typeof parsed.confidence === 'number' &&
        parsed.confidence >= 0 && parsed.confidence <= 1
      );

Pattern: format-gate → field-existence → field-value → semantic/domain logic. Each layer catches a different failure mode. Keep the layers as separate assertions (or return a reason from the JavaScript check) so the report shows which layer failed.
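One way to make the failing layer visible is to return a structured result instead of a bare boolean. The sketch below is a standalone function, not a drop-in assertion; it assumes a `{ pass, score, reason }` result shape, which matches what promptfoo's JavaScript assertions can return as far as I know. `checkStructuredOutput` and the layer names are illustrative.

```javascript
const ALLOWED_STATUSES = ["approved", "rejected", "pending"];

// Build a failing result that names the layer that rejected the output.
function fail(layer, detail) {
  return { pass: false, score: 0, reason: `${layer}: ${detail}` };
}

function checkStructuredOutput(output) {
  // Layer 1: format gate
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return fail("format", "output is not valid JSON");
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return fail("format", "output is not a JSON object");
  }
  // Layer 2: field existence
  for (const field of ["status", "confidence"]) {
    if (!(field in parsed)) return fail("field-existence", `missing "${field}"`);
  }
  // Layer 3: field values
  if (!ALLOWED_STATUSES.includes(parsed.status)) {
    return fail("field-value", `unexpected status "${parsed.status}"`);
  }
  if (typeof parsed.confidence !== "number" || parsed.confidence < 0 || parsed.confidence > 1) {
    return fail("field-value", "confidence must be a number in [0, 1]");
  }
  return { pass: true, score: 1, reason: "all layers passed" };
}

console.log(checkStructuredOutput('{"status": "maybe", "confidence": 0.5}').reason);
// field-value: unexpected status "maybe"
```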

G-Eval: structured judging with reasoning steps

G-Eval (Liu et al., 2023) is a technique for making LLM-as-judge more reliable. The problem it solves: when you ask a judge model “rate this output 1-5,” the score is noisy — the model jumps to a number without systematic thinking.

G-Eval fixes this by structuring the judge prompt into explicit steps:

State the criteria — what dimensions are being evaluated (e.g., fluency, relevance, coherence)
List evaluation steps: force the judge to check each criterion one by one (“Step 1: check if the answer addresses the question. Step 2: check if it introduces unsupported claims…”)
Generate reasoning — the judge writes out its thinking before scoring
Produce final score — only after reasoning is complete

This is essentially Chain-of-Thought prompting (asking the model to show its work before giving a final answer) applied to evaluation. The reasoning step is meant to reduce score variance across runs, because the model anchors on its own explicit analysis rather than producing a gut-feeling number.

In promptfoo, the g-eval assertion type handles this automatically — you provide the criteria, and it constructs the multi-step judge prompt internally.
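The four steps can be sketched as a prompt builder plus a score extractor. This is an illustration of the structure only, not the prompt promptfoo generates internally; `buildGEvalPrompt`, `extractScore`, and the "Score:" output convention are assumptions made for this example.

```javascript
// Assemble a judge prompt that puts criteria and steps before the scoring instruction.
function buildGEvalPrompt({ criteria, steps, input, output, scale = [1, 5] }) {
  const numberedSteps = steps.map((step, i) => `Step ${i + 1}: ${step}`).join("\n");
  return [
    `Evaluation criteria: ${criteria}`,
    "",
    "Evaluation steps:",
    numberedSteps,
    "",
    `Input:\n${input}`,
    "",
    `Output to evaluate:\n${output}`,
    "",
    "Write your reasoning for each step first.",
    `Then, on the final line, give a score from ${scale[0]} to ${scale[1]} as "Score: <number>".`,
  ].join("\n");
}

// Read the score only from the final line, i.e. after the reasoning.
function extractScore(reply) {
  const match = reply.trim().match(/Score:\s*(\d+(?:\.\d+)?)$/);
  return match ? Number(match[1]) : null;
}

const gevalPrompt = buildGEvalPrompt({
  criteria: "relevance and absence of unsupported claims",
  steps: [
    "check if the answer addresses the question",
    "check if it introduces unsupported claims",
  ],
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
});
console.log(extractScore("Step 1: it answers the question.\nStep 2: no unsupported claims.\nScore: 5")); // 5
```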

Repeated runs for non-determinism

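# Run each test case 3 times (same effect as `promptfoo eval --repeat 3`)
# A majority rule such as "pass if at least 2 of 3 runs succeed" is applied
# afterwards, when post-processing the per-run results
evaluateOptions:
  repeat: 3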
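The "2 of 3" rule is simple arithmetic over the per-run outcomes. A minimal sketch, assuming you have collected one boolean per repetition; `passesRepeated` is a hypothetical helper, not a promptfoo function.

```javascript
// A test case passes when the fraction of passing runs reaches the threshold.
function passesRepeated(runResults, threshold) {
  if (runResults.length === 0) throw new Error("At least one run is required");
  const passed = runResults.filter(Boolean).length;
  return passed / runResults.length >= threshold;
}

console.log(passesRepeated([true, true, false], 0.66)); // true:  2/3 ≈ 0.667 >= 0.66
console.log(passesRepeated([true, false, false], 0.66)); // false: 1/3 ≈ 0.333
console.log(passesRepeated([true, true, false], 0.67)); // false: 0.667 < 0.67
```

The last line shows why the threshold is written as 0.66 rather than 0.67: rounding 2/3 up would turn a "2 of 3" rule into "3 of 3".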

Red-teaming

Promptfoo has first-class red-teaming support — 142+ plugins covering prompt injection, jailbreaks, PII leakage, toxicity, and the OWASP LLM Top 10. This is one of its strongest differentiators over other eval tools.

CI/CD integration

Promptfoo is designed to run in CI pipelines as a quality gate:

# GitLab CI example
prompt-eval:
  stage: test
  rules:
    - changes:
        - prompts/**
        - promptfooconfig.yaml
  script:
    - npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
    - |
      FAILURES=$(jq '.results.stats.failures' results.json)
      # Use an if-block: `[ ... ] && exit 1` returns non-zero when there are no failures
      if [ "$FAILURES" -gt 0 ]; then exit 1; fi
  artifacts:
    when: always  # keep the report even when the gate fails the job
    paths: [results.json]

Best practices:

Only trigger on paths that change prompts or config
Persist promptfoo’s result cache between pipeline runs to avoid redundant API calls
Export JSON (machine-readable for quality gate) and HTML (human-readable artifact)
Set --max-failures N for fast-fail on catastrophic regressions
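The gate logic from the CI job can also live in a small script, which makes it easier to tolerate a fixed number of failures or to fail closed on a malformed report. The sketch below assumes the `results.stats.failures` path used by the `jq` expression above; `gateExitCode` is a hypothetical helper.

```javascript
// Map a parsed results report to a process exit code.
// 0 = gate passed, 1 = too many failures, 2 = report malformed (fail closed).
function gateExitCode(report, maxFailures = 0) {
  const stats = report?.results?.stats;
  if (!stats || typeof stats.failures !== "number") return 2;
  return stats.failures > maxFailures ? 1 : 0;
}

// Sample reports standing in for a parsed results.json.
const cleanReport = { results: { stats: { successes: 10, failures: 0 } } };
const failingReport = { results: { stats: { successes: 7, failures: 3 } } };

console.log(gateExitCode(cleanReport)); // 0
console.log(gateExitCode(failingReport)); // 1
console.log(gateExitCode(failingReport, 5)); // 0
console.log(gateExitCode({})); // 2
```

In a pipeline, the script would read `results.json` with `fs.readFileSync`, parse it, and pass the result of `gateExitCode` to `process.exit`.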

Comparison with alternatives

Dimension	promptfoo	LangSmith	Braintrust	DeepEval	Ragas
Primary abstraction	YAML eval config	Trace + dataset	Experiment (code-centric)	pytest suite	RAG metrics library
Interface	CLI + web UI	Web UI + SDK	Web UI + SDK	Python API	Python API
Tracing/observability	No (eval only)	Core feature	Core feature	No	No
Open source	Full	Partial	Partial	Full	Full
Red-teaming	First-class (142+ plugins)	No	No	Limited	No
RAG-specific metrics	Partial	Partial	Partial	Good	Best-in-class
CI/CD native	First-class	Possible	Good	Via pytest	Possible
Multi-model comparison	Built-in (matrix evals)	Manual	Manual	Manual	No

When to choose promptfoo: Prompt iteration, security red-teaming, multi-model comparison, CI/CD quality gates. Best for teams that want eval-as-code without a vendor platform.

When to choose something else:

Deep in the LangChain ecosystem → LangSmith (native tracing + eval)
Need production observability + eval in one tool → Braintrust
Python/pytest shop wanting minimal config → DeepEval
Primary concern is RAG pipeline quality → Ragas
