Promptfoo

Promptfoo is an open-source, CLI-first, YAML-driven eval framework for LLM prompts. It runs a matrix of (prompt × provider × test case), executes assertions on each output, and produces a pass/fail report.

Architecture

The execution model is a three-dimensional matrix: prompts × providers × test cases. Define one prompt, three providers (e.g., GPT-4o-mini, Claude Haiku, Ollama llama3), and 50 test cases — promptfoo runs all 150 combinations, executes assertions on each output, and produces a pass/fail report (JSON for CI gates, HTML for human review).
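The matrix expansion above can be sketched in a few lines of Python. This is only an illustration of the counting, not promptfoo's own scheduler; the provider strings are plain labels chosen for the example rather than verified provider IDs.

```python
from itertools import product

# One prompt, three providers, fifty test cases (all values are illustrative).
prompts = ["Classify the sentiment of: {{text}}"]
providers = ["gpt-4o-mini", "claude-haiku", "ollama-llama3"]
test_cases = [{"text": f"sample input {i}"} for i in range(50)]

# Every (prompt, provider, test case) triple becomes one evaluation run.
runs = list(product(prompts, providers, test_cases))
print(len(runs))  # 1 * 3 * 50 = 150
```

Adding a second prompt would double the run count, which is worth keeping in mind when the providers are paid APIs.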

The config file (promptfooconfig.yaml) declares four top-level sections:

  • prompts[] — template strings or file paths (with {{variable}} interpolation)
  • providers[] — which models to call (OpenAI, Anthropic, Ollama, Bedrock, etc.)
  • tests[] — each test case provides vars (template inputs) and assert[] (assertions to check)
  • defaultTest — shared assertions applied to every test case (e.g., latency guards)
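As a rough sketch of what {{variable}} interpolation does with a test case's vars, the hypothetical render helper below substitutes placeholders with a regular expression. Promptfoo's real templating is considerably richer (filters, conditionals, loops); this stand-in only covers plain variable names.

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching entry in variables."""
    def substitute(match: re.Match) -> str:
        # A missing variable raises KeyError, which surfaces typos early.
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

prompt = render(
    "Classify the following as positive/negative/neutral: {{text}}",
    {"text": "I love this product!"},
)
print(prompt)
```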

Assertion types

Deterministic (no LLM call, free)

| Type | Description |
|---|---|
| equals | Exact string match |
| contains / icontains | Substring (case-sensitive / insensitive) |
| contains-any / contains-all | At least one / all of a list |
| regex | Pattern match |
| starts-with | Prefix check |
| is-json / contains-json | Output is valid JSON / contains valid JSON |
| latency | Response time < threshold (ms) |
| cost | Inference cost < threshold ($) |
| javascript | Inline JS function returning bool/score |
| python | Inline or file-based Python returning bool/score |
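To make the semantics of the string-based checks concrete, here is a minimal Python sketch of a few of them. The CHECKS table and run_assertion helper are hypothetical names for this example and do not mirror promptfoo's internals; latency and cost are left out because they depend on run metadata rather than the output text.

```python
import json
import re

def is_json(output: str) -> bool:
    """True if the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Map each assertion type to a predicate over (output, expected value).
CHECKS = {
    "equals": lambda out, val: out == val,
    "contains": lambda out, val: val in out,
    "icontains": lambda out, val: val.lower() in out.lower(),
    "contains-any": lambda out, val: any(v in out for v in val),
    "contains-all": lambda out, val: all(v in out for v in val),
    "regex": lambda out, val: re.search(val, out) is not None,
    "starts-with": lambda out, val: out.startswith(val),
    "is-json": lambda out, val: is_json(out),
}

def run_assertion(assertion: dict, output: str) -> bool:
    """Look up the check by type and apply it to the model output."""
    return CHECKS[assertion["type"]](output, assertion.get("value"))
```

For instance, an icontains check for "positive" accepts the output "Positive.", while a plain contains check rejects it.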

Model-graded (LLM call, costs tokens)

| Type | Description |
|---|---|
| llm-rubric | Freeform rubric evaluated by judge LLM |
| factuality | Checks factual agreement with reference |
| g-eval | G-Eval (see below): forces the judge to reason step-by-step before scoring |
| answer-relevance | Is the answer relevant to the question? |
| similar | Embedding cosine similarity threshold |
| context-faithfulness | (RAG) Answer is grounded in provided context |
| context-relevance | (RAG) Retrieved context is relevant to query |
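The similar row boils down to a cosine-similarity comparison between two embedding vectors. The sketch below shows only that arithmetic, assuming the embeddings have already been fetched from some embedding model; the function names and the threshold value are illustrative.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_similar(output_vec: list, reference_vec: list, threshold: float) -> bool:
    """Pass when the embeddings are at least `threshold` similar."""
    return cosine_similarity(output_vec, reference_vec) >= threshold
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so the threshold sets how much semantic drift from the reference is tolerated.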

How LLM-as-judge works internally

For llm-rubric, promptfoo constructs a judge prompt along these lines (a simplified paraphrase, not the verbatim default):

You are an expert evaluator. Given the following:
  - Original prompt: {prompt}
  - Model output: {output}
  - Evaluation criterion: {rubric}

Does the output meet the criterion?
Respond with JSON: {"pass": true/false, "score": 0.0-1.0, "reason": "..."}

Key design choices:

  • The judge model is configurable via the provider field on the assertion
  • You can override the judge prompt entirely via rubricPrompt
  • g-eval uses a different internal prompt (see G-Eval section below)
  • If no judge is set, promptfoo falls back to a default grading provider (or to defaultTest.options.provider if you set one). Set the judge explicitly, and pick a different model from the one under test, to avoid self-preference bias

assert:
  - type: llm-rubric
    value: "Response is professional and contains no profanity"
    provider: anthropic:claude-3-5-sonnet  # explicit judge model
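The judge round-trip described in this section can be sketched as two small functions: one fills the paraphrased template shown earlier, the other parses the judge's JSON reply. Both function names are hypothetical, and the actual model call is deliberately left out; malformed replies are treated as failures, which is one reasonable policy rather than promptfoo's documented behavior.

```python
import json

JUDGE_TEMPLATE = """You are an expert evaluator. Given the following:
  - Original prompt: {prompt}
  - Model output: {output}
  - Evaluation criterion: {rubric}

Does the output meet the criterion?
Respond with JSON: {{"pass": true/false, "score": 0.0-1.0, "reason": "..."}}"""

def build_judge_prompt(prompt: str, output: str, rubric: str) -> str:
    """Fill the judge template with the prompt, the model output, and the rubric."""
    return JUDGE_TEMPLATE.format(prompt=prompt, output=output, rubric=rubric)

def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; anything that is not a JSON object counts as a fail."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        return {"pass": False, "score": 0.0, "reason": "judge reply was not a JSON object"}
    return {
        "pass": bool(verdict.get("pass", False)),
        "score": float(verdict.get("score", 0.0)),
        "reason": str(verdict.get("reason", "")),
    }
```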

Config example

# promptfooconfig.yaml
prompts:
  - "Classify the following as positive/negative/neutral: {{text}}"
 
providers:
  - openai:gpt-4o-mini
  - ollama:llama3.1:8b
 
defaultTest:
  assert:
    - type: latency
      threshold: 3000
 
tests:
  - vars:
      text: "I love this product!"
    assert:
      - type: icontains
        value: "positive"
  - vars:
      text: "This is the worst experience I've had"
    assert:
      - type: icontains
        value: "negative"
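One way to picture how defaultTest interacts with the individual tests in this config: each test ends up with the shared assertions plus its own. The helper below is a hypothetical sketch of that merge; the ordering of the combined list is an assumption made for the example.

```python
def effective_assertions(default_test: dict, test: dict) -> list:
    """Shared assertions from defaultTest followed by the test's own assertions."""
    return list(default_test.get("assert", [])) + list(test.get("assert", []))

default_test = {"assert": [{"type": "latency", "threshold": 3000}]}
test = {
    "vars": {"text": "I love this product!"},
    "assert": [{"type": "icontains", "value": "positive"}],
}

merged = effective_assertions(default_test, test)
print([a["type"] for a in merged])  # ['latency', 'icontains']
```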

Composite assertion pattern

The most powerful pattern for structured output tasks:

assert:
  - type: is-json                          # gate: must be valid JSON
  - type: javascript
    value: |
      const parsed = JSON.parse(output);
      return (
        parsed.hasOwnProperty('status') &&
        ['approved', 'rejected', 'pending'].includes(parsed.status) &&
        typeof parsed.confidence === 'number' &&
        parsed.confidence >= 0 && parsed.confidence <= 1
      );

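Pattern: format-gate → field-existence → field-value → semantic/domain logic. Each layer catches a different failure mode. The report shows which assertion failed, so here it distinguishes the is-json gate from the JavaScript check; split the JavaScript check into separate assertions if you need to know which field rule broke.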
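The same layered check can be written as a file-based Python assertion that names the failing layer in its reason. This sketch assumes the get_assert(output, context) entry point and the pass/score/reason result shape that promptfoo's Python assertion documentation describes; verify both against the version you run. The field names come from the JavaScript example above.

```python
import json

ALLOWED_STATUSES = {"approved", "rejected", "pending"}

def fail(reason: str) -> dict:
    """Build a failing result that records which layer rejected the output."""
    return {"pass": False, "score": 0.0, "reason": reason}

def get_assert(output, context):
    # Layer 1: format gate.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return fail("format gate: output is not valid JSON")
    # Layer 2: field existence.
    if not isinstance(parsed, dict) or "status" not in parsed or "confidence" not in parsed:
        return fail("field existence: expected 'status' and 'confidence'")
    # Layer 3: field values.
    status = parsed["status"]
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        return fail(f"field value: unexpected status {status!r}")
    confidence = parsed["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        return fail("field value: confidence must be a number in [0, 1]")
    return {"pass": True, "score": 1.0, "reason": "all layers passed"}
```

Returning early at each layer keeps the reason specific, at the cost of reporting only the first problem found.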

G-Eval: structured judging with reasoning steps

G-Eval (Liu et al., 2023) is a technique for making LLM-as-judge more reliable. The problem it solves: when you ask a judge model “rate this output 1-5,” the score is noisy — the model jumps to a number without systematic thinking.

G-Eval fixes this by structuring the judge prompt into explicit steps:

  1. State the criteria — what dimensions are being evaluated (e.g., fluency, relevance, coherence)
  2. List evaluation steps — force the judge to check each criterion one by one (“Step 1: check if the answer addresses the question. Step 2: check if it introduces unsupported claims…”)
  3. Generate reasoning — the judge writes out its thinking before scoring
  4. Produce final score — only after reasoning is complete

This is essentially Chain-of-Thought (= prompting the model to show its work before giving a final answer) applied to evaluation. The reasoning step reduces score variance across runs because the model anchors on its own explicit analysis rather than producing a gut-feeling number.

In promptfoo, the g-eval assertion type handles this automatically — you provide the criteria, and it constructs the multi-step judge prompt internally.
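To show the shape of a G-Eval style judge prompt, the hypothetical builder below lays out criteria, numbered evaluation steps, and a reason-then-score instruction. It is an illustration of the four-step structure described above, not the prompt promptfoo generates internally; the 1 to 5 scale and the "Score:" line format are assumptions for the example.

```python
def build_g_eval_prompt(criteria: str, steps: list, question: str, answer: str) -> str:
    """Assemble a judge prompt that asks for step-by-step reasoning before a score."""
    numbered = "\n".join(f"Step {i}: {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Work through each step and write out your reasoning first.\n"
        "Only then give a final score from 1 to 5 on the last line as 'Score: <n>'."
    )

g_eval_prompt = build_g_eval_prompt(
    criteria="Relevance and factual grounding",
    steps=[
        "Check if the answer addresses the question.",
        "Check if it introduces unsupported claims.",
    ],
    question="What does promptfoo do?",
    answer="It evaluates LLM prompts against test cases.",
)
```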

Repeated runs for non-determinism

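# Run each test case 3 times, pass if >= 2/3 succeed.
# Confirm these option names against your promptfoo version; repetition is
# also available through the eval command's --repeat flag.
defaultTest:
  options:
    numRepetitions: 3
    passThreshold: 0.66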
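The pass rule in the comment above is simple arithmetic: the fraction of passing repetitions must reach the threshold. A minimal sketch, with a hypothetical function name:

```python
def passes_with_repeats(results: list, pass_threshold: float) -> bool:
    """Pass when the share of successful repetitions reaches the threshold."""
    pass_rate = sum(results) / len(results)  # True counts as 1, False as 0
    return pass_rate >= pass_threshold

print(passes_with_repeats([True, True, False], 0.66))   # 2/3 is about 0.667, so this passes
print(passes_with_repeats([True, False, False], 0.66))  # 1/3 is about 0.333, so this fails
```

Note that the threshold is written as 0.66 rather than 0.67 so that exactly two of three passes clears it.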

Red-teaming

Promptfoo has first-class red-teaming support — 142+ plugins covering prompt injection, jailbreaks, PII leakage, toxicity, and the OWASP LLM Top 10. This is one of its strongest differentiators over other eval tools.

CI/CD integration

Promptfoo is designed to run in CI pipelines as a quality gate:

# GitLab CI example
prompt-eval:
  stage: test
  rules:
    - changes:
        - prompts/**
        - promptfooconfig.yaml
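  script:
    - npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
    - |
      # Explicit gate on the failure count (requires jq in the job image).
      # Use if/then: a bare `[ ... ] && exit 1` returns non-zero when there are
      # no failures, which would fail the job on a clean run.
      FAILURES=$(jq '.results.stats.failures' results.json)
      if [ "$FAILURES" -gt 0 ]; then exit 1; fi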
  artifacts:
    paths: [results.json]
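If jq is not available in the CI image, the same gate can be expressed in Python. This sketch assumes the results.stats.failures path used by the jq expression in the job above; gate_exit_code is a hypothetical helper name.

```python
import json
import sys

def gate_exit_code(report: dict) -> int:
    """Return 1 if the eval report records any failures, otherwise 0."""
    failures = int(report["results"]["stats"]["failures"])
    if failures > 0:
        print(f"{failures} failing test(s)", file=sys.stderr)
        return 1
    return 0

# Typical use in a CI step (left as a comment so the sketch has no side effects):
#   with open("results.json", encoding="utf-8") as fh:
#       sys.exit(gate_exit_code(json.load(fh)))
```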

Best practices:

  • Only trigger on paths that change prompts or config
  • Persist promptfoo’s result cache between pipeline runs to avoid redundant API calls
  • Export JSON (machine-readable for quality gate) and HTML (human-readable artifact)
  • Set --max-failures N for fast-fail on catastrophic regressions

Comparison with alternatives

| Dimension | promptfoo | LangSmith | Braintrust | DeepEval | Ragas |
|---|---|---|---|---|---|
| Primary abstraction | YAML eval config | Trace + dataset | Experiment (code-centric) | pytest suite | RAG metrics library |
| Interface | CLI + web UI | Web UI + SDK | Web UI + SDK | Python API | Python API |
| Tracing/observability | No (eval only) | Core feature | Core feature | No | No |
| Open source | Full | Partial | Partial | Full | Full |
| Red-teaming | First-class (142+ plugins) | No | No | Limited | No |
| RAG-specific metrics | Partial | Partial | Partial | Good | Best-in-class |
| CI/CD native | First-class | Possible | Good | Via pytest | Possible |
| Multi-model comparison | Built-in (matrix evals) | Manual | Manual | Manual | No |

When to choose promptfoo: Prompt iteration, security red-teaming, multi-model comparison, CI/CD quality gates. Best for teams that want eval-as-code without a vendor platform.

When to choose something else:

  • Deep in the LangChain ecosystem → LangSmith (native tracing + eval)
  • Need production observability + eval in one tool → Braintrust
  • Python/pytest shop wanting minimal config → DeepEval
  • Primary concern is RAG pipeline quality → Ragas

See also