Prompt flow

Overview

Prompt flow is a development toolkit created by Microsoft for building, testing, evaluating, and deploying LLM-based applications. Released in 2023 and integrated with Azure AI Studio, the framework provides visual workflow design, systematic evaluation capabilities, and production deployment tools. Prompt flow addresses the challenge of moving from experimental prompt engineering to production-ready LLM applications through structured workflows and comprehensive testing infrastructure.

The framework employs a Directed Acyclic Graph (DAG) architecture where applications are composed of connected nodes performing specific operations (LLM calls, Python functions, prompts). This modular approach enables visual design, systematic testing, version control, and performance optimization. Prompt flow supports both local development and cloud deployment with Azure integration.

Key technical components covered:

  • DAG architecture and node types
  • Flow definition and YAML structure
  • Evaluation framework and metrics
  • Prompt variants and A/B testing
  • Deployment options and serving
  • VS Code extension and debugging
  • Connections and LLM provider configuration
  • Tracing and observability integration
  • Version history and package structure

DAG Architecture and Node Types

Prompt flow implements applications as Directed Acyclic Graphs (DAGs) where nodes represent operations and edges represent data flow:

Flow definition structure:

$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
 
inputs:
  review_text:
    type: string
 
outputs:
  sentiment:
    type: string
    reference: ${format_output.output}
 
nodes:
- name: preprocess
  type: python
  source:
    type: code
    path: preprocess.py
  inputs:
    text: ${inputs.review_text}
 
- name: classify
  type: llm
  source:
    type: code
    path: classify.jinja2
  inputs:
    deployment_name: gpt-4
    temperature: 0.0
    processed_text: ${preprocess.output}
  connection: openai_connection
 
- name: format_output
  type: python
  source:
    type: code
    path: format.py
  inputs:
    classification: ${classify.output}

Node types available in Prompt flow:

LLM Node interacts with language models for text generation or processing:

- name: generate_response
  type: llm
  source:
    type: code
    path: prompt_template.jinja2
  inputs:
    deployment_name: gpt-4  # Azure OpenAI deployment name; with a plain OpenAI connection, set "model" instead
    temperature: 0.7
    max_tokens: 500
    user_query: ${inputs.query}
  connection: openai_connection
  api: chat  # or 'completion'

Python Node executes custom Python code for data processing:

import re

from promptflow import tool
 
@tool
def preprocess(text: str) -> str:
    """Clean and normalize input text"""
    text = text.strip().lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

Prompt Node defines reusable prompt templates using Jinja2:

system:
You are a sentiment analysis expert.
 
user:
Classify the sentiment of this review: "{{review_text}}"
 
Options: positive, negative, neutral
Classification:

Condition Node implements conditional logic controlling flow:

from promptflow import tool
 
@tool
def route_by_confidence(confidence: float) -> str:
    """Route to different paths based on confidence"""
    if confidence > 0.9:
        return "high_confidence_path"
    elif confidence > 0.5:
        return "medium_confidence_path"
    else:
        return "human_review_path"

Tool Node integrates external tools and APIs:

  • Web search tools
  • Database queries
  • Custom API calls
  • File operations

Aggregation Node processes collections of outputs computing summary statistics:

from typing import List
from promptflow import tool, log_metric
 
@tool
def aggregate_results(results: List[str]) -> dict:
    """Aggregate evaluation results"""
    total = len(results)
    correct = results.count("Correct")
    accuracy = correct / total if total > 0 else 0
    
    log_metric("accuracy", accuracy)
    
    return {
        "total": total,
        "correct": correct,
        "accuracy": accuracy
    }

Node execution follows topological order, so a node runs only after every node it depends on has finished. The flow executor manages:

  • Dependency resolution
  • Parallel execution where possible
  • Error propagation and handling
  • Output passing between nodes
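
The ordering and parallelism described above can be illustrated with a small sketch. This is not Prompt flow's executor: it uses Python's standard graphlib module on a hand-written dependency map that mirrors the example flow. The `dependencies` dict and the `execution_batches` helper are names introduced here for illustration only.

```python
from graphlib import TopologicalSorter

# Node -> set of nodes it depends on, read off the ${node.output} references
# in the example flow. Flow inputs are not nodes, so they are omitted.
dependencies = {
    "preprocess": set(),
    "classify": {"preprocess"},
    "format_output": {"classify"},
}


def execution_batches(deps):
    """Yield groups of nodes whose dependencies have all completed.

    Nodes in the same group have no edges between them, so an executor
    is free to run them in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


batches = list(execution_batches(dependencies))
print(batches)
```

A linear flow like this one produces one node per batch. A flow with several independent nodes feeding one downstream node would put the independent nodes in the same batch.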

Evaluation Framework and Metrics

Prompt flow provides comprehensive evaluation capabilities for systematic testing:

Evaluation flow structure:

$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
 
inputs:
  prediction:
    type: string
  review_text:
    type: string
  ground_truth:
    type: string
 
outputs:
  accuracy:
    type: number
    reference: ${calculate_metrics.output.accuracy}
  detailed_metrics:
    type: object
    reference: ${calculate_metrics.output}
 
nodes:
- name: llm_judge
  type: python
  source:
    type: code
    path: llm_judge.py
  inputs:
    prediction: ${inputs.prediction}
    review_text: ${inputs.review_text}
 
- name: heuristic_check
  type: python
  source:
    type: code
    path: heuristic_eval.py
  inputs:
    prediction: ${inputs.prediction}
    review_text: ${inputs.review_text}
 
- name: ground_truth_check
  type: python
  source:
    type: code
    path: accuracy_check.py
  inputs:
    prediction: ${inputs.prediction}
    ground_truth: ${inputs.ground_truth}
 
- name: calculate_metrics
  type: python
  source:
    type: code
    path: aggregate_metrics.py
  inputs:
    judge_score: ${llm_judge.output}
    heuristic_score: ${heuristic_check.output}
    accuracy_score: ${ground_truth_check.output}
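
The flow above references accuracy_check.py and aggregate_metrics.py without showing them. The following is a minimal sketch of what those two files might contain, not the actual implementation. It assumes the judge node returns a dict with an "overall" key and the heuristic node returns a dict with a "keyword_alignment" key, as the evaluators in this section do. The unweighted mean used for "overall_quality" is an arbitrary choice, and the @tool decorator is left off so the functions run without Prompt flow installed.

```python
def ground_truth_check(prediction: str, ground_truth: str) -> float:
    """Return 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0


def calculate_metrics(judge_score: dict, heuristic_score: dict, accuracy_score: float) -> dict:
    """Merge the three upstream scores into one record for a single data line."""
    judge_overall = judge_score["overall"]
    alignment = heuristic_score["keyword_alignment"]
    return {
        # "accuracy" is the key the flow output ${calculate_metrics.output.accuracy} reads
        "accuracy": float(accuracy_score),
        "judge_overall": judge_overall,
        "keyword_alignment": alignment,
        # Hypothetical combined score: plain mean of the two ground-truth-free signals
        "overall_quality": (judge_overall + alignment) / 2,
    }


example = calculate_metrics(
    judge_score={"overall": 0.8},
    heuristic_score={"keyword_alignment": 0.5},
    accuracy_score=ground_truth_check(" Positive", "positive"),
)
print(example)
```

In a real flow each function would live in its own file and carry the @tool decorator so the node can load it.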

LLM-as-Judge implementation (no ground truth needed):

from promptflow import tool
from openai import OpenAI
 
@tool
def llm_as_judge(prediction: str, review_text: str) -> dict:
    """
    Use GPT-4 to evaluate sentiment classification quality
    
    Returns multi-dimensional quality scores
    """
    client = OpenAI()
    
    judge_prompt = f"""You are an expert evaluator of sentiment classifications.
 
Review Text: "{review_text}"
Predicted Sentiment: {prediction}
 
Evaluate on these dimensions (0-10 scale):
1. Accuracy: Does the classification match the review's actual sentiment?
2. Consistency: Would most humans agree with this classification?
3. Reasoning: Can you justify why this classification makes sense?
4. Nuance: Does it capture subtle sentiment indicators?
 
Respond in JSON:
{{
    "accuracy_score": 0-10,
    "consistency_score": 0-10,
    "reasoning_score": 0-10,
    "nuance_score": 0-10,
    "overall_quality": 0-10,
    "explanation": "detailed reasoning",
    "confidence": 0.0-1.0
}}
"""
 
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    
    import json
    result = json.loads(response.choices[0].message.content)
    
    # Normalize to 0-1 scale
    return {
        "accuracy": result['accuracy_score'] / 10,
        "consistency": result['consistency_score'] / 10,
        "reasoning": result['reasoning_score'] / 10,
        "nuance": result['nuance_score'] / 10,
        "overall": result['overall_quality'] / 10,
        "confidence": result['confidence'],
        "explanation": result['explanation']
    }

Built-in evaluators in Azure AI Studio:

  • RelevanceEvaluator: Measures response relevance to query
  • CoherenceEvaluator: Assesses logical flow and coherence
  • FluencyEvaluator: Evaluates grammatical correctness and readability
  • GroundednessEvaluator: Checks whether the response is grounded in the provided context
  • SimilarityEvaluator: Compares semantic similarity to reference

Custom evaluators:

from promptflow import tool
 
@tool
def custom_sentiment_evaluator(
    prediction: str,
    review_text: str
) -> dict:
    """
    Custom evaluation logic without ground truth
    """
    # Keyword-based evaluation
    positive_keywords = ['amazing', 'excellent', 'love', 'best', 'great']
    negative_keywords = ['terrible', 'awful', 'worst', 'waste', 'broken']
    
    text_lower = review_text.lower()
    pos_count = sum(1 for w in positive_keywords if w in text_lower)
    neg_count = sum(1 for w in negative_keywords if w in text_lower)
    
    # Calculate alignment
    pred_lower = prediction.lower().strip()
    if pred_lower == 'positive':
        alignment = pos_count / (pos_count + neg_count + 1)
    elif pred_lower == 'negative':
        alignment = neg_count / (pos_count + neg_count + 1)
    else:  # neutral
        alignment = 1.0 if (pos_count == neg_count) else 0.5
    
    return {
        "keyword_alignment": alignment,
        "positive_signals": pos_count,
        "negative_signals": neg_count
    }

Batch evaluation execution:

from promptflow import PFClient
 
pf = PFClient()
 
# Run main flow
base_run = pf.run(
    flow="./sentiment_flow",
    data="reviews_test.jsonl",
    column_mapping={
        "review_text": "${data.review_text}"
    }
)
 
# Run evaluation flow
eval_run = pf.run(
    flow="./evaluation_flow",
    data="reviews_test.jsonl",
    run=base_run,
    column_mapping={
        "prediction": "${run.outputs.sentiment}",
        "review_text": "${data.review_text}",
        "ground_truth": "${data.ground_truth}"
    }
)
 
# Get aggregated metrics
metrics = pf.get_metrics(eval_run)
print(f"Overall Quality: {metrics['overall_quality']:.2f}")
print(f"Accuracy: {metrics['accuracy']:.2%}")
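
The column mappings above expect each line of the data file to carry a review_text field and a ground_truth field. As a sketch of that layout, the snippet below writes and re-reads a small JSONL file using only the standard library. The three sample rows and the helper names are made up for illustration, and the file is written to the system temp directory rather than next to the flow.

```python
import json
import tempfile
from pathlib import Path

# Illustrative rows; a real test set would come from labelled reviews.
rows = [
    {"review_text": "This product is amazing!", "ground_truth": "positive"},
    {"review_text": "It's okay.", "ground_truth": "neutral"},
    {"review_text": "Terrible quality.", "ground_truth": "negative"},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


data_path = Path(tempfile.gettempdir()) / "reviews_test.jsonl"
write_jsonl(data_path, rows)

# Check that every line has the columns the mappings refer to.
loaded = read_jsonl(data_path)
missing = [r for r in loaded if not {"review_text", "ground_truth"} <= r.keys()]
print(len(loaded), "rows,", len(missing), "missing required columns")
```

Checking the columns before a batch run is cheaper than discovering a mapping error after the main flow has already consumed LLM calls.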

Prompt Variants and A/B Testing

Prompt variants enable systematic comparison of different prompt approaches:

Variant configuration in flow.yaml:

nodes:
  - name: classify_sentiment
    use_variants: true
 
node_variants:
  classify_sentiment:
    default_variant_id: simple
    variants:
      simple:
        node:
          type: llm
          source:
            type: code
            path: classify_simple.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.0
            review_text: ${inputs.review_text}
          connection: openai_connection
      
      few_shot:
        node:
          type: llm
          source:
            type: code
            path: classify_fewshot.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.0
            review_text: ${inputs.review_text}
          connection: openai_connection
      
      reasoning:
        node:
          type: llm
          source:
            type: code
            path: classify_reasoning.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.2
            review_text: ${inputs.review_text}
          connection: openai_connection

Prompt template examples:

classify_simple.jinja2:

system:
You are a sentiment classifier.
 
user:
Classify sentiment as: positive, negative, or neutral
 
Review: "{{review_text}}"
 
Sentiment:

classify_fewshot.jinja2:

system:
You are a sentiment classifier.
 
user:
Classify sentiment as: positive, negative, or neutral
 
Examples:
- "This is amazing!" → positive
- "It's okay." → neutral
- "Terrible quality." → negative
 
Review: "{{review_text}}"
 
Sentiment:

classify_reasoning.jinja2:

system:
You are an expert sentiment analyst.
 
user:
Analyze this review step-by-step:
 
Review: "{{review_text}}"
 
1. Identify key sentiment indicators (words, phrases, tone)
2. Weigh positive vs negative vs neutral signals
3. Classify as: positive, negative, or neutral
4. Provide confidence score (0-1)
 
Response format:
Indicators: [list key phrases]
Classification: [positive/negative/neutral]
Confidence: [0-1]
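
Unlike the other two variants, the reasoning template returns several labelled lines rather than a bare label, so a downstream node has to extract the classification before it can be compared with other variants. The following is one possible parser, written as a plain function. The function name, the returned keys, and the choice to return None for unparseable replies are assumptions, not part of Prompt flow.

```python
import re

VALID_LABELS = {"positive", "negative", "neutral"}

_LABEL_RE = re.compile(r"^\s*Classification:\s*\[?\s*(\w+)", re.IGNORECASE | re.MULTILINE)
_CONF_RE = re.compile(r"^\s*Confidence:\s*\[?\s*([0-9]*\.?[0-9]+)", re.IGNORECASE | re.MULTILINE)


def parse_reasoning_output(text: str) -> dict:
    """Extract the label and confidence from a reply in the template's format."""
    label_match = _LABEL_RE.search(text)
    conf_match = _CONF_RE.search(text)

    label = label_match.group(1).lower() if label_match else None
    if label not in VALID_LABELS:
        label = None  # missing or unexpected label

    confidence = float(conf_match.group(1)) if conf_match else None
    if confidence is not None:
        confidence = min(max(confidence, 0.0), 1.0)  # clamp to the 0-1 range

    return {"sentiment": label, "confidence": confidence}


reply = 'Indicators: ["amazing", "love it"]\nClassification: Positive\nConfidence: 0.92'
parsed = parse_reasoning_output(reply)
print(parsed)
```

Returning None instead of guessing keeps malformed replies visible in the evaluation results rather than silently counting them as a class.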

Running variant comparison:

from promptflow import PFClient
 
pf = PFClient()
 
# Run all variants
variant_runs = {}
for variant_name in ['simple', 'few_shot', 'reasoning']:
    run = pf.run(
        flow="./sentiment_flow",
        data="reviews_test.jsonl",
        variant="${classify_sentiment." + variant_name + "}",
        column_mapping={
            "review_text": "${data.review_text}"
        }
    )
    variant_runs[variant_name] = run
 
# Evaluate each variant
results = {}
for variant_name, run in variant_runs.items():
    eval_run = pf.run(
        flow="./evaluation_flow",
        data="reviews_test.jsonl",
        run=run,
        column_mapping={
            "prediction": "${run.outputs.sentiment}",
            "review_text": "${data.review_text}",
            "ground_truth": "${data.ground_truth}"
        }
    )
    
    results[variant_name] = pf.get_metrics(eval_run)
 
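# Compare results. get_metrics() returns the metrics the evaluation flow logged
# with log_metric(), so latency and cost are present only if the flow logs them.
for variant, metrics in results.items():
    print(f"\n{variant}:")
    print(f"  Quality Score: {metrics['overall_quality']:.2f}")
    print(f"  Latency: {metrics['avg_latency']:.2f}s")
    print(f"  Cost: ${metrics['total_cost']:.2f}")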

Comparative evaluation (no ground truth):

from promptflow import tool
from openai import OpenAI
 
@tool
def compare_variants(
    review_text: str,
    prediction_a: str,
    prediction_b: str
) -> dict:
    """Compare two prompt variants directly"""
    client = OpenAI()
    
    prompt = f"""Which classification is better for this review?
 
Review: "{review_text}"
 
Classification A: {prediction_a}
Classification B: {prediction_b}
 
Which better captures the sentiment? Respond in JSON:
{{
    "winner": "A" or "B" or "tie",
    "confidence": 0.0-1.0,
    "reasoning": "explanation"
}}
"""
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    
    import json
    return json.loads(response.choices[0].message.content)
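
A single pairwise judgement says little on its own; the comparison becomes useful once it is tallied over a whole dataset. The sketch below aggregates a list of judge outputs in the JSON shape requested above. The helper names are hypothetical, and counting malformed "winner" values as ties is a design choice made here, not something the judge prompt guarantees.

```python
from collections import Counter


def _normalize_winner(value) -> str:
    """Map judge output to 'A', 'B', or 'tie'; anything unexpected counts as a tie."""
    text = str(value).strip().lower()
    return text.upper() if text in ("a", "b") else "tie"


def summarize_pairwise(judgements):
    """Tally wins per variant from a list of {"winner": ...} records."""
    counts = Counter(_normalize_winner(j.get("winner")) for j in judgements)
    decided = counts["A"] + counts["B"]
    return {
        "a_wins": counts["A"],
        "b_wins": counts["B"],
        "ties": counts["tie"],
        # Win rate among decided comparisons only; 0.0 when every comparison tied
        "a_win_rate": counts["A"] / decided if decided else 0.0,
    }


judgements = [
    {"winner": "A", "confidence": 0.9},
    {"winner": "B", "confidence": 0.6},
    {"winner": "a", "confidence": 0.8},
    {"winner": "tie", "confidence": 0.5},
]
summary = summarize_pairwise(judgements)
print(summary)
```

Because LLM judges can favour whichever answer appears first, it may be worth running each pair twice with A and B swapped and tallying both orders.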

Deployment Options and Serving

Prompt flow supports multiple deployment patterns from local development to cloud production:

Local serving for development and testing:

# Start local server
pf flow serve --source ./my_flow --port 8080 --host localhost
 
# With FastAPI engine (newer, faster)
pf flow serve --source ./my_flow --port 8080 --engine fastapi
 
# Configure workers and threads
export PROMPTFLOW_WORKER_NUM=4
export PROMPTFLOW_WORKER_THREADS=8
pf flow serve --source ./my_flow --port 8080

The local server exposes a REST API:

# Test endpoint
curl http://localhost:8080/score \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"review_text": "This product is amazing!"}'
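
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on localhost:8080 with the /score route shown in the curl call. The helper names are illustrative. Building the request is kept separate from sending it, so the request can be inspected without a running server.

```python
import json
import urllib.request


def build_score_request(base_url: str, inputs: dict) -> urllib.request.Request:
    """Build a JSON POST request for the flow's /score route."""
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(base_url: str, inputs: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_score_request(base_url, inputs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_score_request(
    "http://localhost:8080",
    {"review_text": "This product is amazing!"},
)
print(request.get_method(), request.full_url)
```

With the server running, `score("http://localhost:8080", {"review_text": "..."})` returns the decoded response body as a dict.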

Docker containerization for portable deployment:

# Build flow as Docker image
pf flow build --source ./my_flow --output ./docker_build --format docker
 
# Dockerfile generated automatically
cd docker_build
docker build -t my-flow:latest .
 
# Run container
docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  my-flow:latest

Azure App Service deployment:

# Build flow
pf flow build --source ./my_flow --output ./build --format docker
 
# Push to Azure Container Registry
az acr build --registry myregistry \
  --image my-flow:v1 \
  ./build
 
# Deploy to App Service
az webapp create \
  --resource-group mygroup \
  --plan myplan \
  --name my-flow-app \
  --deployment-container-image-name myregistry.azurecr.io/my-flow:v1

Azure AI Foundry managed endpoint:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model
 
ml_client = MLClient(...)
 
# Create endpoint
endpoint = ManagedOnlineEndpoint(
    name="sentiment-classifier-endpoint",
    description="Sentiment classification service"
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
 
# Deploy flow: the flow folder is passed as the model, and a serving
# environment for Prompt flow must also be supplied
deployment = ManagedOnlineDeployment(
    name="production",
    endpoint_name="sentiment-classifier-endpoint",
    model=Model(path="./sentiment_flow"),
    environment=...,  # Prompt flow serving environment, elided here
    instance_type="Standard_DS3_v2",
    instance_count=2
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

Azure Kubernetes Service (AKS) for advanced orchestration:

# Kubernetes deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promptflow-app
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: flow
        image: myregistry.azurecr.io/my-flow:v1
        ports:
        - containerPort: 8080
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: openai-secret
              key: api-key

Deployment considerations:

  • Environment variables for API keys and configurations
  • Health check endpoints (/health)
  • Metrics endpoints (/metrics)
  • Scaling policies (horizontal pod autoscaling)
  • Load balancing and traffic management
  • Monitoring and logging integration

VS Code Extension and Debugging

The Prompt flow VS Code extension provides an integrated development environment:

Installation:

  1. Install VS Code Python extension
  2. Install Prompt Flow extension from marketplace
  3. Set Python interpreter to appropriate environment

The visual flow designer opens from flow.dag.yaml:

  • Graphical node-and-edge representation
  • Drag-and-drop node creation
  • Visual connection drawing
  • Real-time validation of connections
  • Node configuration through UI

Debugging capabilities:

Breakpoint debugging in Python nodes:

from promptflow import tool
 
@tool
def complex_processing(data: str) -> dict:
    # Set breakpoint here
    intermediate = preprocess(data)
    
    # Inspect intermediate variable
    result = analyze(intermediate)
    
    return result

Set breakpoints in VS Code, run flow in debug mode, and inspect variables at each step.

Step-by-step execution:

  • Execute flow one node at a time
  • Pause after each node to inspect outputs
  • Continue or abort based on intermediate results
  • Modify inputs and re-run specific nodes

Interactive mode for chat flows:

# Launch interactive chat UI
pf flow test --flow ./chat_flow --interactive

This opens a chat interface for testing conversational flows with real-time debugging.

Flow testing within VS Code:

  • Test entire flow from UI
  • Test individual nodes in isolation
  • View outputs inline
  • Trace execution path
  • Monitor token usage and latency

Trace visualization:

  • Timeline view of node execution
  • Token counts per LLM call
  • Latency breakdown by node
  • Input/output inspection
  • Error stack traces

Connections and LLM Provider Configuration

Connections manage credentials and endpoints for external services:

Creating OpenAI connection:

# openai_connection.yaml
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/OpenAIConnection.schema.json
type: open_ai
name: openai_connection
api_key: sk-...
organization: org-...  # Optional

# Register the connection from the file
pf connection create --file openai_connection.yaml

Creating Azure OpenAI connection:

# azure_openai_connection.yaml
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection.schema.json
type: azure_open_ai
name: azure_openai_connection
api_key: your-api-key
api_base: https://your-resource.openai.azure.com/
api_type: azure
api_version: "2024-02-01"  # quoted so YAML reads it as a string, not a date

# Register the connection from the file
pf connection create --file azure_openai_connection.yaml

Using connections in flows:

nodes:
- name: chat
  type: llm
  source:
    type: code
    path: chat_prompt.jinja2
  inputs:
    deployment_name: gpt-4
    temperature: 0.7
  connection: openai_connection  # Reference connection by name

Environment variable support for security:

# connection.yaml with env vars
api_key: ${OPENAI_API_KEY}
api_base: ${AZURE_OPENAI_ENDPOINT}
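
One way to keep secrets out of the YAML file is to treat it as a template and fill the ${NAME} placeholders from the environment just before creating the connection. The sketch below shows that preprocessing step with the standard library only. It makes no assumption about whether Prompt flow expands such placeholders itself, and the function name is hypothetical. A regular expression is used instead of string.Template so that the `$schema` key of a connection file is left untouched.

```python
import os
import re

# Matches ${NAME} only; a bare $schema key is deliberately not matched.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_placeholders(text: str, env=None) -> str:
    """Replace every ${NAME} with env[NAME]; fail loudly if a variable is unset."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, text)


template = (
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/OpenAIConnection.schema.json\n"
    "api_key: ${OPENAI_API_KEY}\n"
)
# A fake environment is passed in so the example does not depend on real secrets.
rendered = expand_env_placeholders(template, env={"OPENAI_API_KEY": "sk-test"})
print(rendered)
```

Raising on a missing variable avoids writing a connection with an empty key, which would otherwise fail later at the first LLM call.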

Connection types supported:

  • OpenAI
  • Azure OpenAI
  • Cognitive Search
  • Serp API (web search)
  • Custom connections (arbitrary key-value pairs)

Managing connections:

# List connections
pf connection list
 
# Show connection details
pf connection show --name openai_connection
 
# Update connection
pf connection update --name openai_connection --set api_key=new-key
 
# Delete connection
pf connection delete --name openai_connection

Security best practices:

  • Store API keys in environment variables
  • Use Azure Key Vault for production
  • Never commit credentials to version control
  • Use managed identities where possible
  • Rotate keys regularly

Tracing and Observability Integration

Prompt flow integrates with OpenTelemetry for comprehensive observability:

Built-in tracing (enabled by default from v1.11; v1.17.1 disables it by default, see Version History):

from promptflow import PFClient
 
pf = PFClient()
 
# Tracing automatically enabled
run = pf.run(
    flow="./my_flow",
    data="test.jsonl"
)
 
# View traces
pf.visualize(run)  # Opens UI with trace visualization

OpenTelemetry integration:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
 
# Configure OpenTelemetry
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)
 
# Prompt flow automatically emits OTLP traces

LangWatch tracing for flows that call LangChain:

import langwatch
from openinference.instrumentation.promptflow import PromptFlowInstrumentor
from promptflow import PFClient
 
# Set up LangWatch tracing through the OpenInference Prompt flow instrumentor
langwatch.setup(
    instrumentors=[PromptFlowInstrumentor()]
)
 
pf = PFClient()
 
# Single-input test run; batch runs via pf.run() take a data file rather than inputs
result = pf.test(
    flow="./langchain_flow",
    inputs={"query": "What is AI?"}
)

Trace data captured:

  • Node execution times and latency
  • LLM calls with token counts
  • Input/output values at each node
  • Error traces and stack traces
  • Custom metrics via log_metric()

Disabling tracing when not needed:

export PF_DISABLE_TRACING=true
pf run create --flow ./my_flow --data test.jsonl

Trace destinations:

  • Local filesystem (default)
  • Azure AI Studio (when using cloud)
  • Cosmos DB (for persistence)
  • Custom OTLP endpoints (Jaeger, Zipkin)

Trace visualization shows:

  • Flow execution timeline
  • Node dependencies and data flow
  • Token usage per LLM call
  • Cost breakdown
  • Bottleneck identification

Version History and Package Structure

v1.17.1 (January 13, 2025): Bug fixes for Marshmallow 3.24 compatibility. Tracing is now disabled by default (PF_DISABLE_TRACING=true).

v1.17.0 (January 8, 2025): Dropped Python 3.8 support for security. Fixed token counting issues in tracing.

v1.16.2 (November 25, 2024): Security vulnerability patches.

v1.16.1 (October 8, 2024): Token counting bug fixes for None values.

v1.16.0 (September 30, 2024): Fixed input logging in serving app.

v1.15.0 (August 15, 2024): Fixed connection issues for local-to-cloud runs. Improved trace view for boolean outputs.

v1.14.0 (July 25, 2024): Added promptflow to Dockerfile automatically. Removed docutils dependency.

v1.13.0 (June 28, 2024): Fixed trace exporter incompatibility. Added ARM token caching for local-to-cloud runs.

v1.12.0 (June 11, 2024): Fixed ChatUI in Docker containers. Added retry logic for cloud uploads. Trace usage telemetry.

v1.11.0 (May 17, 2024): Major feature release.

  • Flex Flow: Design apps with Python functions/classes flexibility
  • Prompty: Experimental feature for simplified prompt templates (.prompty files)
  • Trace upload: Local run details uploaded to cloud when configured
  • Serving engine: Added --engine parameter (flask vs fastapi)
  • Cosmos DB tracing: Refined setup with status monitoring

v1.10.0 (April 26, 2024): FastAPI serving engine support. Chat window UI (--ui flag). Search experience in trace UI.

v1.9.0 (April 17, 2024): Autocomplete for Linux. Trace experience in flow test and batch run.

v1.8.0 (April 10, 2024): Package restructuring.

  • Split into multiple packages:
    • promptflow-tracing: Tracing capability
    • promptflow-core: Core flow execution
    • promptflow-devkit: Development tools
    • promptflow-azure: Azure integration
  • resume_from feature for resuming failed runs

Package structure implications:

# Full installation
pip install "promptflow[azure]"  # quotes keep shells such as zsh from expanding the brackets
 
# Core only (no Azure)
pip install promptflow-core
 
# Development tools
pip install promptflow-devkit
 
# Tracing only
pip install promptflow-tracing
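
Because the split produces several separately installable distributions, it can be useful to check which of them an environment actually contains. The sketch below does this with importlib.metadata from the standard library. The package list is taken from the names above, and the helper name is illustrative.

```python
from importlib import metadata

PACKAGES = [
    "promptflow",
    "promptflow-core",
    "promptflow-devkit",
    "promptflow-tracing",
    "promptflow-azure",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A production image built from promptflow-core alone would be expected to report the devkit and Azure packages as not installed.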

Backward compatibility: v1.8+ maintains API compatibility with earlier versions. Package split enables smaller dependency footprint for production deployments.