Prompt flow
Overview
Prompt flow is a development toolkit created by Microsoft for building, testing, evaluating, and deploying LLM-based applications. Released in 2023 and integrated with Azure AI Studio, the framework provides visual workflow design, systematic evaluation capabilities, and production deployment tools. Prompt flow addresses the challenge of moving from experimental prompt engineering to production-ready LLM applications through structured workflows and comprehensive testing infrastructure.
The framework employs a Directed Acyclic Graph (DAG) architecture where applications are composed of connected nodes performing specific operations (LLM calls, Python functions, prompts). This modular approach enables visual design, systematic testing, version control, and performance optimization. Prompt flow supports both local development and cloud deployment with Azure integration.
Key technical components covered:
- DAG architecture and node types
- Flow definition and YAML structure
- Evaluation framework and metrics
- Prompt variants and A/B testing
- Deployment options and serving
- VS Code extension and debugging
- Connections and LLM provider configuration
- Tracing and observability integration
- Version history and package structure
DAG Architecture and Node Types
Prompt flow implements applications as Directed Acyclic Graphs (DAGs) where nodes represent operations and edges represent data flow:
Flow definition structure:
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
inputs:
  review_text:
    type: string
outputs:
  sentiment:
    type: string
    reference: ${classify.output}
nodes:
- name: preprocess
  type: python
  source:
    type: code
    path: preprocess.py
  inputs:
    text: ${inputs.review_text}
- name: classify
  type: llm
  source:
    type: code
    path: classify.jinja2
  inputs:
    deployment_name: gpt-4
    temperature: 0.0
    processed_text: ${preprocess.output}
  connection: openai_connection
- name: format_output
  type: python
  source:
    type: code
    path: format.py
  inputs:
    classification: ${classify.output}
Node types available in Prompt flow:
LLM Node interacts with language models for text generation or processing:
- name: generate_response
  type: llm
  source:
    type: code
    path: prompt_template.jinja2
  inputs:
    deployment_name: gpt-4
    temperature: 0.7
    max_tokens: 500
    user_query: ${inputs.query}
  connection: openai_connection
  api: chat # or 'completion'
Python Node executes custom Python code for data processing:
import re
from promptflow import tool

@tool
def preprocess(text: str) -> str:
    """Clean and normalize input text"""
    text = text.strip().lower()
    # Drop punctuation; keep word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text
Prompt Node defines reusable prompt templates using Jinja2:
system:
You are a sentiment analysis expert.
user:
Classify the sentiment of this review: "{{review_text}}"
Options: positive, negative, neutral
Classification:
Condition Node implements conditional logic that controls the flow:
from promptflow import tool

@tool
def route_by_confidence(confidence: float) -> str:
    """Route to different paths based on confidence"""
    if confidence > 0.9:
        return "high_confidence_path"
    elif confidence > 0.5:
        return "medium_confidence_path"
    else:
        return "human_review_path"
Tool Node integrates external tools and APIs:
- Web search tools
- Database queries
- Custom API calls
- File operations
Aggregation Node processes collections of outputs computing summary statistics:
from typing import List
from promptflow import tool, log_metric

@tool
def aggregate_results(results: List[str]) -> dict:
    """Aggregate evaluation results"""
    total = len(results)
    correct = results.count("Correct")
    accuracy = correct / total if total > 0 else 0
    log_metric("accuracy", accuracy)
    return {
        "total": total,
        "correct": correct,
        "accuracy": accuracy
    }
Node execution follows topological order, respecting dependencies. The flow executor manages:
- Dependency resolution
- Parallel execution where possible
- Error propagation and handling
- Output passing between nodes
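As a rough illustration of dependency resolution, the sketch below derives an execution order from `${node.output}` references like those in the flow definition above. It is not Prompt flow's executor: the helper names `node_dependencies` and `execution_order` are hypothetical, and the nodes are plain dictionaries rather than a parsed flow file.

```python
import re
from graphlib import TopologicalSorter

# Matches references such as ${preprocess.output}; ${inputs.x} points at a flow input.
REFERENCE = re.compile(r"\$\{(\w+)\.[\w.]+\}")


def node_dependencies(nodes):
    """Map each node name to the set of node names it reads from."""
    deps = {}
    for node in nodes:
        upstream = set()
        for value in node.get("inputs", {}).values():
            for source in REFERENCE.findall(str(value)):
                if source != "inputs":  # flow inputs are not nodes
                    upstream.add(source)
        deps[node["name"]] = upstream
    return deps


def execution_order(nodes):
    """Return one valid topological order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(node_dependencies(nodes)).static_order())


# Nodes deliberately listed out of order to show that the order is derived.
nodes = [
    {"name": "format_output", "inputs": {"classification": "${classify.output}"}},
    {"name": "classify", "inputs": {"processed_text": "${preprocess.output}", "temperature": 0.0}},
    {"name": "preprocess", "inputs": {"text": "${inputs.review_text}"}},
]
order = execution_order(nodes)
print(order)  # ['preprocess', 'classify', 'format_output']
```

Nodes with no path between them have no ordering constraint, which is what makes parallel execution possible.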
Evaluation Framework and Metrics
Prompt flow provides comprehensive evaluation capabilities for systematic testing:
Evaluation flow structure:
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
inputs:
  prediction:
    type: string
  review_text:
    type: string
  ground_truth:
    type: string
outputs:
  accuracy:
    type: number
    reference: ${calculate_metrics.output.accuracy}
  detailed_metrics:
    type: object
    reference: ${calculate_metrics.output}
nodes:
- name: llm_judge
  type: python
  source:
    type: code
    path: llm_judge.py
  inputs:
    prediction: ${inputs.prediction}
    review_text: ${inputs.review_text}
- name: heuristic_check
  type: python
  source:
    type: code
    path: heuristic_eval.py
  inputs:
    prediction: ${inputs.prediction}
    review_text: ${inputs.review_text}
- name: ground_truth_check
  type: python
  source:
    type: code
    path: accuracy_check.py
  inputs:
    prediction: ${inputs.prediction}
    ground_truth: ${inputs.ground_truth}
- name: calculate_metrics
  type: python
  source:
    type: code
    path: aggregate_metrics.py
  inputs:
    judge_score: ${llm_judge.output}
    heuristic_score: ${heuristic_check.output}
    accuracy_score: ${ground_truth_check.output}
LLM-as-Judge implementation (no ground truth needed):
import json
from promptflow import tool
from openai import OpenAI

@tool
def llm_as_judge(prediction: str, review_text: str) -> dict:
    """
    Use GPT-4 to evaluate sentiment classification quality
    Returns multi-dimensional quality scores
    """
    client = OpenAI()
    judge_prompt = f"""You are an expert evaluator of sentiment classifications.
Review Text: "{review_text}"
Predicted Sentiment: {prediction}
Evaluate on these dimensions (0-10 scale):
1. Accuracy: Does the classification match the review's actual sentiment?
2. Consistency: Would most humans agree with this classification?
3. Reasoning: Can you justify why this classification makes sense?
4. Nuance: Does it capture subtle sentiment indicators?
Respond in JSON:
{{
"accuracy_score": 0-10,
"consistency_score": 0-10,
"reasoning_score": 0-10,
"nuance_score": 0-10,
"overall_quality": 0-10,
"explanation": "detailed reasoning",
"confidence": 0.0-1.0
}}
"""
    # JSON mode requires a model version that supports response_format
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    result = json.loads(response.choices[0].message.content)
    # Normalize to 0-1 scale
    return {
        "accuracy": result['accuracy_score'] / 10,
        "consistency": result['consistency_score'] / 10,
        "reasoning": result['reasoning_score'] / 10,
        "nuance": result['nuance_score'] / 10,
        "overall": result['overall_quality'] / 10,
        "confidence": result['confidence'],
        "explanation": result['explanation']
    }
Built-in evaluators in Azure AI Studio:
- RelevanceEvaluator: Measures response relevance to the query
- CoherenceEvaluator: Assesses logical flow and coherence
- FluencyEvaluator: Evaluates grammatical correctness and readability
- GroundednessEvaluator: Checks whether the response is grounded in the provided context
- SimilarityEvaluator: Compares semantic similarity to a reference
Custom evaluators:
from promptflow import tool

@tool
def custom_sentiment_evaluator(
    prediction: str,
    review_text: str
) -> dict:
    """
    Custom evaluation logic without ground truth
    """
    # Keyword-based evaluation
    positive_keywords = ['amazing', 'excellent', 'love', 'best', 'great']
    negative_keywords = ['terrible', 'awful', 'worst', 'waste', 'broken']
    text_lower = review_text.lower()
    pos_count = sum(1 for w in positive_keywords if w in text_lower)
    neg_count = sum(1 for w in negative_keywords if w in text_lower)
    # Calculate alignment (the +1 avoids division by zero when no keyword matches)
    pred_lower = prediction.lower().strip()
    if pred_lower == 'positive':
        alignment = pos_count / (pos_count + neg_count + 1)
    elif pred_lower == 'negative':
        alignment = neg_count / (pos_count + neg_count + 1)
    else:  # neutral
        alignment = 1.0 if (pos_count == neg_count) else 0.5
    return {
        "keyword_alignment": alignment,
        "positive_signals": pos_count,
        "negative_signals": neg_count
    }
Batch evaluation execution:
from promptflow import PFClient
pf = PFClient()
# Run main flow
base_run = pf.run(
flow="./sentiment_flow",
data="reviews_test.jsonl",
column_mapping={
"review_text": "${data.review_text}"
}
)
# Run evaluation flow
eval_run = pf.run(
flow="./evaluation_flow",
data="reviews_test.jsonl",
run=base_run,
column_mapping={
"prediction": "${run.outputs.sentiment}",
"review_text": "${data.review_text}",
"ground_truth": "${data.ground_truth}"
}
)
# Get aggregated metrics
metrics = pf.get_metrics(eval_run)
print(f"Overall Quality: {metrics['overall_quality']:.2f}")
print(f"Accuracy: {metrics['accuracy']:.2%}")
Prompt Variants and A/B Testing
Prompt variants enable systematic comparison of different prompt approaches:
Variant configuration in flow.dag.yaml:
nodes:
- name: classify_sentiment
  use_variants: true
node_variants:
  classify_sentiment:
    default_variant_id: simple
    variants:
      simple:
        node:
          type: llm
          source:
            type: code
            path: classify_simple.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.0
            review_text: ${inputs.review_text}
          connection: openai_connection
      few_shot:
        node:
          type: llm
          source:
            type: code
            path: classify_fewshot.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.0
            review_text: ${inputs.review_text}
          connection: openai_connection
      reasoning:
        node:
          type: llm
          source:
            type: code
            path: classify_reasoning.jinja2
          inputs:
            deployment_name: gpt-4
            temperature: 0.2
            review_text: ${inputs.review_text}
          connection: openai_connection
Prompt template examples:
classify_simple.jinja2:
system:
You are a sentiment classifier.
user:
Classify sentiment as: positive, negative, or neutral
Review: "{{review_text}}"
Sentiment:
classify_fewshot.jinja2:
system:
You are a sentiment classifier.
user:
Classify sentiment as: positive, negative, or neutral
Examples:
- "This is amazing!" → positive
- "It's okay." → neutral
- "Terrible quality." → negative
Review: "{{review_text}}"
Sentiment:
classify_reasoning.jinja2:
system:
You are an expert sentiment analyst.
user:
Analyze this review step-by-step:
Review: "{{review_text}}"
1. Identify key sentiment indicators (words, phrases, tone)
2. Weigh positive vs negative vs neutral signals
3. Classify as: positive, negative, or neutral
4. Provide confidence score (0-1)
Response format:
Indicators: [list key phrases]
Classification: [positive/negative/neutral]
Confidence: [0-1]
Running variant comparison:
from promptflow import PFClient

pf = PFClient()

# Run all variants
variant_runs = {}
for variant_name in ['simple', 'few_shot', 'reasoning']:
    run = pf.run(
        flow="./sentiment_flow",
        data="reviews_test.jsonl",
        variant="${classify_sentiment." + variant_name + "}",
        column_mapping={
            "review_text": "${data.review_text}"
        }
    )
    variant_runs[variant_name] = run

# Evaluate each variant
results = {}
for variant_name, run in variant_runs.items():
    eval_run = pf.run(
        flow="./evaluation_flow",
        data="reviews_test.jsonl",
        run=run,
        column_mapping={
            "prediction": "${run.outputs.sentiment}",
            "review_text": "${data.review_text}"
        }
    )
    results[variant_name] = pf.get_metrics(eval_run)

# Compare results
# These keys are present only if the evaluation flow logs them with log_metric
for variant, metrics in results.items():
    print(f"\n{variant}:")
    print(f" Quality Score: {metrics['overall_quality']:.2f}")
    print(f" Latency: {metrics['avg_latency']:.2f}s")
    print(f" Cost: ${metrics['total_cost']:.2f}")
Comparative evaluation (no ground truth):
import json
from promptflow import tool
from openai import OpenAI

@tool
def compare_variants(
    review_text: str,
    prediction_a: str,
    prediction_b: str
) -> dict:
    """Compare two prompt variants directly"""
    client = OpenAI()
    prompt = f"""Which classification is better for this review?
Review: "{review_text}"
Classification A: {prediction_a}
Classification B: {prediction_b}
Which better captures the sentiment? Respond in JSON:
{{
"winner": "A" or "B" or "tie",
"confidence": 0.0-1.0,
"reasoning": "explanation"
}}
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    return json.loads(response.choices[0].message.content)
Deployment Options and Serving
Prompt flow supports multiple deployment patterns from local development to cloud production:
Local serving for development and testing:
# Start local server
pf flow serve --source ./my_flow --port 8080 --host localhost
# With FastAPI engine (newer, faster)
pf flow serve --source ./my_flow --port 8080 --engine fastapi
# Configure workers and threads
export PROMPTFLOW_WORKER_NUM=4
export PROMPTFLOW_WORKER_THREADS=8
pf flow serve --source ./my_flow --port 8080
The local server exposes a REST API:
# Test endpoint
curl http://localhost:8080/score \
-X POST \
-H "Content-Type: application/json" \
-d '{"review_text": "This product is amazing!"}'
Docker containerization for portable deployment:
# Build flow as Docker image
pf flow build --source ./my_flow --output ./docker_build --format docker
# Dockerfile generated automatically
cd docker_build
docker build -t my-flow:latest .
# Run container
docker run -p 8080:8080 \
-e OPENAI_API_KEY=sk-... \
my-flow:latest
Azure App Service deployment:
# Build flow
pf flow build --source ./my_flow --output ./build --format docker
# Push to Azure Container Registry
az acr build --registry myregistry \
--image my-flow:v1 \
./build
# Deploy to App Service
az webapp create \
--resource-group mygroup \
--plan myplan \
--name my-flow-app \
--deployment-container-image-name myregistry.azurecr.io/my-flow:v1
Azure AI Foundry managed endpoint:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
ml_client = MLClient(...)
# Create endpoint
endpoint = ManagedOnlineEndpoint(
name="sentiment-classifier-endpoint",
description="Sentiment classification service"
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# Deploy flow
deployment = ManagedOnlineDeployment(
name="production",
endpoint_name="sentiment-classifier-endpoint",
flow_path="./sentiment_flow",
instance_type="Standard_DS3_v2",
instance_count=2
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
Azure Kubernetes Service (AKS) for advanced orchestration:
# Kubernetes deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: promptflow-app
spec:
replicas: 3
template:
spec:
containers:
- name: flow
image: myregistry.azurecr.io/my-flow:v1
ports:
- containerPort: 8080
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secret
key: api-keyDeployment considerations:
- Environment variables for API keys and configurations
- Health check endpoints (/health)
- Metrics endpoints (/metrics)
- Scaling policies (horizontal pod autoscaling)
- Load balancing and traffic management
- Monitoring and logging integration
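Whichever target is chosen, the curl example earlier in this section suggests the served flow is called with a JSON POST to /score. The client sketch below uses only the standard library. The helper names `build_score_request` and `score` are hypothetical, and the base URL assumes the local server on port 8080.

```python
import json
import urllib.request


def build_score_request(base_url, inputs):
    """Build a POST request for the flow's /score endpoint."""
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(base_url, inputs, timeout=30.0):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_score_request(base_url, inputs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request has no side effects, so it can be inspected offline.
request = build_score_request(
    "http://localhost:8080/", {"review_text": "This product is amazing!"}
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to unit test without a live endpoint.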
VS Code Extension and Debugging
The Prompt flow VS Code extension provides an integrated development environment:
Installation:
- Install VS Code Python extension
- Install Prompt Flow extension from marketplace
- Set Python interpreter to appropriate environment
Visual flow designer opens from flow.dag.yaml:
- Graphical node-and-edge representation
- Drag-and-drop node creation
- Visual connection drawing
- Real-time validation of connections
- Node configuration through UI
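To make "validation of connections" concrete, the sketch below shows one check such a validator could plausibly perform: every `${source.field}` reference must point at a declared flow input or an existing node. This is an assumption about the kind of check involved, not the extension's actual logic, and `find_broken_references` is a hypothetical helper.

```python
import re

# Captures the source and first field of references such as ${classify.output}.
REFERENCE = re.compile(r"\$\{(\w+)\.(\w+)[\w.]*\}")


def find_broken_references(flow_inputs, nodes):
    """Return (node, input_name, reference) triples whose source does not exist."""
    node_names = {node["name"] for node in nodes}
    problems = []
    for node in nodes:
        for key, value in node.get("inputs", {}).items():
            for source, field in REFERENCE.findall(str(value)):
                if source == "inputs":
                    valid = field in flow_inputs
                else:
                    valid = source in node_names
                if not valid:
                    problems.append((node["name"], key, f"${{{source}.{field}}}"))
    return problems


flow_inputs = {"review_text"}
nodes = [
    {"name": "preprocess", "inputs": {"text": "${inputs.review_text}"}},
    {"name": "classify", "inputs": {"processed_text": "${preprocess.output}"}},
    # Typo on purpose: "clasify" is not a node in this flow.
    {"name": "format_output", "inputs": {"classification": "${clasify.output}"}},
]
problems = find_broken_references(flow_inputs, nodes)
print(problems)  # [('format_output', 'classification', '${clasify.output}')]
```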
Debugging capabilities:
Breakpoint debugging in Python nodes:
from promptflow import tool

# preprocess() and analyze() stand in for helpers defined elsewhere in the project
@tool
def complex_processing(data: str) -> dict:
    # Set breakpoint here
    intermediate = preprocess(data)
    # Inspect intermediate variable
    result = analyze(intermediate)
    return result
Set breakpoints in VS Code, run the flow in debug mode, and inspect variables at each step.
Step-by-step execution:
- Execute flow one node at a time
- Pause after each node to inspect outputs
- Continue or abort based on intermediate results
- Modify inputs and re-run specific nodes
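The idea of pausing after each node can be sketched with a generator that yields every intermediate output to the caller. This is a toy stand-in rather than the extension's debugger: `run_stepwise` is a hypothetical helper, the steps are assumed to be listed in execution order, and the `classify` lambda replaces a real LLM call.

```python
def run_stepwise(steps, flow_inputs):
    """Yield (node_name, output) after each node so the caller can inspect or stop."""
    outputs = {}
    for name, func, arg_sources in steps:
        kwargs = {}
        for arg, (source, key) in arg_sources.items():
            # ("inputs", key) reads a flow input; (node, "output") reads a node result.
            kwargs[arg] = flow_inputs[key] if source == "inputs" else outputs[source]
        outputs[name] = func(**kwargs)
        yield name, outputs[name]


steps = [
    ("preprocess", lambda text: text.strip().lower(),
     {"text": ("inputs", "review_text")}),
    ("classify", lambda processed_text: "positive" if "amazing" in processed_text else "neutral",
     {"processed_text": ("preprocess", "output")}),
    ("format_output", lambda classification: {"sentiment": classification},
     {"classification": ("classify", "output")}),
]

trace = []
for name, output in run_stepwise(steps, {"review_text": "  This is AMAZING!  "}):
    trace.append((name, output))
    if name == "classify" and output == "neutral":
        break  # abort based on an intermediate result

print(trace)
```

Because the generator is lazy, breaking out of the loop means later nodes never run.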
Interactive mode for chat flows:
# Launch interactive chat UI
pf flow test --flow ./chat_flow --interactive
This opens a chat interface for testing conversational flows with real-time debugging.
Flow testing within VS Code:
- Test entire flow from UI
- Test individual nodes in isolation
- View outputs inline
- Trace execution path
- Monitor token usage and latency
Trace visualization:
- Timeline view of node execution
- Token counts per LLM call
- Latency breakdown by node
- Input/output inspection
- Error stack traces
Connections and LLM Provider Configuration
Connections manage credentials and endpoints for external services:
Creating OpenAI connection:
# openai_connection.yaml
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/OpenAIConnection.schema.json
type: open_ai
name: openai_connection
api_key: sk-...
organization: org-... # Optional
pf connection create --file openai_connection.yaml
Creating Azure OpenAI connection:
# azure_openai_connection.yaml
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection.schema.json
type: azure_open_ai
name: azure_openai_connection
api_key: your-api-key
api_base: https://your-resource.openai.azure.com/
api_type: azure
api_version: 2024-02-01
pf connection create --file azure_openai_connection.yaml
Using connections in flows:
nodes:
- name: chat
  type: llm
  source:
    type: code
    path: chat_prompt.jinja2
  inputs:
    deployment_name: gpt-4
    temperature: 0.7
  connection: openai_connection # Reference connection by name
Environment variable support for security:
# connection.yaml with env vars
api_key: ${OPENAI_API_KEY}
api_base: ${AZURE_OPENAI_ENDPOINT}
Connection types supported:
- OpenAI
- Azure OpenAI
- Cognitive Search
- Serp API (web search)
- Custom connections (arbitrary key-value pairs)
Managing connections:
# List connections
pf connection list
# Show connection details
pf connection show --name openai_connection
# Update connection
pf connection update --name openai_connection --set api_key=new-key
# Delete connection
pf connection delete --name openai_connection
Security best practices:
- Store API keys in environment variables
- Use Azure Key Vault for production
- Never commit credentials to version control
- Use managed identities where possible
- Rotate keys regularly
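The first practice can be illustrated with a small sketch that reads a key from the environment, fails fast when it is missing, and masks it before it reaches any log. The helper names `load_api_key` and `mask_secret` are hypothetical, and the key value is a made-up placeholder.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY", environ=None):
    """Read a key from the environment and fail fast if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; refusing to start without credentials")
    return key


def mask_secret(secret, visible=4):
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# A dict stands in for os.environ so the example runs without real credentials.
key = load_api_key(environ={"OPENAI_API_KEY": "sk-example-1234"})
print(mask_secret(key))  # ***********1234
```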
Tracing and Observability Integration
Prompt flow integrates with OpenTelemetry for comprehensive observability:
Built-in tracing (enabled by default in v1.11+):
from promptflow import PFClient
pf = PFClient()
# Tracing automatically enabled
run = pf.run(
flow="./my_flow",
data="test.jsonl"
)
# View traces
pf.visualize(run) # Opens UI with trace visualization
OpenTelemetry integration:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Configure OpenTelemetry
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(otlp_exporter)
)
# Prompt flow automatically emits OTLP traces
LangChain integration tracing:
import langwatch
from openinference.instrumentation.promptflow import PromptFlowInstrumentor
from promptflow import PFClient
# Setup tracing for Prompt flow + LangChain
langwatch.setup(
instrumentors=[PromptFlowInstrumentor()]
)
pf = PFClient()
# All Prompt flow and LangChain operations traced
# pf.test runs the flow once with inline inputs; pf.run is for batch runs over a data file
result = pf.test(
flow="./langchain_flow",
inputs={"query": "What is AI?"}
)
Trace data captured:
- Node execution times and latency
- LLM calls with token counts
- Input/output values at each node
- Error traces and stack traces
- Custom metrics via log_metric()
Disabling tracing when not needed:
export PF_DISABLE_TRACING=true
pf run create --flow ./my_flow --data test.jsonl
Trace destinations:
- Local filesystem (default)
- Azure AI Studio (when using cloud)
- Cosmos DB (for persistence)
- Custom OTLP endpoints (Jaeger, Zipkin)
Trace visualization shows:
- Flow execution timeline
- Node dependencies and data flow
- Token usage per LLM call
- Cost breakdown
- Bottleneck identification
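The sketch below shows how such a summary could be computed from raw span records. The record layout (`node`, `start`, `end`, `total_tokens`) and the price per 1,000 tokens are hypothetical values chosen for illustration, not Prompt flow's trace schema or any provider's pricing.

```python
from collections import defaultdict


def summarize_spans(spans, price_per_1k_tokens):
    """Aggregate latency and tokens per node, estimate cost, and find the slowest node."""
    totals = defaultdict(lambda: {"latency_s": 0.0, "tokens": 0})
    for span in spans:
        entry = totals[span["node"]]
        entry["latency_s"] += span["end"] - span["start"]
        entry["tokens"] += span.get("total_tokens", 0)

    summary = {}
    for name, entry in totals.items():
        summary[name] = {
            "latency_s": round(entry["latency_s"], 3),
            "tokens": entry["tokens"],
            "cost": round(entry["tokens"] / 1000 * price_per_1k_tokens, 6),
        }
    # The bottleneck is the node with the largest accumulated latency.
    bottleneck = max(summary, key=lambda n: summary[n]["latency_s"]) if summary else None
    return summary, bottleneck


# Hypothetical spans; times are seconds from the start of the run.
spans = [
    {"node": "preprocess", "start": 0.00, "end": 0.02},
    {"node": "classify", "start": 0.02, "end": 1.27, "total_tokens": 180},
    {"node": "format_output", "start": 1.27, "end": 1.28},
]
summary, bottleneck = summarize_spans(spans, price_per_1k_tokens=0.03)
print(bottleneck, summary[bottleneck])
```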
Version History and Package Structure
v1.17.1 (January 13, 2025): Bug fixes for Marshmallow 3.24 compatibility. Default disables tracing (PF_DISABLE_TRACING=true).
v1.17.0 (January 8, 2025): Dropped Python 3.8 support for security. Fixed token counting issues in tracing.
v1.16.2 (November 25, 2024): Security vulnerability patches.
v1.16.1 (October 8, 2024): Token counting bug fixes for None values.
v1.16.0 (September 30, 2024): Fixed input logging in serving app.
v1.15.0 (August 15, 2024): Fixed connection issues for local-to-cloud runs. Improved trace view for boolean outputs.
v1.14.0 (July 25, 2024): Added promptflow to Dockerfile automatically. Removed docutils dependency.
v1.13.0 (June 28, 2024): Fixed trace exporter incompatibility. Added ARM token caching for local-to-cloud runs.
v1.12.0 (June 11, 2024): Fixed ChatUI in Docker containers. Added retry logic for cloud uploads. Trace usage telemetry.
v1.11.0 (May 17, 2024): Major feature release.
- Flex Flow: Design apps with the flexibility of Python functions/classes
- Prompty: Experimental feature for simplified prompt templates (.prompty files)
- Trace upload: Local run details uploaded to cloud when configured
- Serving engine: Added --engine parameter (flask vs fastapi)
- Cosmos DB tracing: Refined setup with status monitoring
v1.10.0 (April 26, 2024): FastAPI serving engine support. Chat window UI (--ui flag). Search experience in trace UI.
v1.9.0 (April 17, 2024): Autocomplete for Linux. Trace experience in flow test and batch run.
v1.8.0 (April 10, 2024): Package restructuring.
- Split into multiple packages:
  - promptflow-tracing: Tracing capability
  - promptflow-core: Core flow execution
  - promptflow-devkit: Development tools
  - promptflow-azure: Azure integration
- resume_from feature for resuming failed runs
Package structure implications:
# Full installation
pip install promptflow[azure]
# Core only (no Azure)
pip install promptflow-core
# Development tools
pip install promptflow-devkit
# Tracing only
pip install promptflow-tracing
Backward compatibility: v1.8+ maintains API compatibility with earlier versions. The package split enables a smaller dependency footprint for production deployments.
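Since an environment may now contain only some of these packages, it can help to check which ones are present. The sketch below uses the standard library's `importlib.metadata`. The package names come from the v1.8.0 list above, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "promptflow-tracing",
    "promptflow-core",
    "promptflow-devkit",
    "promptflow-azure",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

A production image that serves flows would then be expected to report only the subset it actually needs.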