When I joined Pinterest’s Infrastructure Governance team, I found a system that had organically grown over 2 years into 80+ denormalized tables with 4.36TB of cost data, 50 tightly-coupled Python pipelines with 90% code duplication, no domain model, and a budget management tool built as a monolithic Flask app with no tests or type safety. The team was being asked to build a new entitlement platform for budget-aware provisioning, but they lacked the architectural foundation to do so safely.

Short version

Phase 1: Domain Model & Bounded Contexts “I decomposed the monolithic system into five bounded contexts:

Budget Planning - Multi-step approval workflows, hierarchical projections
Cost Attribution - Core domain entities (cost_center, usage_record, cost_attribution_record)
Cost Aggregation - Separated from pipeline logic
Advanced Forecasting - Marimo notebooks
Efficiency Monitoring”

Phase 2: Budget Planning System (50k LOC) “I rebuilt the budget planning system from scratch using:

DDD + Ports & Adapters for testability and clear boundaries
GraphQL API (10 mutations, complex queries) for flexible planning cycle operations
Cedar policy language for fine-grained authorization (even built a Neovim plugin for it)
Automated integration testing with Docker Compose in Bazel CI/CD (solved complex symlink mounting issues)

The planning workflow supports:

Multi-org, multi-platform hierarchical budgeting
6-stage approval flow with role-based locking
Growth rate modeling + manual overrides per resource type
Real-time changelog tracking via GraphQL subscriptions”

Phase 3: Data Pipeline Modernization “I led the migration from duplicated Python/SQL to DBT:

Created reusable macros for common transformations (data cleanup, regex patterns)
Completed native platform attribution models (30-40 models)
Designed the path to 150+ models covering all multi-tenant platforms
Separated cost center hierarchy management from hard-coded pipeline logic”

The Temporal Decision: “For the upcoming entitlement platform, I evaluated three orchestration options:

Spinner / Airflow: high submission latency
AWS Step Functions: hard to test, logic lives in JSON
Temporal: lower latency, queries and signals, workflow-per-project pattern for deduplication

Proposed Architecture:

Each Nimbus project = Durable workflow
Budget check → Reserve → Provision → Confirm/Rollback
Use Temporal queries for real-time budget state
Use deterministic workflow IDs to avoid duplicate provision requests
Use Temporal signals for external events such as cancellation
Built POC for EC2 ASG provisioning to validate approach”

Re-architecting Infrastructure Cost Governance at Pinterest Scale

The Problem

The System Context:

Scale: ~10GB/day of cost data from AWS Cost Explorer and Pinterest’s multi-tenant platforms (Moka for data processing, Pincompute for compute)
Complexity: Thousands of cost centers (Nimbus projects, LDAP groups) with hundreds of budget transfer requests monthly
Data Warehouse: 4.36 TB accumulated over 2 years with no logical domain model

Specific Technical Debt:

The 50 data pipelines used string .replace() for SQL templating instead of Jinja - the team had actually overridden Airflow’s SparkSQL operator to bypass proper templating
Each pipeline contained 5-10 CTEs doing similar transformations, with 90% code duplication across pipelines
Cost center aggregation logic was hard-wired into the structure of tables and the sequence of pipeline execution - changing how costs rolled up required modifying multiple pipelines and table schemas
The budget transfer tool was a Flask/SQLAlchemy monolith with no separation between REST handlers, business logic, and database access
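To illustrate why the chained .replace() templating in the first item is fragile, here is a minimal sketch. It uses Python's standard-library string.Template purely as a dependency-free stand-in for Jinja; the query text, table name, and parameter names are hypothetical and not taken from the actual pipelines.

```python
from string import Template

# Hypothetical query; "$table" and "$ds" are the placeholders to fill in.
SQL = "SELECT project, SUM(cost) FROM $table WHERE ds = '$ds' GROUP BY project"


def render_with_replace(sql: str, params: dict[str, str]) -> str:
    """The fragile approach: a missing parameter is silently left in the SQL."""
    for key, value in params.items():
        sql = sql.replace(f"${key}", value)
    return sql


def render_strict(sql: str, params: dict[str, str]) -> str:
    """A template engine fails loudly when a placeholder has no value."""
    return Template(sql).substitute(params)  # raises KeyError on a missing key


params = {"table": "cost_daily"}  # "ds" was forgotten

# The .replace() version returns SQL that still contains the literal "$ds".
broken = render_with_replace(SQL, params)

# The strict version refuses to render an incomplete query.
try:
    render_strict(SQL, params)
    strict_failed = False
except KeyError:
    strict_failed = True
```

The point is the failure mode rather than the engine: with a real template engine, an incomplete query is rejected at render time instead of reaching the warehouse.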

The New Challenge: The adjacent team needed to build an “Entitlement” platform - ensuring budget availability before provisioning resources. Without architectural intervention, they would have implemented this as ad-hoc distributed RPC calls between services, inevitably leading to race conditions, duplicate reservations, and lost budget updates.

My Approach

Phase 1: Domain Model & Bounded Contexts

The first challenge was conceptual. The system had grown organically with no domain model - just tables and pipelines. There wasn’t even a budget planning capability; only budget transfers existed. When planning was later needed, someone decided to build it as a feature of the transfers tool, creating fundamental architectural coupling.

I decomposed the monolithic system into five bounded contexts with clear boundaries:

1. Budget Planning (New - What I Built)

This was the core system I implemented from scratch over 5 months (50k lines of code):

The Planning Hierarchy:

Organization → Platform (EC2, S3, Moka, Pincompute) → Resource Type (m5.large, s3-standard, etc.) → Monthly projections

The Multi-Step Approval Workflow:

New: Planning cycle created, initializes from previous year’s actual spending
Spend Captain Review: Organization-level review, can set platform-level targets and locks
Platform Owner Review: Platform teams review their allocations
Service Owner Review: Individual project owners adjust their forecasts
Final Review: Finance reviews consolidated plan
Approved: Plan becomes the official budget
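The six stages above can be sketched as a small state machine. This is an illustration only, assuming a strictly linear flow with no "send back to previous stage" transition (the source does not say whether one exists); the class and exception names are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    NEW = "New"
    SPEND_CAPTAIN_REVIEW = "Spend Captain Review"
    PLATFORM_OWNER_REVIEW = "Platform Owner Review"
    SERVICE_OWNER_REVIEW = "Service Owner Review"
    FINAL_REVIEW = "Final Review"
    APPROVED = "Approved"


# Stages in approval order; Enum iteration preserves definition order.
ORDER = list(Stage)


class InvalidTransition(Exception):
    """Raised when a transition is not allowed from the current stage."""


class PlanningCycle:
    def __init__(self) -> None:
        self.stage = Stage.NEW

    def advance(self) -> Stage:
        """Move to the next stage; an approved plan cannot advance further."""
        index = ORDER.index(self.stage)
        if index == len(ORDER) - 1:
            raise InvalidTransition("cycle is already approved")
        self.stage = ORDER[index + 1]
        return self.stage
```

Keeping the transition rule in one place means callers cannot jump from New straight to Approved by assigning a status field.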

Key Planning Operations:

Initialize projections by querying the 4.36 TB data warehouse for last year’s actuals
Apply growth rates (percentage increase month-over-month) or manual overrides per resource type
Lock/unlock by role: when a Spend Captain locks the EC2 platform, Service Owners cannot modify EC2 projections
Maintain a changelog for every modification for audit compliance
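The growth-rate and override operations can be sketched as follows. This is a minimal illustration, assuming growth compounds month-over-month from a single baseline and that an override replaces only its own month without shifting later months; the function name and signature are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

CENT = Decimal("0.01")


def project_monthly_spend(
    baseline: Decimal,
    monthly_growth: Decimal,
    overrides: Optional[dict[int, Decimal]] = None,
    months: int = 12,
) -> list[Decimal]:
    """Return one projected amount per month (month 1 is the first entry)."""
    overrides = overrides or {}
    projections = []
    for month in range(1, months + 1):
        if month in overrides:
            # A manual override wins over the modeled value for that month.
            projections.append(overrides[month])
        else:
            modeled = baseline * (1 + monthly_growth) ** month
            projections.append(modeled.quantize(CENT, rounding=ROUND_HALF_UP))
    return projections


# 10% month-over-month growth on a 1000.00 baseline, with month 2 overridden.
plan = project_monthly_spend(
    Decimal("1000"), Decimal("0.10"), overrides={2: Decimal("500")}, months=3
)
```

Decimal is used instead of float so that projected amounts round predictably to cents.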

2. Cost Attribution (Domain Model Extraction)

I introduced explicit domain entities that had been implicit across the 80+ denormalized tables:

cost_center - The fundamental unit (a Nimbus project or LDAP group)
cost_center_hierarchy - Parent-child relationships for rolling up costs
usage_record - Raw usage from platforms (EC2 hours, S3 GB, Moka compute units)
cost_attribution_record - Final attributed costs after allocation logic

This domain model now serves as the foundation for all cost operations, replacing pipeline-specific logic.

3. Cost Aggregation (Separated from Pipelines)

Previously, aggregation logic was embedded in the sequence of pipeline execution. I extracted this into a separate bounded context responsible for:

Aggregating costs up the hierarchy (project → team → org)
Computing platform-level summaries
Generating targets vs. actuals comparisons for budget tracking
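As a sketch of the first responsibility, the roll-up below walks an explicit parent-child hierarchy and adds each attributed cost to every ancestor. The entity names mirror cost_center and cost_attribution_record from the domain model, but the field names and the in-memory representation are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class CostCenter:
    id: str
    parent_id: Optional[str]  # None marks the root of the hierarchy


@dataclass(frozen=True)
class CostAttributionRecord:
    cost_center_id: str
    amount: Decimal


def roll_up(
    centers: list[CostCenter], records: list[CostAttributionRecord]
) -> dict[str, Decimal]:
    """Total cost per cost center, including everything attributed below it."""
    parents = {center.id: center.parent_id for center in centers}
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for record in records:
        node: Optional[str] = record.cost_center_id
        seen: set[str] = set()
        while node is not None:
            if node in seen:
                raise ValueError(f"cycle in hierarchy at {node!r}")
            seen.add(node)
            totals[node] += record.amount
            node = parents[node]  # KeyError here means an unknown cost center
    return dict(totals)
```

Because the hierarchy is data rather than table structure, a reorganization only changes the parent_id values; the roll-up code stays the same.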

4. Advanced Budget Forecasting

Built using Marimo notebooks for interactive financial modeling:

Trend analysis on historical spending patterns
Scenario modeling (what-if analysis for different growth assumptions)
Anomaly detection to flag unusual spending

5. Efficiency Monitoring

Cross-platform efficiency metrics:

Cost per transaction or request
Resource utilization rates
Waste detection (idle or underutilized resources)

Future Work Identified: I recognized that rate management (Pinterest’s blended rates for native and multi-tenant resources) should be a separate microservice, especially since the Entitlement platform would need rate information for budget checks. This represents architectural debt we’re working to address.

Phase 2: Budget Planning System Architecture (50k LOC)

I rebuilt the budget planning system from scratch, writing approximately 50,000 lines of production code in 5 months.

Architectural Pattern: Ports & Adapters (Hexagonal Architecture)

Why this pattern: The team lacked software engineering fundamentals. Ports & Adapters forces structural discipline - a pure domain core with no external dependencies, surrounded by adapters for infrastructure concerns. This was critical for:

Testability: Can test business logic without databases or external APIs
Future flexibility: Can extract components into separate microservices later
Forcing function: Team couldn’t fall back to tightly-coupled patterns

The Layers:

Domain Layer: Pure business logic
- Planning cycle state machine (enforces valid state transitions)
- Approval workflow rules (who can approve at each stage)
- Resource projection calculations (growth rates, overrides)
- Authorization policies (implemented via Cedar language)
Application Layer: Use case orchestration
- “Create Planning Cycle” use case queries data warehouse, creates domain objects
- “Approve Stage” use case validates approver permissions, transitions state
Infrastructure Layer: Adapters for external systems
- PostgreSQL for persistence
- Data warehouse for historical actuals
- Nimbus API for project metadata
- HR system API for LDAP group information
- Cedar policy engine for authorization
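A minimal sketch of how these layers fit together, assuming a single port for historical actuals. All names here (HistoricalActualsPort, CreatePlanningCycle, InMemoryActuals) are hypothetical and chosen for illustration; they are not the production interfaces.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Protocol


class HistoricalActualsPort(Protocol):
    """Port: what the application needs, with no mention of a warehouse."""

    def last_year_spend(self, project_id: str, platform: str) -> Decimal: ...


@dataclass(frozen=True)
class Projection:
    project_id: str
    platform: str
    baseline: Decimal


class CreatePlanningCycle:
    """Application-layer use case; it depends only on the port."""

    def __init__(self, actuals: HistoricalActualsPort) -> None:
        self._actuals = actuals

    def execute(self, project_ids: list[str], platform: str) -> list[Projection]:
        return [
            Projection(pid, platform, self._actuals.last_year_spend(pid, platform))
            for pid in project_ids
        ]


class InMemoryActuals:
    """Test adapter: satisfies the port without touching any database."""

    def __init__(self, data: dict[tuple[str, str], Decimal]) -> None:
        self._data = data

    def last_year_spend(self, project_id: str, platform: str) -> Decimal:
        # Projects with no history fall back to zero in this sketch.
        return self._data.get((project_id, platform), Decimal("0"))


use_case = CreatePlanningCycle(InMemoryActuals({("proj-a", "EC2"): Decimal("1200")}))
projections = use_case.execute(["proj-a", "proj-b"], "EC2")
```

A warehouse-backed adapter would implement the same method, so the use case can be unit-tested with the in-memory adapter and run in production with the real one.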

GraphQL API Design

Why GraphQL over REST:

Hierarchical queries: A planning cycle is inherently hierarchical (Org → Platform → Resource → Monthly entries). GraphQL lets clients request the exact depth they need.
Role-based data fetching: A Service Owner sees only their projects; a Spend Captain sees the entire org. GraphQL resolvers handle this naturally.
Real-time updates: GraphQL subscriptions enable live updates during collaborative planning sessions.

The API Surface:

10 Mutations: Create cycle, update projections, apply growth rates, lock/unlock platforms, approve stage, bulk operations, etc.
Complex Queries:
- Planning cycle with nested hierarchy and filtering
- Changelog with date range and user filters
- Summary aggregations (total by platform, variance from targets)

Example Query Pattern:

query PlanningCycleDetails($orgId: ID!, $cycleId: ID!) {
  planningCycle(orgId: $orgId, id: $cycleId) {
    status
    targetsByPlatform { platform amount }
    entries(platform: EC2) {
      project { id name }
      resources {
        resourceType
        historicalSpend
        plannedSpend
        variance
      }
    }
  }
}

Authorization with Cedar Policy Language

Why Cedar: Cedar is Amazon’s open-source policy language - expressive, auditable, and supports complex hierarchical permissions.

The Challenge:

Different roles have different capabilities at different workflow stages
Locks cascade (platform lock prevents resource-level changes)
Policies must be version-controlled and testable

Example Policy:

permit(
  principal in Role::"SpendCaptain",
  action == Action::"Lock",
  resource in Platform::"EC2"
) when {
  context.planning_cycle.status == "SpendCaptainReview"
};

Implementation Detail: The Cedar policies are embedded within the infra-budget-tool service (not a separate service). I also built a Neovim plugin around the Cedar Tree-sitter parser to improve the developer experience when writing policies.

Key Technical Challenges Solved

1. Initializing from Historical Data

Challenge: Query 4.36 TB warehouse efficiently to get last year’s spending by project, platform, and resource type
Solution: Materialized views for common aggregations, incremental refresh patterns
Edge Case: New projects with no history default to org-level averages for their platform

2. Locking Semantics

Challenge: Implement cascading locks (platform lock prevents resource changes) with role-specific unlock capabilities
Solution: Domain model tracks locks as first-class entities with lock scope and owner role
Complexity: When Spend Captain locks EC2, Service Owners see read-only view but Platform Owner can still unlock
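A sketch of the cascading-lock check, assuming locks are scoped to a platform and that the roles allowed to unlock are also the ones allowed to edit while a lock is held. That second rule is an assumption for illustration; the source only states that Service Owners become read-only and that Platform Owners can unlock.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SPEND_CAPTAIN = "SpendCaptain"
    PLATFORM_OWNER = "PlatformOwner"
    SERVICE_OWNER = "ServiceOwner"


# Assumption: these roles may remove, and edit through, a platform-level lock.
PRIVILEGED = {Role.SPEND_CAPTAIN, Role.PLATFORM_OWNER}


@dataclass(frozen=True)
class Lock:
    platform: str     # lock scope, e.g. "EC2"
    owner_role: Role  # role that placed the lock


def can_modify(role: Role, platform: str, resource_type: str, locks: list[Lock]) -> bool:
    """A platform lock cascades to every resource type under that platform.

    resource_type is deliberately not consulted: the lock applies to the
    whole platform, so any resource below it inherits the restriction.
    """
    locked = any(lock.platform == platform for lock in locks)
    return not locked or role in PRIVILEGED


def can_unlock(role: Role, lock: Lock) -> bool:
    """Role-specific unlock: Service Owners can never lift a platform lock."""
    return role in PRIVILEGED
```

In the real system these decisions are expressed as Cedar policies; the Python version only shows the shape of the rule.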

3. Concurrent Modifications

Challenge: Multiple users editing the same planning cycle simultaneously
Solution: Optimistic locking with version numbers, GraphQL subscriptions for real-time conflict detection
UX Impact: Users see others’ changes in real-time and can coordinate via lock announcements
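Optimistic locking with version numbers can be sketched with an in-memory store. The store, entry, and exception names are hypothetical; in a relational database the same check is typically an UPDATE guarded by "WHERE version = expected_version".

```python
from dataclasses import dataclass, replace
from decimal import Decimal


class StaleVersionError(Exception):
    """Raised when a write is based on an outdated read."""


@dataclass
class ProjectionEntry:
    planned_spend: Decimal
    version: int = 1


class ProjectionStore:
    def __init__(self) -> None:
        self._entries: dict[str, ProjectionEntry] = {}

    def put(self, key: str, entry: ProjectionEntry) -> None:
        self._entries[key] = entry

    def read(self, key: str) -> ProjectionEntry:
        # Return a copy so callers cannot mutate stored state directly.
        return replace(self._entries[key])

    def update(self, key: str, new_spend: Decimal, expected_version: int) -> int:
        """Apply the write only if nobody else has written since the read."""
        current = self._entries[key]
        if current.version != expected_version:
            raise StaleVersionError(
                f"{key}: expected v{expected_version}, found v{current.version}"
            )
        current.planned_spend = new_spend
        current.version += 1
        return current.version
```

The second of two concurrent editors receives StaleVersionError and must re-read before retrying, which is the point where a client can surface the conflict to the user.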

4. Audit Trail / Changelog

Challenge: Track every modification (who, when, what changed, why) for financial compliance
Solution: Event-sourcing lite - append-only log, queryable via GraphQL with rich filtering
Scale: Changelog grows with every modification but remains queryable via indexed fields

Automated Integration Testing

The Testing Challenge: The budget planning system has complex dependencies: PostgreSQL database, data warehouse connections, external APIs (Nimbus, HR system). The team had no integration tests.

The Solution: I implemented Docker Compose-based integration testing in Bazel CI/CD.

The Technical Hurdle: Bazel runs tests in a sandbox where input files are symlinks into Bazel’s output tree. When such a directory is bind-mounted into a container, the links dangle, because their targets live outside the mounted path.

My Workaround:

Read files from BAZEL_TEST_DIR (which contains symlinks)
Copy contents to a temporary directory (real files)
Mount the temp directory in Docker Compose
Clean up after test completion
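The four steps above can be sketched with the standard library. This is an illustration rather than the actual test harness; the function name and the "fixtures" directory name are hypothetical, and the source directory is passed in as an argument.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def materialized_copy(source_dir: Path) -> Iterator[Path]:
    """Copy a directory of symlinks into a temp directory of real files."""
    tmp_root = Path(tempfile.mkdtemp(prefix="compose-fixtures-"))
    try:
        target = tmp_root / "fixtures"
        # symlinks=False copies the files the links point to, so the
        # resulting tree contains regular files only.
        shutil.copytree(source_dir, target, symlinks=False)
        yield target
    finally:
        # Clean up even if the test body raises.
        shutil.rmtree(tmp_root, ignore_errors=True)
```

The yielded path is what the Docker Compose volume would point at for the duration of the test; once the with-block exits, the temporary copy is removed.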

Impact: Every pull request now runs full integration tests (database + API + external mocks) in ~3 minutes.

Phase 3: Data Pipeline Modernization with DBT

The existing 50 data pipelines represented extreme technical debt. I led the migration to DBT to introduce modularity and reusability.

The Core Problem:

90% code duplication: the same transformations (cleaning project names, parsing LDAP groups, applying allocation logic) repeated across pipelines
No templating: string .replace() instead of Jinja
No lineage: impossible to trace which downstream tables depend on which upstream data
No testing: SQL was embedded in Python strings with no validation

The Migration Strategy:

Phase 1: Reusable DBT Macros (Completed)

I identified repeated patterns and created macros:

normalize_project_name(): CASE statements to standardize Nimbus project naming conventions
parse_ldap_group(): Regex patterns to extract group hierarchies from LDAP strings
coalesce_cost_center(): Fallback logic when multiple cost center identifiers exist
allocate_shared_cost(): Proportional allocation algorithms for shared resources
apply_blended_rates(): Pinterest’s custom rate calculations for multi-tenant platforms
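The macros themselves are SQL/Jinja, but the proportional allocation idea behind allocate_shared_cost() can be restated in Python. This is a sketch under stated assumptions: costs are at cent precision, usage values are non-negative, and the rounding remainder goes to the largest consumer. The actual macro's rounding rules are not described in the source.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")


def allocate_shared_cost(
    shared_cost: Decimal, usage: dict[str, Decimal]
) -> dict[str, Decimal]:
    """Split shared_cost across cost centers in proportion to their usage."""
    total_usage = sum(usage.values(), Decimal("0"))
    if total_usage == 0:
        raise ValueError("cannot allocate: total usage is zero")
    shares = {
        center: (shared_cost * units / total_usage).quantize(CENT, rounding=ROUND_DOWN)
        for center, units in usage.items()
    }
    # Rounding down can leave a few cents unassigned; give them to the
    # largest consumer so the shares sum exactly to shared_cost.
    remainder = shared_cost - sum(shares.values(), Decimal("0"))
    largest = max(usage, key=usage.get)
    shares[largest] += remainder
    return shares


shares = allocate_shared_cost(
    Decimal("100.00"),
    {"proj-a": Decimal("1"), "proj-b": Decimal("1"), "proj-c": Decimal("1")},
)
```

Assigning the remainder explicitly keeps attributed totals reconcilable against the original bill, which matters more in cost reporting than which center absorbs the last cent.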

Phase 2: Native Platform Models (30-40 models completed)

For AWS services where we don’t need platform team input:

EC2 instance attribution by project (using tags and instance metadata)
S3 storage attribution by bucket ownership
RDS database costs by instance metadata
CloudFront distribution costs

These models replace approximately 15 of the original 50 pipelines, with full lineage tracking and testing.

Phase 3: Multi-Tenant Platform Models (In Progress, 100+ models planned)

For Pinterest’s internal platforms (Moka, Pincompute), migration is more complex because:

Usage data comes from platform-specific APIs with different schemas
Attribution logic varies by platform (Moka uses job-based attribution, Pincompute uses task-based)
Resource hierarchies are platform-defined

Architectural Insight: The original pipeline architecture hard-coded cost center aggregation into table structures and pipeline sequencing. By modeling cost_center_hierarchy explicitly as a DBT dimension table, we can now change aggregation rules (e.g., reorganizations, team moves) without rewriting pipelines - just update the hierarchy table and re-run.

Current Status:

30-40 DBT models in production
Remaining 100+ models designed and prioritized
Team trained on DBT development patterns
CI/CD pipeline validates DBT models on every commit

Phase 4: Temporal for the Entitlement Platform

The adjacent team was tasked with building the “Entitlement” platform: ensuring budget availability before provisioning resources. I evaluated orchestration options and recommended Temporal.

The Requirement: For every resource provision request (e.g., create EC2 Auto Scaling Group):

Check: Does the project have available budget?
Reserve: Temporarily hold the estimated cost (a time-limited hold, not yet a deduction)
Provision: Call AWS/internal APIs to create the resource
Confirm or Rollback:
- Success: Deduct actual cost from budget
- Failure: Release the reservation

This is a distributed transaction across budget and provisioning systems with partial failure scenarios.

Option 1: Airflow (Spinner)

Pinterest uses Airflow heavily via a custom platform called “Spinner,” integrated with Moka and other internal systems.

Why it seemed reasonable:

Familiar to the team
Existing infrastructure and operational expertise
Good integration with Pinterest’s data platforms

Why it doesn’t work:

High submission latency: Scheduling a DAG takes minutes, unacceptable for synchronous provision requests
No real-time state queries: Cannot query “What’s the current reservation amount for project X?”
Immutable DAGs: Cannot modify workflow logic for running tasks
Wrong abstraction: Airflow is designed for batch ETL, not transactional workflows

Verdict: ❌ Wrong tool for the job

Option 2: AWS Step Functions

Why it seemed reasonable:

Managed service, no operational overhead
Native AWS integration
Built-in error handling and retries

Why it’s limiting:

Tight coupling to AWS: Vendor lock-in, blocks multi-cloud strategy
Poor observability: Limited ability to query workflow state programmatically
Expensive for long-running workflows: Charged per state transition; some provisions take hours
JSON state machines: Hard to version control, test, and maintain
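Limited advanced patterns: No first-class signals or queries against in-workflow state; external events have to go through task-token callbacks, and coordination is limited to nested executions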

Verdict: ❌ Too restrictive for complex orchestration

Option 3: Temporal (Recommended)

Why Temporal is the right choice:

1. Low Latency

Workflows start in milliseconds, suitable for synchronous provision requests
No scheduling overhead like Airflow

2. Built-in State Queries

Any service can query running workflows: workflow.query("get_reservation_details")
Enables real-time aggregation: “How much budget is reserved across all active provisions for project X?”

3. Signals for External Events

Can send signals to running workflows: workflow.signal("cancel_provision")
Enables coordination between budget changes and ongoing provisions

4. Workflow-per-Project Pattern

WorkflowID includes project and request ID: provision-project123-request456
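Temporal allows at most one running execution per Workflow ID, so a duplicate start with the same ID does not create a second workflow (whether a completed ID can be reused is governed by the Workflow ID reuse policy)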
Prevents double-provisioning and double-charging

5. Saga Pattern Support

Each activity has a compensating action
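On failure, the workflow code runs the registered compensations in reverse order; Temporal’s durable execution ensures they run to completion even if a worker crashes midway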
Perfect for distributed transactions (reserve → provision → confirm/rollback)

6. Versioning

Can deploy new workflow code without breaking running instances
Gradual rollout of workflow logic changes

7. Observability

Built-in UI shows execution history, current state, stack traces for failures
Crucial for debugging provision failures

The Proposed Architecture

Workflow Design:

WorkflowID: f"provision-{project_id}-{request_id}"

Activities:
1. check_budget(project_id, estimated_cost)
   → Returns available budget amount
   
2. reserve_budget(project_id, estimated_cost, ttl=1h)
   → Returns reservation_id
   → Reservation expires if not confirmed within TTL
   
3. provision_resource(resource_spec)
   → Calls AWS/internal APIs
   → Retries with exponential backoff
   → Can run for minutes/hours
   
4a. On Success:
    confirm_budget(reservation_id, actual_cost)
    → Deducts from budget permanently
    
4b. On Failure:
    release_budget(reservation_id)
    → Compensating action, releases reservation

Key Patterns:

Workflow-per-Project Idempotency:

WorkflowID deterministically includes project_id and request_id
If the same provision request is submitted twice, Temporal detects the duplicate Workflow ID and, depending on the SDK and start options, either rejects the second start or returns a handle to the existing workflow
Prevents accidental double-provisioning
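The deduplication idea can be shown without a Temporal cluster. The sketch below is an in-memory stand-in for the "one running execution per Workflow ID" rule, not the Temporal client API; WorkflowRegistry and its start method are hypothetical names.

```python
def workflow_id(project_id: str, request_id: str) -> str:
    """Deterministic ID: the same request always maps to the same workflow."""
    return f"provision-{project_id}-{request_id}"


class WorkflowRegistry:
    """In-memory stand-in for the orchestrator's uniqueness guarantee."""

    def __init__(self) -> None:
        self._running: dict[str, dict] = {}

    def start(self, project_id: str, request_id: str, spec: dict) -> tuple[str, bool]:
        """Return (workflow_id, started); started is False for a duplicate."""
        wf_id = workflow_id(project_id, request_id)
        if wf_id in self._running:
            # Duplicate submission: reuse the existing execution.
            return wf_id, False
        self._running[wf_id] = spec
        return wf_id, True

    def running_count(self) -> int:
        return len(self._running)


registry = WorkflowRegistry()
first = registry.start("project123", "request456", {"kind": "ec2-asg"})
retry = registry.start("project123", "request456", {"kind": "ec2-asg"})
```

The design choice that matters is that the caller, not the orchestrator, supplies a stable request_id; a retried HTTP request must carry the same ID for the deduplication to work.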

Real-Time Budget Queries:

Other services query: workflow.query("get_current_state") → “PROVISIONING” | “COMPLETED” | “FAILED”
Finance dashboards query: workflow.query("get_reservation_details") → {amount, timestamp, resource_type}
Aggregated across all workflows for a project to show “total reserved budget”

Long-Running Reservations:

Some provisions take hours (e.g., warming up large ML clusters)
Temporal workflows can run for days/weeks without issues
Reservation TTL can be extended via signals if needed

Compensating Actions (Saga Pattern):

If provisioning fails after budget reservation, the workflow calls release_budget() as the compensating action
If provisioning succeeds but confirmation fails, manual intervention workflow is triggered
All compensations logged for audit trail
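The reserve, provision, confirm/release flow can be sketched in plain Python. This is not Temporal code: BudgetLedger is an in-memory stand-in for the budget service, reservation TTLs are omitted, and the check is folded into reserve() so that check and hold happen in one step. All names are hypothetical.

```python
from decimal import Decimal
from typing import Callable


class InsufficientBudget(Exception):
    pass


class BudgetLedger:
    """In-memory stand-in for the budget service (hypothetical API)."""

    def __init__(self, available: Decimal) -> None:
        self.available = available
        self.reserved: dict[str, Decimal] = {}
        self._next_id = 0

    def reserve(self, amount: Decimal) -> str:
        if amount > self.available:
            raise InsufficientBudget(f"need {amount}, have {self.available}")
        self._next_id += 1
        reservation_id = f"res-{self._next_id}"
        self.available -= amount
        self.reserved[reservation_id] = amount
        return reservation_id

    def confirm(self, reservation_id: str, actual_cost: Decimal) -> None:
        held = self.reserved.pop(reservation_id)
        # Return the difference between the hold and the actual cost.
        self.available += held - actual_cost

    def release(self, reservation_id: str) -> None:
        self.available += self.reserved.pop(reservation_id)


def run_provision(
    ledger: BudgetLedger, estimated_cost: Decimal, provision: Callable[[], Decimal]
) -> str:
    """Reserve, provision, then confirm; undo the reservation on failure."""
    compensations: list[Callable[[], None]] = []
    reservation_id = ledger.reserve(estimated_cost)
    compensations.append(lambda: ledger.release(reservation_id))
    try:
        actual_cost = provision()  # returns the actual cost on success
    except Exception:
        # Saga rollback: run compensations in reverse registration order.
        for undo in reversed(compensations):
            undo()
        return "FAILED"
    # A failure here is NOT compensated: the resource exists, so it needs
    # manual intervention rather than a silent release of the budget.
    ledger.confirm(reservation_id, actual_cost)
    return "COMPLETED"
```

In a durable workflow engine the same structure applies, with each ledger call and the provision call running as an activity so that the rollback survives worker restarts.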

Proof of Concept

I built a POC for EC2 Auto Scaling Group provisioning:

Implemented a simple 5-activity workflow
Integrated with mock budget service
Demonstrated failure handling: if provision times out, reservation is automatically released
Showed query pattern: external service queries workflow for reservation details
Provided architecture documentation and operational runbook

Team Adoption:

The team had never heard of Temporal before
After the POC demonstration, engineering leadership approved Temporal for the entitlement platform
Began staffing the implementation team
I provided ongoing architectural guidance

What Would Have Happened Without This: The team would have built custom orchestration with Lambda + Step Functions, leading to:

6+ months debugging distributed state issues (lost reservations, double-charging, race conditions)
Poor visibility into provision failures
Difficulty implementing compensating actions correctly
No idempotency guarantees, requiring application-level deduplication logic

Impact & Organizational Outcomes

Quantitative Results:

50,000 lines of production code (budget planning system)
30-40 DBT models migrated (of 150+ planned)
4.36 TB data warehouse now has proper domain model
Thousands of cost centers managed through new planning system
Hundreds of budget operations monthly through improved workflows

Architectural Foundation:

Five bounded contexts provide clear separation for future development
Domain model enables changes without rewriting pipelines
Ports & Adapters architecture enables future microservices extraction

Risk Mitigation:

Prevented 6+ months of distributed systems debugging by introducing Temporal
Established testing discipline (integration tests in CI/CD)
Created audit trail for financial compliance

Team Transformation:

Introduced modern practices (DDD, testing, domain modeling) to a team that lacked fundamentals
Established code review standards and design documentation
Trained team on DBT, GraphQL, and distributed systems patterns

Future-Proofing:

GraphQL API enables rapid iteration vs. monolithic REST
Cedar policies provide flexible, auditable authorization
DBT migration sets foundation for 100+ more models
Temporal architecture scales to multi-cloud provisioning

Key Technical Trade-offs

1. Ports & Adapters vs. Simpler Patterns

Decision: Use hexagonal architecture despite team’s limited experience

Reasoning: Without a forcing function, the team would keep producing tightly coupled code. There were no tests, and making the code testable required structural discipline.

Trade-offs:

✅ Enabled comprehensive integration testing
✅ Future-proofs for microservices extraction
❌ Higher learning curve
❌ More initial boilerplate

2. GraphQL vs. REST

Decision: GraphQL for complex hierarchical queries

Reasoning: Planning cycles are deeply nested (Org → Platform → Resource → Monthly). REST would require many round trips or over-fetching.

Trade-offs:

✅ Flexible client queries, role-based data fetching
✅ Real-time subscriptions for collaboration
❌ More complex server implementation
❌ Team unfamiliar with GraphQL

3. DBT Migration Strategy (Incremental vs. Big Bang)

Decision: Incremental migration starting with native platforms

Reasoning: Rewriting all 50 pipelines simultaneously would be too risky. Native platforms (AWS) are easier to model than multi-tenant platforms.

Trade-offs:

✅ Reduces risk, delivers value incrementally
✅ Proves patterns before scaling
❌ Maintains dual systems during transition
❌ Some technical debt persists longer

4. Temporal vs. Step Functions

Decision: Temporal for entitlement workflows

Reasoning: The entitlement flow needs state queries, signals, long-running workflows, and idempotency guarantees. Step Functions was too limiting.

Trade-offs:

✅ Better developer experience and observability
✅ Flexible workflow patterns
❌ Operational overhead (self-hosted)
❌ Team learning curve

Lessons Learned

1. Architecture as a Forcing Function

When teams lack fundamentals, choosing constraining patterns (like Ports & Adapters) forces better practices. The structure guides the team toward maintainable code.

2. Incremental Modernization

Complete rewrites are too risky. The DBT migration’s incremental approach (macros first, then native platforms, then multi-tenant) reduced risk while delivering value.

3. Technology Evangelism Requires POCs

The team had never heard of Temporal. A working proof-of-concept demonstrating actual value was more persuasive than architectural arguments.

4. Domain Modeling Enables Change

Extracting implicit domain concepts (cost_center_hierarchy) from table structures enables future changes without rewriting pipelines. The upfront investment pays dividends.

5. Authorization as First-Class Concern

Using Cedar policy language (vs. hardcoded checks) made authorization auditable, testable, and evolvable. Financial systems require this level of rigor.
