Short version
Phase 1: Domain Model & Bounded Contexts
I decomposed the monolithic system into five bounded contexts:
- Budget Planning - Multi-step approval workflows, hierarchical projections
- Cost Attribution - Core domain entities (cost_center, usage_record, cost_attribution_record)
- Cost Aggregation - Separated from pipeline logic
- Advanced Forecasting - Marimo notebooks
- Efficiency Monitoring
Phase 2: Budget Planning System (50k LOC)
I rebuilt the budget planning system from scratch using:
- DDD + Ports & Adapters for testability and clear boundaries
- GraphQL API (10 mutations, complex queries) for flexible planning cycle operations
- Cedar policy language for fine-grained authorization (even built a Neovim plugin for it)
- Automated integration testing with Docker Compose in Bazel CI/CD (solved complex symlink mounting issues)
The planning workflow supports:
- Multi-org, multi-platform hierarchical budgeting
- 6-stage approval flow with role-based locking
- Growth rate modeling + manual overrides per resource type
- Real-time changelog tracking via GraphQL subscriptions
Phase 3: Data Pipeline Modernization
I led the migration from duplicated Python/SQL to DBT:
- Created reusable macros for common transformations (data cleanup, regex patterns)
- Completed native platform attribution models (30-40 models)
- Designed the path to 150+ models covering all multi-tenant platforms
- Separated cost center hierarchy management from hard-coded pipeline logic
The Temporal Decision
For the upcoming entitlement platform, I evaluated three orchestration options:
- Spinner / Airflow: high submission latency
- AWS Step Functions: lack of testability, logic in JSON
- Temporal: lower latency, queries and signals, workflow-per-project pattern for deduplication
Proposed Architecture:
- Each Nimbus project = Durable workflow
- Budget check → Reserve → Provision → Confirm/Rollback
- Use Temporal queries for real-time budget state, and signals for external events such as cancellation
- Use deterministic workflow IDs to avoid duplicate provision requests
- Built POC for EC2 ASG provisioning to validate approach
Re-architecting Infrastructure Cost Governance at Pinterest Scale
The Problem
When I joined Pinterest’s Infrastructure Governance team, I found a system that had organically grown over 2 years into 80+ denormalized tables with 4.36TB of cost data, 50 tightly-coupled Python pipelines with 90% code duplication, no domain model, and a budget management tool built as a monolithic Flask app with no tests or type safety. The team was being asked to build a new entitlement platform for budget-aware provisioning, but they lacked the architectural foundation to do so safely.
The System Context:
- Scale: ~10GB/day of cost data from AWS Cost Explorer and Pinterest’s multi-tenant platforms (Moka for data processing, Pincompute for compute)
- Complexity: Thousands of cost centers (Nimbus projects, LDAP groups) with hundreds of budget transfer requests monthly
- Data Warehouse: 4.36 TB accumulated over 2 years with no logical domain model
Specific Technical Debt:
- The 50 data pipelines used string .replace() for SQL templating instead of Jinja - the team had actually overridden Airflow’s SparkSQL operator to bypass proper templating
- Each pipeline contained 5-10 CTEs doing similar transformations, with 90% code duplication across pipelines
- Cost center aggregation logic was hard-wired into the structure of tables and the sequence of pipeline execution - changing how costs rolled up required modifying multiple pipelines and table schemas
- The budget transfer tool was a Flask/SQLAlchemy monolith with no separation between REST handlers, business logic, and database access
The New Challenge: The adjacent team needed to build an “Entitlement” platform - ensuring budget availability before provisioning resources. Without architectural intervention, they would have implemented this as ad-hoc distributed RPC calls between services, inevitably leading to race conditions, duplicate reservations, and lost budget updates.
My Approach
Phase 1: Domain Model & Bounded Contexts
The first challenge was conceptual. The system had grown organically with no domain model - just tables and pipelines. There wasn’t even a budget planning capability; only budget transfers existed. When planning was later needed, someone decided to build it as a feature of the transfers tool, creating fundamental architectural coupling.
I decomposed the monolithic system into five bounded contexts with clear boundaries:
1. Budget Planning (New - What I Built)
This was the core system I implemented from scratch over 5 months (50k lines of code):
The Planning Hierarchy:
- Organization → Platform (EC2, S3, Moka, Pincompute) → Resource Type (m5.large, s3-standard, etc.) → Monthly projections
The Multi-Step Approval Workflow:
- New: Planning cycle created, initializes from previous year’s actual spending
- Spend Captain Review: Organization-level review, can set platform-level targets and locks
- Platform Owner Review: Platform teams review their allocations
- Service Owner Review: Individual project owners adjust their forecasts
- Final Review: Finance reviews consolidated plan
- Approved: Plan becomes the official budget
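The six stages above can be sketched as a small state machine. This is a minimal illustration, not the production code: it assumes the flow is strictly linear (no send-back to an earlier stage), and the `Stage`, `advance`, and `InvalidTransition` names are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    # Stage names follow the workflow list; the string values are assumed.
    NEW = "New"
    SPEND_CAPTAIN_REVIEW = "SpendCaptainReview"
    PLATFORM_OWNER_REVIEW = "PlatformOwnerReview"
    SERVICE_OWNER_REVIEW = "ServiceOwnerReview"
    FINAL_REVIEW = "FinalReview"
    APPROVED = "Approved"


# Enum members iterate in definition order, which is the approval order.
ORDER = list(Stage)


class InvalidTransition(Exception):
    """Raised when a requested stage change skips or reverses the flow."""


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if it is the immediate successor of current, else raise."""
    idx = ORDER.index(current)
    if idx + 1 < len(ORDER) and ORDER[idx + 1] is target:
        return target
    raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
```

Keeping the transition rule in one pure function means it can be unit-tested without a database or API layer.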
Key Planning Operations:
- Initialize projections by querying the 4.36 TB data warehouse for last year’s actuals
- Apply growth rates (percentage increase month-over-month) or manual overrides per resource type
- Lock/unlock by role: when a Spend Captain locks the EC2 platform, Service Owners cannot modify EC2 projections
- Maintain a changelog for every modification for audit compliance
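The growth-rate and override operations can be illustrated with a short sketch. It assumes a compounding month-over-month rate and that a manual override replaces that month's value and becomes the base for later months; both are assumptions, and `project_months` is a hypothetical name.

```python
from typing import Dict, List, Optional


def project_months(
    baseline: float,
    monthly_growth: float,
    overrides: Optional[Dict[int, float]] = None,
    months: int = 12,
) -> List[float]:
    """Project monthly spend from a baseline with compounding growth.

    overrides maps a 1-based month number to a manually entered value.
    """
    overrides = overrides or {}
    projections = []
    current = baseline
    for month in range(1, months + 1):
        current = current * (1 + monthly_growth)
        if month in overrides:
            # Assumed semantics: the override wins and later months build on it.
            current = overrides[month]
        projections.append(round(current, 2))
    return projections
```

For example, `project_months(100.0, 0.10, months=3)` compounds a 10% monthly increase over three months, and passing `overrides={2: 200.0}` pins the second month to 200.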
2. Cost Attribution (Domain Model Extraction)
I introduced explicit domain entities that had been implicit across the 80+ denormalized tables:
- cost_center - The fundamental unit (a Nimbus project or LDAP group)
- cost_center_hierarchy - Parent-child relationships for rolling up costs
- usage_record - Raw usage from platforms (EC2 hours, S3 GB, Moka compute units)
- cost_attribution_record - Final attributed costs after allocation logic
This domain model now serves as the foundation for all cost operations, replacing pipeline-specific logic.
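As a rough picture of these entities, the sketch below expresses them as immutable Python dataclasses. The entity names come from the list above; every field name and type is an assumption made for illustration, since the actual schemas are not shown here.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class CostCenter:
    id: str
    kind: str  # assumed values: "nimbus_project" or "ldap_group"


@dataclass(frozen=True)
class CostCenterHierarchy:
    child_id: str
    parent_id: Optional[str]  # None marks a root of the hierarchy (assumed)


@dataclass(frozen=True)
class UsageRecord:
    cost_center_id: str
    platform: str  # e.g. "EC2", "S3", "Moka"
    usage_date: date
    quantity: Decimal
    unit: str  # e.g. "hours", "GB", "compute_units"


@dataclass(frozen=True)
class CostAttributionRecord:
    cost_center_id: str
    platform: str
    usage_date: date
    attributed_cost: Decimal  # cost after allocation logic has been applied
```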
3. Cost Aggregation (Separated from Pipelines)
Previously, aggregation logic was embedded in the sequence of pipeline execution. I extracted this into a separate bounded context responsible for:
- Aggregating costs up the hierarchy (project → team → org)
- Computing platform-level summaries
- Generating targets vs. actuals comparisons for budget tracking
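Rolling costs up the hierarchy can be sketched as a walk from each cost center to its ancestors. This is an illustrative in-memory version with hypothetical names (`roll_up`, `parent_of`); in the warehouse the same idea would live in SQL models.

```python
from collections import defaultdict
from typing import Dict


def roll_up(direct_costs: Dict[str, float], parent_of: Dict[str, str]) -> Dict[str, float]:
    """Add each node's direct cost to itself and to every ancestor.

    parent_of maps a child cost center to its parent; roots have no entry.
    """
    totals: Dict[str, float] = defaultdict(float)
    for node, cost in direct_costs.items():
        current = node
        seen = set()  # guards against accidental cycles in the hierarchy
        while current is not None and current not in seen:
            seen.add(current)
            totals[current] += cost
            current = parent_of.get(current)
    return dict(totals)
```

Because the hierarchy is passed in as data, a reorganization only changes `parent_of`; the roll-up logic itself stays untouched.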
4. Advanced Budget Forecasting
Built using Marimo notebooks for interactive financial modeling:
- Trend analysis on historical spending patterns
- Scenario modeling (what-if analysis for different growth assumptions)
- Anomaly detection to flag unusual spending
5. Efficiency Monitoring
Cross-platform efficiency metrics:
- Cost per transaction or request
- Resource utilization rates
- Waste detection (idle or underutilized resources)
Future Work Identified: I recognized that rate management (Pinterest’s blended rates for native and multi-tenant resources) should be a separate microservice, especially since the Entitlement platform would need rate information for budget checks. This represents architectural debt we’re working to address.
Phase 2: Budget Planning System Architecture (50k LOC)
I rebuilt the budget planning system from scratch, writing approximately 50,000 lines of production code in 5 months.
Architectural Pattern: Ports & Adapters (Hexagonal Architecture)
Why this pattern: The team lacked software engineering fundamentals. Ports & Adapters forces structural discipline - a pure domain core with no external dependencies, surrounded by adapters for infrastructure concerns. This was critical for:
- Testability: Can test business logic without databases or external APIs
- Future flexibility: Can extract components into separate microservices later
- Forcing function: Team couldn’t fall back to tightly-coupled patterns
The Layers:
- Domain Layer: Pure business logic
  - Planning cycle state machine (enforces valid state transitions)
  - Approval workflow rules (who can approve at each stage)
  - Resource projection calculations (growth rates, overrides)
  - Authorization policies (implemented via Cedar language)
- Application Layer: Use case orchestration
  - “Create Planning Cycle” use case queries data warehouse, creates domain objects
  - “Approve Stage” use case validates approver permissions, transitions state
- Infrastructure Layer: Adapters for external systems
  - PostgreSQL for persistence
  - Data warehouse for historical actuals
  - Nimbus API for project metadata
  - HR system API for LDAP group information
  - Cedar policy engine for authorization
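A minimal sketch of how these layers fit together, assuming a single port for historical spend. The names `HistoricalSpendPort`, `InMemoryHistoricalSpend`, and `create_planning_cycle` are hypothetical; the point is only that the use case depends on an interface, so a test can swap the data warehouse adapter for an in-memory one.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


class HistoricalSpendPort(Protocol):
    """Port: what the application layer needs from a historical-actuals source."""

    def last_year_spend(self, project_id: str, platform: str) -> float: ...


@dataclass
class PlanningCycle:
    """Domain object with no infrastructure dependencies."""

    org_id: str
    status: str
    baselines: Dict[str, float]


class InMemoryHistoricalSpend:
    """Test adapter standing in for the data warehouse adapter."""

    def __init__(self, data: Dict[Tuple[str, str], float]) -> None:
        self._data = data

    def last_year_spend(self, project_id: str, platform: str) -> float:
        return self._data.get((project_id, platform), 0.0)


def create_planning_cycle(
    org_id: str, projects: List[str], platform: str, spend: HistoricalSpendPort
) -> PlanningCycle:
    """Use case: initialize a cycle from last year's actuals via the port."""
    baselines = {p: spend.last_year_spend(p, platform) for p in projects}
    return PlanningCycle(org_id=org_id, status="New", baselines=baselines)
```

In a test, `create_planning_cycle("org-1", ["proj-a"], "EC2", InMemoryHistoricalSpend({("proj-a", "EC2"): 1200.0}))` exercises the use case without any warehouse connection.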
GraphQL API Design
Why GraphQL over REST:
- Hierarchical queries: A planning cycle is inherently hierarchical (Org → Platform → Resource → Monthly entries). GraphQL lets clients request the exact depth they need.
- Role-based data fetching: A Service Owner sees only their projects; a Spend Captain sees the entire org. GraphQL resolvers handle this naturally.
- Real-time updates: GraphQL subscriptions enable live updates during collaborative planning sessions.
The API Surface:
- 10 Mutations: Create cycle, update projections, apply growth rates, lock/unlock platforms, approve stage, bulk operations, etc.
- Complex Queries:
  - Planning cycle with nested hierarchy and filtering
  - Changelog with date range and user filters
  - Summary aggregations (total by platform, variance from targets)
Example Query Pattern:
query PlanningCycleDetails($orgId: ID!, $cycleId: ID!) {
  planningCycle(orgId: $orgId, id: $cycleId) {
    status
    targetsByPlatform { platform amount }
    entries(platform: EC2) {
      project { id name }
      resources {
        resourceType
        historicalSpend
        plannedSpend
        variance
      }
    }
  }
}

Authorization with Cedar Policy Language
Why Cedar: Cedar is Amazon’s open-source policy language - expressive, auditable, and supports complex hierarchical permissions.
The Challenge:
- Different roles have different capabilities at different workflow stages
- Locks cascade (platform lock prevents resource-level changes)
- Policies must be version-controlled and testable
Example Policy:
permit(
  principal in Role::"SpendCaptain",
  action == Action::"Lock",
  resource in Platform::"EC2"
) when {
  context.planning_cycle.status == "SpendCaptainReview"
};
Implementation Detail: The Cedar policies are embedded within the infra-budget-tool service (not a separate service). I also built a Neovim plugin for the Cedar TreeSitter parser to improve the developer experience when writing policies.
Key Technical Challenges Solved
1. Initializing from Historical Data
- Challenge: Query 4.36 TB warehouse efficiently to get last year’s spending by project, platform, and resource type
- Solution: Materialized views for common aggregations, incremental refresh patterns
- Edge Case: New projects with no history default to org-level averages for their platform
2. Locking Semantics
- Challenge: Implement cascading locks (platform lock prevents resource changes) with role-specific unlock capabilities
- Solution: Domain model tracks locks as first-class entities with lock scope and owner role
- Complexity: When Spend Captain locks EC2, Service Owners see read-only view but Platform Owner can still unlock
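The cascading-lock rule can be sketched as below. The `Lock` shape (scope plus owner role) follows the description above, but the `UNLOCKERS` table and the rule that a lock blocks only roles that cannot unlock it are assumptions made for illustration, not the production policy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Lock:
    platform: str
    resource_type: Optional[str]  # None means the lock covers the whole platform
    owner_role: str


# Hypothetical rule table: which roles may remove a lock placed by a given role.
UNLOCKERS = {
    "SpendCaptain": {"SpendCaptain", "PlatformOwner"},
    "PlatformOwner": {"PlatformOwner"},
}


def covers(lock: Lock, platform: str, resource_type: str) -> bool:
    """A platform-level lock cascades to every resource type under it."""
    return lock.platform == platform and lock.resource_type in (None, resource_type)


def can_unlock(role: str, lock: Lock) -> bool:
    return role in UNLOCKERS.get(lock.owner_role, {lock.owner_role})


def can_edit(role: str, platform: str, resource_type: str, locks: Iterable[Lock]) -> bool:
    """Assumed rule: a covering lock blocks every role that could not unlock it."""
    return all(can_unlock(role, lock) for lock in locks if covers(lock, platform, resource_type))
```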
3. Concurrent Modifications
- Challenge: Multiple users editing the same planning cycle simultaneously
- Solution: Optimistic locking with version numbers, GraphQL subscriptions for real-time conflict detection
- UX Impact: Users see others’ changes in real-time and can coordinate via lock announcements
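Optimistic locking with version numbers can be sketched as a compare-then-increment on each entry. This in-memory version uses hypothetical names (`ProjectionEntry`, `update_planned_spend`); against PostgreSQL the same check would typically be an UPDATE whose WHERE clause includes the expected version.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the entry changed after the caller last read it."""


@dataclass
class ProjectionEntry:
    planned_spend: float
    version: int = 0


def update_planned_spend(entry: ProjectionEntry, new_value: float, expected_version: int) -> int:
    """Apply the edit only if nobody else modified the entry in between."""
    if entry.version != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {entry.version}"
        )
    entry.planned_spend = new_value
    entry.version += 1
    return entry.version
```

A client that receives `VersionConflict` would reload the entry, show the other user's change, and let the user retry.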
4. Audit Trail / Changelog
- Challenge: Track every modification (who, when, what changed, why) for financial compliance
- Solution: Event-sourcing lite - append-only log, queryable via GraphQL with rich filtering
- Scale: Changelog grows with every modification but remains queryable via indexed fields
Automated Integration Testing
The Testing Challenge: The budget planning system has complex dependencies: PostgreSQL database, data warehouse connections, external APIs (Nimbus, HR system). The team had no integration tests.
The Solution: I implemented Docker Compose-based integration testing in Bazel CI/CD.
The Technical Hurdle: Bazel runs tests in a sandbox where the test files are symlinks into Bazel’s output tree. When that directory is bind-mounted into a container, the symlinks dangle because their targets lie outside the mounted path.
My Workaround:
- Read files from BAZEL_TEST_DIR (which contains symlinks)
- Copy contents to a temporary directory (real files)
- Mount the temp directory in Docker Compose
- Clean up after test completion
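The copy-and-clean-up steps above can be sketched with the standard library. `materialized_copy` is a hypothetical helper name; the sketch relies on `shutil.copytree(..., symlinks=False)`, which copies the files that symlinks point to rather than the links themselves.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def materialized_copy(src_dir: str) -> Iterator[str]:
    """Yield a temp directory holding real-file copies of everything under src_dir."""
    tmp_root = tempfile.mkdtemp(prefix="compose-fixtures-")
    dest = os.path.join(tmp_root, "files")
    try:
        # symlinks=False follows each link and copies the target's contents,
        # so the result contains no links pointing outside the directory.
        shutil.copytree(src_dir, dest, symlinks=False)
        yield dest
    finally:
        # Clean up after the test, even if it failed.
        shutil.rmtree(tmp_root, ignore_errors=True)
```

A test harness could wrap its Docker Compose invocation in `with materialized_copy(os.environ["BAZEL_TEST_DIR"]) as fixtures:` and mount `fixtures` as the volume source.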
Impact: Every pull request now runs full integration tests (database + API + external mocks) in ~3 minutes.
Phase 3: Data Pipeline Modernization with DBT
The existing 50 data pipelines represented extreme technical debt. I led the migration to DBT to introduce modularity and reusability.
The Core Problem:
- 90% code duplication: the same transformations (cleaning project names, parsing LDAP groups, applying allocation logic) repeated across pipelines
- No templating: string .replace() instead of Jinja
- No lineage: impossible to trace which downstream tables depend on which upstream data
- No testing: SQL was embedded in Python strings with no validation
The Migration Strategy:
Phase 1: Reusable DBT Macros (Completed)
I identified repeated patterns and created macros:
- normalize_project_name(): CASE statements to standardize Nimbus project naming conventions
- parse_ldap_group(): Regex patterns to extract group hierarchies from LDAP strings
- coalesce_cost_center(): Fallback logic when multiple cost center identifiers exist
- allocate_shared_cost(): Proportional allocation algorithms for shared resources
- apply_blended_rates(): Pinterest’s custom rate calculations for multi-tenant platforms
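To make the proportional allocation idea concrete, here is a small Python sketch. The real allocate_shared_cost() is a DBT macro written in SQL/Jinja and is not reproduced here; this function only illustrates the arithmetic, and its signature and zero-usage behavior are assumptions.

```python
from typing import Dict


def allocate_shared_cost(shared_cost: float, usage_by_center: Dict[str, float]) -> Dict[str, float]:
    """Split a shared cost across cost centers in proportion to their usage."""
    total_usage = sum(usage_by_center.values())
    if total_usage <= 0:
        # Assumed behavior: refuse to allocate when there is no usage signal.
        raise ValueError("cannot allocate a shared cost without any usage")
    return {
        center: shared_cost * usage / total_usage
        for center, usage in usage_by_center.items()
    }
```

For instance, a shared cost of 100 split over usage of 1 and 3 units assigns one quarter to the first cost center and three quarters to the second.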
Phase 2: Native Platform Models (30-40 models completed)
For AWS services where we don’t need platform team input:
- EC2 instance attribution by project (using tags and instance metadata)
- S3 storage attribution by bucket ownership
- RDS database costs by instance metadata
- CloudFront distribution costs
These models replace approximately 15 of the original 50 pipelines, with full lineage tracking and testing.
Phase 3: Multi-Tenant Platform Models (In Progress, 100+ models planned)
For Pinterest’s internal platforms (Moka, Pincompute), migration is more complex because:
- Usage data comes from platform-specific APIs with different schemas
- Attribution logic varies by platform (Moka uses job-based attribution, Pincompute uses task-based)
- Resource hierarchies are platform-defined
Architectural Insight: The original pipeline architecture hard-coded cost center aggregation into table structures and pipeline sequencing. By modeling cost_center_hierarchy explicitly as a DBT dimension table, we can now change aggregation rules (e.g., reorganizations, team moves) without rewriting pipelines - just update the hierarchy table and re-run.
Current Status:
- 30-40 DBT models in production
- Remaining 100+ models designed and prioritized
- Team trained on DBT development patterns
- CI/CD pipeline validates DBT models on every commit
Phase 4: Temporal for the Entitlement Platform
The adjacent team was tasked with building the “Entitlement” platform: ensuring budget availability before provisioning resources. I evaluated orchestration options and recommended Temporal.
The Requirement: For every resource provision request (e.g., create EC2 Auto Scaling Group):
- Check: Does the project have available budget?
- Reserve: Temporarily hold the estimated cost (optimistic locking)
- Provision: Call AWS/internal APIs to create the resource
- Confirm or Rollback:
  - Success: Deduct actual cost from budget
  - Failure: Release the reservation
This is a distributed transaction across budget and provisioning systems with partial failure scenarios.
Option 1: Airflow (Spinner)
Pinterest uses Airflow heavily via a custom platform called “Spinner,” integrated with Moka and other internal systems.
Why it seemed reasonable:
- Familiar to the team
- Existing infrastructure and operational expertise
- Good integration with Pinterest’s data platforms
Why it doesn’t work:
- High submission latency: Scheduling a DAG takes minutes, unacceptable for synchronous provision requests
- No real-time state queries: Cannot query “What’s the current reservation amount for project X?”
- Immutable DAGs: Cannot modify workflow logic for running tasks
- Wrong abstraction: Airflow is designed for batch ETL, not transactional workflows
Verdict: ❌ Wrong tool for the job
Option 2: AWS Step Functions
Why it seemed reasonable:
- Managed service, no operational overhead
- Native AWS integration
- Built-in error handling and retries
Why it’s limiting:
- Tight coupling to AWS: Vendor lock-in, blocks multi-cloud strategy
- Poor observability: Limited ability to query workflow state programmatically
- Expensive for long-running workflows: Charged per state transition; some provisions take hours
- JSON state machines: Hard to version control, test, and maintain
- Limited advanced patterns: No built-in equivalents of Temporal’s signals and queries; callbacks and nested executions exist but are more limited for parent-child workflow coordination
Verdict: ❌ Too restrictive for complex orchestration
Option 3: Temporal (Recommended)
Why Temporal is the right choice:
1. Low Latency
- Workflows start in milliseconds, suitable for synchronous provision requests
- No scheduling overhead like Airflow
2. Built-in State Queries
- Any service can query running workflows: workflow.query("get_reservation_details")
- Enables real-time aggregation: “How much budget is reserved across all active provisions for project X?”
3. Signals for External Events
- Can send signals to running workflows: workflow.signal("cancel_provision")
- Enables coordination between budget changes and ongoing provisions
4. Workflow-per-Project Pattern
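- WorkflowID includes project and request ID: provision-project123-request456
- Temporal allows only one open workflow execution per Workflow ID, so a duplicate start request with the same ID is rejected (or attached to the existing run, depending on the configured ID policy) instead of launching a second workflow
- Prevents double-provisioning and double-charging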
5. Saga Pattern Support
- Each activity has a compensating action
- On failure, the workflow code runs the registered compensations in reverse order; Temporal’s durable execution ensures they still run after worker crashes or restarts
- Perfect for distributed transactions (reserve → provision → confirm/rollback)
6. Versioning
- Can deploy new workflow code without breaking running instances
- Gradual rollout of workflow logic changes
7. Observability
- Built-in UI shows execution history, current state, stack traces for failures
- Crucial for debugging provision failures
The Proposed Architecture
Workflow Design:
WorkflowID: f"provision-{project_id}-{request_id}"
Activities:
1. check_budget(project_id, estimated_cost)
   → Returns available budget amount
2. reserve_budget(project_id, estimated_cost, ttl=1h)
   → Returns reservation_id
   → Reservation expires if not confirmed within TTL
3. provision_resource(resource_spec)
   → Calls AWS/internal APIs
   → Retries with exponential backoff
   → Can run for minutes/hours
4a. On Success:
    confirm_budget(reservation_id, actual_cost)
    → Deducts from budget permanently
4b. On Failure:
    release_budget(reservation_id)
    → Compensating action, releases reservation
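The control flow of this design can be sketched in plain Python. This is not Temporal SDK code: in the real workflow each budget call would be an activity with its own retry policy, and the reservation TTL is omitted. `InMemoryBudget`, `InsufficientBudget`, and `provision_with_budget` are hypothetical names standing in for the mock budget service.

```python
from typing import Callable, Dict


class InsufficientBudget(Exception):
    """Raised when the estimated cost exceeds the unreserved budget."""


class InMemoryBudget:
    """Stand-in for the budget service; tracks available funds and open holds."""

    def __init__(self, available: float) -> None:
        self.available = available
        self.reservations: Dict[str, float] = {}
        self._counter = 0

    def check_budget(self) -> float:
        return self.available - sum(self.reservations.values())

    def reserve_budget(self, estimated_cost: float) -> str:
        self._counter += 1
        reservation_id = f"res-{self._counter}"
        self.reservations[reservation_id] = estimated_cost
        return reservation_id

    def confirm_budget(self, reservation_id: str, actual_cost: float) -> None:
        del self.reservations[reservation_id]
        self.available -= actual_cost

    def release_budget(self, reservation_id: str) -> None:
        self.reservations.pop(reservation_id, None)


def provision_with_budget(
    budget: InMemoryBudget, estimated_cost: float, provision_resource: Callable[[], float]
) -> float:
    """Check -> reserve -> provision -> confirm, releasing the hold on failure."""
    if budget.check_budget() < estimated_cost:
        raise InsufficientBudget(f"need {estimated_cost}, have {budget.check_budget()}")
    reservation_id = budget.reserve_budget(estimated_cost)
    try:
        actual_cost = provision_resource()  # returns the actual cost on success
    except Exception:
        budget.release_budget(reservation_id)  # compensating action
        raise
    budget.confirm_budget(reservation_id, actual_cost)
    return actual_cost
```

The compensation is explicit in the `except` branch; a durable workflow engine adds the guarantee that this branch still executes if the process running it crashes midway.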
Key Patterns:
Workflow-per-Project Idempotency:
- WorkflowID deterministically includes project_id and request_id
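- If the same provision request is submitted twice while the first is still running, Temporal detects the duplicate Workflow ID and does not start a second execution; the caller can attach to the existing workflow instead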
- Prevents accidental double-provisioning
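The deduplication idea can be illustrated without the Temporal SDK. In the sketch below, `WorkflowRegistry` is a hypothetical in-memory stand-in for the server-side uniqueness check on open Workflow IDs; only the ID format comes from the design above.

```python
from typing import Dict, Tuple


def workflow_id(project_id: str, request_id: str) -> str:
    """Deterministic ID: the same request always maps to the same workflow."""
    return f"provision-{project_id}-{request_id}"


class WorkflowRegistry:
    """Toy model of 'at most one open workflow per ID'."""

    def __init__(self) -> None:
        self._open: Dict[str, dict] = {}

    def start(self, project_id: str, request_id: str, spec: dict) -> Tuple[str, bool]:
        """Return (workflow_id, started); started is False for a duplicate."""
        wid = workflow_id(project_id, request_id)
        if wid in self._open:
            return wid, False  # duplicate submission: reuse the existing run
        self._open[wid] = spec
        return wid, True
```

Submitting the same `(project_id, request_id)` pair twice yields the same ID both times, with `started` set to `False` on the second call.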
Real-Time Budget Queries:
- Other services query: workflow.query("get_current_state") → “PROVISIONING” | “COMPLETED” | “FAILED”
- Finance dashboards query: workflow.query("get_reservation_details") → {amount, timestamp, resource_type}
- Aggregated across all workflows for a project to show “total reserved budget”
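The aggregation step could look roughly like the sketch below, which sums reservation amounts over query results already collected from each workflow. The `project_id` and `state` keys on each result are assumptions added for illustration; the design above only names amount, timestamp, and resource_type.

```python
from typing import Iterable, Mapping


def total_reserved(query_results: Iterable[Mapping], project_id: str) -> float:
    """Sum the reserved amounts of a project's workflows that are still provisioning."""
    return sum(
        result["amount"]
        for result in query_results
        if result["project_id"] == project_id and result["state"] == "PROVISIONING"
    )
```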
Long-Running Reservations:
- Some provisions take hours (e.g., warming up large ML clusters)
- Temporal workflows can run for days/weeks without issues
- Reservation TTL can be extended via signals if needed
Compensating Actions (Saga Pattern):
- If provisioning fails after budget reservation, the workflow’s compensation step calls release_budget()
- If provisioning succeeds but confirmation fails, manual intervention workflow is triggered
- All compensations logged for audit trail
Proof of Concept
I built a POC for EC2 Auto Scaling Group provisioning:
- Implemented a simple 5-activity workflow
- Integrated with mock budget service
- Demonstrated failure handling: if provision times out, reservation is automatically released
- Showed query pattern: external service queries workflow for reservation details
- Provided architecture documentation and operational runbook
Team Adoption:
- The team had never heard of Temporal before
- After the POC demonstration, engineering leadership approved Temporal for the entitlement platform
- Began staffing the implementation team
- I provided ongoing architectural guidance
What Would Have Happened Without This: The team would have built custom orchestration with Lambda + Step Functions, leading to:
- 6+ months debugging distributed state issues (lost reservations, double-charging, race conditions)
- Poor visibility into provision failures
- Difficulty implementing compensating actions correctly
- No idempotency guarantees, requiring application-level deduplication logic
Impact & Organizational Outcomes
Quantitative Results:
- 50,000 lines of production code (budget planning system)
- 30-40 DBT models migrated (of 150+ planned)
- 4.36 TB data warehouse now has proper domain model
- Thousands of cost centers managed through new planning system
- Hundreds of budget operations monthly through improved workflows
Architectural Foundation:
- Five bounded contexts provide clear separation for future development
- Domain model enables changes without rewriting pipelines
- Ports & Adapters architecture enables future microservices extraction
Risk Mitigation:
- Prevented 6+ months of distributed systems debugging by introducing Temporal
- Established testing discipline (integration tests in CI/CD)
- Created audit trail for financial compliance
Team Transformation:
- Introduced modern practices (DDD, testing, domain modeling) to a team that lacked fundamentals
- Established code review standards and design documentation
- Trained team on DBT, GraphQL, and distributed systems patterns
Future-Proofing:
- GraphQL API enables rapid iteration vs. monolithic REST
- Cedar policies provide flexible, auditable authorization
- DBT migration sets foundation for 100+ more models
- Temporal architecture scales to multi-cloud provisioning
Key Technical Trade-offs
1. Ports & Adapters vs. Simpler Patterns
Decision: Use hexagonal architecture despite team’s limited experience
Reasoning: Without a forcing function, the team would keep producing tightly-coupled code. The existing code had no tests and needed structural discipline.
Trade-offs:
- ✅ Enabled comprehensive integration testing
- ✅ Future-proofs for microservices extraction
- ❌ Higher learning curve
- ❌ More initial boilerplate
2. GraphQL vs. REST
Decision: GraphQL for complex hierarchical queries
Reasoning: Planning cycles are deeply nested (Org → Platform → Resource → Monthly). REST would require many round trips or over-fetching.
Trade-offs:
- ✅ Flexible client queries, role-based data fetching
- ✅ Real-time subscriptions for collaboration
- ❌ More complex server implementation
- ❌ Team unfamiliar with GraphQL
3. DBT Migration Strategy (Incremental vs. Big Bang)
Decision: Incremental migration starting with native platforms
Reasoning: Rewriting all 50 pipelines simultaneously was too risky. Native platforms (AWS) are easier to model than multi-tenant platforms.
Trade-offs:
- ✅ Reduces risk, delivers value incrementally
- ✅ Proves patterns before scaling
- ❌ Maintains dual systems during transition
- ❌ Some technical debt persists longer
4. Temporal vs. Step Functions
Decision: Temporal for entitlement workflows
Reasoning: The entitlement workflows need state queries, signals, long-running executions, and idempotency guarantees. Step Functions was too limiting.
Trade-offs:
- ✅ Better developer experience and observability
- ✅ Flexible workflow patterns
- ❌ Operational overhead (self-hosted)
- ❌ Team learning curve
Lessons Learned
1. Architecture as a Forcing Function
When teams lack fundamentals, choosing constraining patterns (like Ports & Adapters) forces better practices. The structure guides the team toward maintainable code.
2. Incremental Modernization
Complete rewrites are too risky. The DBT migration’s incremental approach (macros first, then native platforms, then multi-tenant) reduced risk while delivering value.
3. Technology Evangelism Requires POCs
The team had never heard of Temporal. A working proof-of-concept demonstrating actual value was more persuasive than architectural arguments.
4. Domain Modeling Enables Change
Extracting implicit domain concepts (cost_center_hierarchy) from table structures enables future changes without rewriting pipelines. The upfront investment pays dividends.
5. Authorization as First-Class Concern
Using Cedar policy language (vs. hardcoded checks) made authorization auditable, testable, and evolvable. Financial systems require this level of rigor.