Pinterest – PagerDuty Reliability Overhaul

S: 2023 | Staff Eng. Infra Gov. ~200 PagerDuty pages/month. Paging was email-based: Airflow callbacks sent emails, the PagerDuty email integration ingested them, and regex parsing inferred which pipeline had failed. Failures couldn’t be routed to the owning team: upstream platform data issues paged our team even though we didn’t own them.

T: Replace the paging system with proper failure classification and routing. Used this as a development opportunity for a junior engineer who was entirely new to Airflow and infrastructure work.

A1 – Diagnose: Mapped failure modes across pipelines. Found that most pages were noise: upstream platform data arriving late or malformed was triggering alerts that looked like attribution logic errors.

A2 – Joint Design: Paired with the junior engineer on the design: PagerDuty API integration using incident keys and service orchestration. Walked them through Airflow callback internals, PagerDuty’s API model (incidents, services, escalation policies), and how to use the low-level PD client (Pinterest Airflow < 2.6, no native support).
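Illustrative sketch for the A2 design (not the actual Pinterest code): how an Airflow failure callback's context could be turned into a PagerDuty event with a stable incident key. This assumes the Events API v2, where the incident key is sent as `dedup_key`. The function name `build_trigger_event` and the key format `dag_id.task_id` are hypothetical; the real integration used a low-level PD client whose details are not recorded here.

```python
from types import SimpleNamespace

# Events API v2 endpoint; the sketch only builds the request body and does not send it.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(context, routing_key):
    """Build an Events API v2 'trigger' body from an Airflow failure context.

    `context` is the dict Airflow passes to on_failure_callback; only
    `task_instance`, `exception`, and `run_id` are read here.
    """
    ti = context["task_instance"]
    exc = context.get("exception")
    return {
        "routing_key": routing_key,  # selects the PagerDuty service (owning team)
        "event_action": "trigger",
        # Assumed key format: repeated failures of the same task share one
        # key, so they fold into a single open incident.
        "dedup_key": f"{ti.dag_id}.{ti.task_id}",
        "payload": {
            # Keep the summary bounded so a long traceback cannot bloat it.
            "summary": f"{ti.dag_id}.{ti.task_id} failed: {exc!r}"[:1024],
            "source": "airflow",
            "severity": "error",
            "custom_details": {"run_id": context.get("run_id")},
        },
    }


# Stand-in for the context Airflow would supply (hypothetical DAG/task names).
fake_context = {
    "task_instance": SimpleNamespace(dag_id="attribution_daily", task_id="load_events"),
    "exception": ValueError("bad row"),
    "run_id": "scheduled__2023-01-01",
}
event = build_trigger_event(fake_context, routing_key="EXAMPLE_ROUTING_KEY")
```

In a DAG, a function with the signature `callback(context)` would be passed as `on_failure_callback`, build this body, and POST it to the Events API; a matching `resolve` event with the same `dedup_key` could close the incident on a successful retry.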

A3 – Ownership Transfer: Junior engineer owned the implementation. I reviewed their work and coached them through specific challenges: how to structure the callback handler, how to map pipeline failures to incident keys, and how to test the routing logic. They built the classification logic: upstream platform failures → platform team; attribution logic errors → our team.
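Illustrative sketch of the A3 classification and routing rule (not the shipped code): upstream platform failures go to the platform team, everything else to our team. The exception classes, the `ROUTING_KEYS` values, and the default of routing unrecognized failures to our team are all assumptions made for the example.

```python
class UpstreamDataLate(Exception):
    """Hypothetical: upstream platform data did not arrive in time."""


class UpstreamDataMalformed(Exception):
    """Hypothetical: upstream platform data failed validation."""


# Failure types treated as owned by the platform team (assumed list).
UPSTREAM_ERRORS = (UpstreamDataLate, UpstreamDataMalformed)

# Placeholder routing keys, one per PagerDuty service.
ROUTING_KEYS = {
    "platform": "PLATFORM_TEAM_ROUTING_KEY",
    "attribution": "ATTRIBUTION_TEAM_ROUTING_KEY",
}


def classify_failure(exc):
    """Return the owning team for a failure.

    Upstream data problems belong to the platform team; anything else is
    assumed to be an attribution logic error owned by our team.
    """
    return "platform" if isinstance(exc, UPSTREAM_ERRORS) else "attribution"


def routing_key_for(exc):
    """Map a failure to the routing key of the owning team's service."""
    return ROUTING_KEYS[classify_failure(exc)]
```

Keeping classification as a pure function of the exception makes the routing logic testable without Airflow or PagerDuty: a unit test can raise each failure type and assert which team would be paged.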

A4 – Growth: After the project, the junior engineer became the team’s go-to person for PagerDuty and alerting. They independently owned the next operational tooling improvement without my involvement.

R: ~200 pages/month → 5. Failures route to the owning team. Team stopped being woken for problems they didn’t own. Junior engineer went from zero infra experience to independently owning operational tooling.

TRAITS: Reliability • Mentorship • Leadership (Q 21 28 29)