Claims Modernization – Ed’s Role & Impact
Situation:
- In September 2024, Ed took on the role of the most senior IC in the Fraud department (250+ engineers), primarily to support Claims Modernization.
- The project was mission-critical due to Capital One’s acquisition of Discover, which required implementing several flows that couldn’t be easily handled by the old system.
- To accelerate delivery, 16 teams of contractors were staffed, but the rushed setup led to organizational and technical challenges.
- The first milestone (December 2024) focused on migrating cases ineligible for rebill.
- The second milestone (February 2025) focused on migrating cases eligible for rebill.
- Ed’s primary task was to help the department succeed in delivering the project.
Task:
- Ensure the Claims Modernization project was successfully delivered while mitigating technical and organizational risks.
- Navigate leadership expectations and technical execution to prevent costly failures.
- Support multiple teams in aligning towards a common goal despite varying engineering maturity levels.
Action:
- Mentored engineers directly, reviewed code, and wrote code personally to address immediate technical issues (missing tests, poor code quality, weak abstractions).
- Led cross-team technical reviews, raising engineering quality across the board.
- Identified critical engineering gaps early, including misuse of AWS Step Functions and weak unit testing practices.
- Advocated for Domain-Driven Design (DDD), particularly Ports and Adapters, to ensure architectural scalability and maintainability.
- Recognized that unclear roles and responsibilities were a key blocker: teams were hired while the project was already running, which resulted in insufficiently experienced technical leads.
- Shifted focus to improving organizational clarity, ensuring teams had well-defined responsibilities alongside technical improvements.
- Aligned leadership and engineers by balancing execution speed with the need for sustainable technical choices.
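The Ports and Adapters idea mentioned above can be illustrated with a minimal sketch. Everything named here (Claim, ClaimRepository, InMemoryClaimRepository, ClaimService) is hypothetical and chosen for illustration; it is not the project's actual code. The point it tries to show is that domain logic depends only on an interface (the port), so storage can be swapped or faked in unit tests.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Claim:
    # Hypothetical domain object, not the real claim model.
    claim_id: str
    amount_cents: int
    rebill_eligible: bool


class ClaimRepository(Protocol):
    """Port: what the domain needs from storage, with no storage details."""

    def save(self, claim: Claim) -> None: ...

    def get(self, claim_id: str) -> Claim: ...


class InMemoryClaimRepository:
    """Adapter: one interchangeable implementation of the port (handy in tests)."""

    def __init__(self) -> None:
        self._claims: dict[str, Claim] = {}

    def save(self, claim: Claim) -> None:
        self._claims[claim.claim_id] = claim

    def get(self, claim_id: str) -> Claim:
        return self._claims[claim_id]


class ClaimService:
    """Domain logic depends only on the port, never on a concrete adapter."""

    def __init__(self, repository: ClaimRepository) -> None:
        self._repository = repository

    def open_claim(self, claim_id: str, amount_cents: int, rebill_eligible: bool) -> Claim:
        if amount_cents <= 0:
            raise ValueError("claim amount must be positive")
        claim = Claim(claim_id, amount_cents, rebill_eligible)
        self._repository.save(claim)
        return claim
```

Because ClaimService only sees the port, a database-backed adapter could replace the in-memory one without touching the domain code, which is also what makes the service straightforward to unit test.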
Result:
- First milestone (December) was missed, but the delay had been anticipated thanks to early recognition of engineering maturity gaps.
- Second milestone (February) was successfully deployed in production.
- The new system eliminated stuck cases, reducing manual agent workload and creating a 2 bps efficiency improvement (on 34M/year).
Technical Complexity & Architectural Decisions (For Datadog)
- The legacy system was a monolithic Java application, which was difficult to modify and scale. Migrating to a microservices architecture required incremental adoption, with a hybrid model routing traffic between the old and new systems.
- The project involved orchestrating complex decisioning workflows, where fraud claims had multiple fallback paths and failure recovery mechanisms. However, some of the interactions were asynchronous and did not follow best practices for asynchronous message passing (e.g. retry after timeout).
- The Step Functions-based orchestration was suboptimal, as it tightly coupled workflow logic with implementation details. Ed pushed for Temporal, which allowed better separation of concerns and improved visibility into long-running workflows.
- A major challenge was distributed transaction management. Ed introduced the Saga pattern to ensure consistency without relying on expensive distributed locks.
- The data layer was redesigned with CQRS (Command Query Responsibility Segregation) to separate transactional writes from analytical queries, improving both performance and maintainability.
- Several microservices were initially built as CRUD services, but event sourcing was a more appropriate model. Ed worked to transition key services toward an event-sourcing approach to better capture business processes.
- There was a CRUD microservice without clear domain boundaries, which became a bottleneck. Ed unbundled this service into three distinct microservices:
  - Claim Lifecycle Management (handling state transitions)
  - Policy Evaluation (applying fraud detection logic)
  - Decision Fulfillment (executing final claim resolutions)
- Observability was critical: tracing and logging improvements were implemented to track claims across the legacy and modernized system, reducing debugging time and improving incident resolution speed.
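The "retry after timeout" practice mentioned in the list above can be sketched as follows. This is a minimal illustration, not the project's code: call_with_retry and its parameters are hypothetical names, and the sketch assumes the downstream call signals a timeout by raising Python's built-in TimeoutError.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` when it times out, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base, 2x base, 4x base, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Retrying is only safe when the retried operation is idempotent, since a timed-out request may still have been processed by the receiver. The injectable sleep parameter is there so the backoff can be tested without real waiting.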
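The Saga pattern referred to in the list above can be sketched in a few lines. This is an illustrative, in-process version under stated assumptions: SagaStep and run_saga are hypothetical names, and a real deployment would run each step as a call to a separate service, typically under a workflow engine, rather than as local functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the forward operation
    compensation: Callable[[], None]  # undoes the action if a later step fails


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Compensate only the steps that actually finished, newest first.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True
```

Each step commits locally and registers how to undo itself, so no lock has to be held across services. The trade-off is that other readers can observe intermediate states until the saga either finishes or compensates.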
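To make the contrast between CRUD and event sourcing in the list above concrete, here is a minimal sketch. The event kinds, states, and names (ClaimEvent, TRANSITIONS, replay) are assumptions made for illustration and do not reflect the real claim model. Instead of overwriting a status column, the service appends events and derives the current state by replaying them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimEvent:
    claim_id: str
    kind: str  # hypothetical kinds: "opened", "approved", "denied"


# Allowed transitions: (current state, event kind) -> next state.
TRANSITIONS: dict[tuple[Optional[str], str], str] = {
    (None, "opened"): "OPEN",
    ("OPEN", "approved"): "APPROVED",
    ("OPEN", "denied"): "DENIED",
}


def replay(events: list[ClaimEvent]) -> Optional[str]:
    """Derive the current claim state by folding over its event history."""
    state: Optional[str] = None
    for event in events:
        key = (state, event.kind)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event.kind!r} is not valid in state {state!r}")
        state = TRANSITIONS[key]
    return state
```

The append-only history doubles as an audit trail of how a claim reached its state, which is the business-process information a plain CRUD update discards.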
Lessons Learned:
- Technical improvements alone don’t drive success—organizational clarity is equally critical. Aligning team responsibilities with technical execution ensures faster, more effective delivery.
- Large-scale projects with newly assembled teams need stronger upfront leadership definition. The lack of experienced technical leads created ambiguity, delaying decision-making and execution.
- Balancing execution with long-term architecture is crucial—rushing delivery without scalable design leads to complexity, while over-architecting can slow progress.
- Early recognition of systemic technical debt helped mitigate risks, reinforcing the importance of integrating risk assessment into project roadmaps.
- Defining clear technical leadership roles from the start prevents misalignment and reduces inefficiencies in execution.