Aurora DSQL builds on the ideas that Amazon developed with Aurora. Traditional Aurora focuses only on optimizing the storage and replication layers of PostgreSQL and keeps the core engine largely monolithic. Aurora DSQL takes a completely disaggregated approach, breaking the database down into independent, horizontally scalable layers:
- Session Management: Handles client connections, authentication, and load balancing.
- Query Processing: Executes SQL queries using the PostgreSQL engine, with each session running in an isolated Firecracker microVM for security and rapid scaling. This allows for scaling the compute resources independently of storage.
- Concurrency Control and Replication: Managed by a distributed adjudicator service for optimistic concurrency control and an internal Journal service for atomic, durable, and replicated writes.
- Storage: Provides efficient data querying and indexing, with the crucial distinction that durability and concurrency control are offloaded to the Journal and adjudicator layers respectively.
Routing layer
The Session Management layer acts as a transaction and session router, similar to PgBouncer.
Journal
The Journal is a distributed log service that AWS has developed over the years; using it for persistence is what allows writes to scale. It is an atomic, distributed, and scalable replication system that ensures durability of committed transactions across multiple AZs or Regions (Brooker, re:Invent). It underpins services like S3, DynamoDB, and Kinesis.
Aurora vs Aurora DSQL
In traditional Aurora, writes are still tied to a primary instance. In Aurora DSQL, the sharded adjudicator and the distributed nature of the Journal allow highly scalable writes and active-active, strongly consistent multi-region deployments.
Writes and concurrency control
Aurora DSQL uses an optimistic concurrency control protocol managed by the distributed adjudicator service. It employs Strong Snapshot Isolation (SSI), which is equivalent to PostgreSQL’s Repeatable Read level but with the added guarantee of strong consistency (linearizability).
Reads operate on a consistent snapshot based on the transaction’s start time, achieved using MVCC at the storage layer, leveraging the AWS Time Sync Service for consistent read snapshots across the distributed storage.
Important
This differs from PostgreSQL’s MVCC implementation.
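The snapshot read described above can be sketched with a toy multi-version store. This is only an illustration of the general MVCC idea, not DSQL's storage engine: the `VersionedStore` class, its method names, and the integer timestamps (standing in for synchronized clock readings) are all assumptions made for the example.

```python
class VersionedStore:
    """Toy multi-version store (hypothetical, for illustration only).

    Each key maps to a list of (commit_ts, value) pairs in commit order.
    """

    def __init__(self):
        self.versions = {}

    def apply(self, key, commit_ts, value):
        # Assumes writes to a key are applied in increasing timestamp order.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, start_ts):
        # Return the newest value committed at or before the snapshot
        # timestamp; later versions are invisible to this transaction.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= start_ts:
                return value
        return None


store = VersionedStore()
store.apply("balance", 10, 100)
store.apply("balance", 20, 80)

print(store.read("balance", 15))  # snapshot at 15 sees the version from ts 10
print(store.read("balance", 25))  # snapshot at 25 sees the version from ts 20
```

A transaction that started at timestamp 15 keeps reading the value 100 even after the write at timestamp 20 lands, which is why readers never block writers in this model.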
Writes are buffered locally within the query processor until commit time (re:Invent). Only when the transaction commits (optimistic concurrency control!) does the adjudicator service check for write-write conflicts with concurrent transactions.
write-write conflicts only
If you have an MVCC architecture, you don’t need to manage read-write or write-read conflicts
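The commit flow above (buffer writes locally, check only write-write conflicts at commit) can be sketched as follows. This is a minimal single-process model under stated assumptions: `Database`, `Transaction`, `ConflictError`, and the logical counter used as a clock are hypothetical names for the example, and reads are left out entirely.

```python
class ConflictError(Exception):
    """Raised when a concurrent transaction already committed a write to the same key."""


class Database:
    def __init__(self):
        self.clock = 0          # logical clock standing in for synchronized time
        self.data = {}          # key -> latest committed value
        self.last_commit = {}   # key -> commit timestamp of the latest write

    def now(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return Transaction(self, self.now())


class Transaction:
    def __init__(self, db, start_ts):
        self.db = db
        self.start_ts = start_ts
        self.writes = {}        # buffered locally, invisible to others until commit

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if not self.writes:
            return None         # read-only: nothing to check
        # Write-write check: did anyone commit one of our keys after we started?
        for key in self.writes:
            if self.db.last_commit.get(key, 0) > self.start_ts:
                raise ConflictError(key)
        commit_ts = self.db.now()
        for key, value in self.writes.items():
            self.db.data[key] = value
            self.db.last_commit[key] = commit_ts
        return commit_ts


db = Database()
t1, t2 = db.begin(), db.begin()
t1.write("a", 1)
t2.write("a", 2)
t1.commit()                     # first committer wins
try:
    t2.commit()
except ConflictError as exc:
    print("t2 aborted, conflicting key:", exc)
```

The losing transaction is expected to retry from the start, which is the usual cost of optimistic schemes under contention.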
How the adjudicator works
It maintains a mostly in-memory version map covering a limited time window (currently 5 minutes) to detect conflicts efficiently (Brooker, re:Invent). If a transaction spans multiple shards, it uses a variant of two-phase commit. The 5-minute window bounds the amount of version history the adjudicator needs to track for conflict detection, which makes lookups efficient and the adjudicator more scalable, and also simplifies recovery and failover.
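To see why a bounded window is enough, here is a minimal sketch of a single-shard version map with pruning. The `Adjudicator` class, `try_commit`, and the rule that a transaction longer than the window is rejected are assumptions of this sketch; the multi-shard two-phase commit variant is not modelled.

```python
class Adjudicator:
    """Hypothetical single-shard adjudicator with a bounded version map."""

    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self.last_write = {}    # key -> commit timestamp of the latest write

    def try_commit(self, start_ts, commit_ts, write_keys):
        # Assumption: a transaction older than the window cannot be checked
        # against history that has been pruned, so it is rejected outright.
        if commit_ts - start_ts > self.window:
            return False
        self._prune(commit_ts)
        # Conflict if any key was committed after this transaction started.
        for key in write_keys:
            if self.last_write.get(key, float("-inf")) > start_ts:
                return False
        for key in write_keys:
            self.last_write[key] = commit_ts
        return True

    def _prune(self, now):
        # Entries older than the window can never conflict with a transaction
        # that is still allowed to commit, so they are safe to drop.
        cutoff = now - self.window
        self.last_write = {k: ts for k, ts in self.last_write.items() if ts >= cutoff}


adj = Adjudicator(window_seconds=300)
print(adj.try_commit(0, 10, {"a"}))   # True: no earlier write to "a"
print(adj.try_commit(5, 20, {"a"}))   # False: "a" was committed at 10, after start 5
print(adj.try_commit(15, 30, {"a"}))  # True: started after the write at 10
```

Pruning is safe because a dropped entry has a timestamp below `now - window`, while every transaction that passes the age check started at or after that cutoff.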
Read and scalability of SQL execution
Read-only transactions do not require coordination with the adjudicator and can be executed locally with very low latency: the storage layer offers query pushdown, and with MVCC a read-only transaction cannot conflict with any other operation.
Important
Since clients can submit arbitrary SQL, each Postgres instance runs inside a VM to allow secure multi-tenancy. Firecracker microVM technology, developed by Amazon for Lambda, is used to provide fast start-up times.
Durability and multi-region capabilities
DSQL supports active-active multi-region deployments, typically with two active regions and a journal-only witness region to ensure consistency and availability during network partitions (re:Invent). A transaction is considered durable when it is stored across multiple AZs in at least two out of the three regions
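The durability rule can be written as a small predicate. This is a sketch only: the function name, the region and AZ labels, and the reading of "multiple AZs" as "at least two" are assumptions, not documented DSQL behaviour.

```python
def is_durable(acks, min_azs_per_region=2, min_regions=2):
    """Return True if enough regions have persisted the write.

    acks maps a region name to the set of AZs in that region that have
    acknowledged the write. A region counts only once it has stored the
    write in at least min_azs_per_region AZs (assumed meaning of "multiple").
    """
    regions_ok = sum(1 for azs in acks.values() if len(azs) >= min_azs_per_region)
    return regions_ok >= min_regions


# Hypothetical region and AZ names.
acks = {
    "region-a": {"az1", "az2"},
    "region-b": {"az1", "az2", "az3"},
    "witness": set(),
}
print(is_durable(acks))  # True: two of the three regions hold the write
```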
Regional isolation
In case of a regional isolation, the majority side (containing at least two of the three regions) remains available and consistent, and a journal-only witness may need to be promoted to active. The isolated region becomes unavailable until it recovers and catches up with journal replication.
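The majority rule during a partition can be illustrated with a few lines. The function and region names are hypothetical, and witness promotion is not modelled; the sketch only answers which side of a partition keeps serving.

```python
def available_side(all_regions, reachable_groups):
    """Return the group of mutually reachable regions that holds a majority.

    reachable_groups is a list of sets, one per side of the partition.
    Returns None if no side contains a strict majority of all_regions.
    """
    majority = len(all_regions) // 2 + 1
    for group in reachable_groups:
        if len(group) >= majority:
            return set(group)
    return None


regions = {"region-a", "region-b", "witness"}

# region-a is cut off; region-b and the witness can still talk to each other.
print(available_side(regions, [{"region-a"}, {"region-b", "witness"}]))

# Every region is isolated from the others: no side has a majority.
print(available_side(regions, [{"region-a"}, {"region-b"}, {"witness"}]))  # None
```

With three regions a majority is two, so at most one side of any partition can stay available, which is what keeps the two halves from diverging.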
Technology choices
AWS heavily leveraged its internal “AWS Time Sync Service” for highly accurate time synchronization, which simplifies building distributed databases with good performance.
The system is built using Rust for performance and memory safety (re:Invent). Significant investments were made in deterministic simulation testing (using the open-source “Turmoil” library), fuzzing of the SQL surface area, and formal methods (using TLA+ and P) to ensure the correctness and reliability of the distributed system (re:Invent).
Confusing:
- two-phase commit
- what does it mean that most of the transaction execution is not even crossing the AZ?