Introduction
Concurrency control is required to avoid lost updates but also correct recovery
A transaction is a sequence of read and write operations on database objects such as tuples, pages, tables, attributes, or even databases, whose boundaries are set by the client. The acronym ACID is used to define the criteria used to ensure correctness of a database, although some databases do not implement fully acid guarantees.
Atomicity
To ensure that transactions execute all its actions or none there are two approaches:
- Logging: all actions are logged so actions for aborted transactions can be undone. The database maintain undo records both on memory and disk. This is common since logging is also important for audit
- Shadow paging: the database makes copies of pages modified by transactions, and transactions than modify those pages. When a transaction commits the page is made visible. This is typically slower than a logging-based approach, but if you have a single thread there is no need to write a log to disk and recovery is simple, since all the pages for uncommitted transactions will be discarded
Important
In the real world, Runtime performance is almost always more important than recovery performance, so shadow paging is rarely used in practice
Consistency
If a database implement correctly consistency, all queries that the application asks about the data will return logically correct results. There are two types of consistency:
- Database consistency: The database accurately represent the real world entity it is modeling and follow integrity constraints. Future transactions read past transaction results
- Transaction consistency: If a database is consistent before a transaction starts, it will also be consistent after.
Important
Transaction consistency is responsibility of the application. I.e. if you have a wire transfer application, it’s up to you to update the balance on both accounts within a single transaction
Isolation
The DBMS provides transactions the illusion that they are running alone in the system: this is equivalent to a system where transactions are executed one at the time(serial order). However, to achieve better performance, the DBMS has to interleave the operations of concurrent transactions while maintaining the illusion of isolation
Concurrency control protocol
A concurrency control protocol is how the DBMS decide the proper interleaving of operations from multiple transactions. There are two categories:
- Pessimistic: the DBMS assumes conflicts are common and doesn’t let problem arise
- Optimistic: the DBMS assumes conflicts are rare and deal with them when they happen
Conflicts
When two operations for different transaction occur on the same object, a conflict arise. There are three type of conflicts:
- Read-Write conflict: Unrepeatable read. Transaction A reads a value and then it reads it again. The second time it will get a different value because B has updated it.
- Write-Read conflict (Dirty Read). Transaction A writes a value but hasn’t committed yet, and Transaction B reads that value.
- Write-Write conflict (Lost updates): Transaction A writes a value that hasn’t committed yet, and Transaction B overwrites it
Pessimistic concurrency control
The order in which the DBMS executes operations is called an execution schedule. A serial schedule, which doesn’t interleave actions of different transactions would guarantee output correctness, but it would be too slow. Therefore we would like to find a schedule that for any database state produces the same effects as a serial schedule (we call it an equivalent schedule) but maximizes concurrency. We call a serializable schedule a schedule that is equivalent to any serial execution (not one in particular, multiple results can still happen, but they are all correct)
To generate serializable schedules there are practical definitions we can adopt: (1) conflict and (2) view. Neither definition allows all schedules that one would consider serializable. In practice, DBMSs typically implement Two phase locking, a protocol that creates conflict-serializable schedules, while algorithms that use view-serializable schedules have been abandoned because they are too complicated.
Conflict and view serializability
Two schedules are conflict equivalent if they involve the same operations of the same transactions, and every pair of conflicting operations is ordered in the same way in both schedules. A schedule is conflict serializable if it is conflict equivalent to a serializable schedule
Important
For schedules with many transactions, instead of swapping all non-conflicting operations until a serial schedule is formed, the schedule is verified through a dependency graph. Edges are established between pair of operations that conflict, and the schedule is conflict serializable if the graph is acyclic
View serializability is a weaker notion that allows for all schedules that are conflict serializable but also the ones that involve blind writes, i.e. setting a value without reading it first. In practice, although this allow for more schedule, is difficult to ?
Durability
Similar to Atomicity, making all committed transactions persistent after a crash or a restart can be achieved via logging or shadow paging