InnoDB Architecture

InnoDB is the default storage engine for MySQL. Understanding its internals is essential for diagnosing performance problems, tuning production systems, and reasoning about data durability. InnoDB is designed for OLTP (Online Transaction Processing) — high-concurrency workloads of short reads and writes with full ACID (Atomicity, Consistency, Isolation, Durability) guarantees.

Buffer Pool

The buffer pool is InnoDB’s most critical component: an in-memory cache of data and index pages. Nearly every read and write goes through it.

LRU with midpoint insertion. New pages enter the LRU list at 3/8 from the head, not at the head. This prevents a full table scan from evicting frequently accessed hot pages. Pages must survive a configurable window (innodb_old_blocks_time) before being promoted to the young sublist.

Key parameters:

ParameterPurpose
innodb_buffer_pool_sizeTotal buffer pool memory. Typically 70-80% of available RAM. The single most important MySQL tuning parameter.
innodb_buffer_pool_instancesNumber of partitions to reduce mutex contention. Rule of thumb: 1 instance per GB of buffer pool, up to 64.

Dirty pages are pages modified in memory but not yet flushed to disk. InnoDB’s checkpoint mechanism writes dirty pages in the background. If the dirty page ratio (innodb_max_dirty_pages_pct) gets too high, InnoDB forces aggressive flushing, which can cause latency spikes.

Monitoring the buffer pool

SHOW ENGINE INNODB STATUS\G shows buffer pool hit rate, dirty page count, and pages read/created/written. A hit rate below 99% on a warmed-up OLTP system usually means innodb_buffer_pool_size is too small.

Redo Log (WAL)

Write-Ahead Logging for crash recovery. All modifications are written to the redo log before data pages are modified on disk.

The redo log is a circular buffer. MySQL 8.0.30+ uses innodb_redo_log_capacity (replacing the old innodb_log_file_size and innodb_log_files_in_group pair).

The durability knobinnodb_flush_log_at_trx_commit:

ValueBehaviorRisk
1 (default)Flush to disk on every commitNone — full ACID compliance
2Flush to OS cache on commit, disk flush every ~1 secondUp to 1 second of data loss on OS crash
0Flush every ~1 second regardless of commitsUp to 1 second of data loss on any crash

Setting 2 is common in production

Many teams use innodb_flush_log_at_trx_commit = 2 for write-heavy workloads where replicas provide redundancy. Understand the tradeoff: you trade up to 1 second of committed transactions for significantly higher throughput.

Comparison with Postgres WAL

Both engines use write-ahead logging for crash recovery. The key difference: Postgres WAL doubles as the replication stream (streaming replication reads WAL directly). MySQL separates concerns — the redo log is InnoDB-internal, while the binlog (a server-level construct) handles replication. See Replication Architecture below.

Undo Log and MVCC

InnoDB implements MVCC through undo logs, allowing readers to see consistent snapshots without blocking writers.

Hidden row columns. Every InnoDB row carries two hidden fields:

  • DB_TRX_ID — the transaction ID of the last modification
  • DB_ROLL_PTR — a pointer into the undo log for the previous version

The undo log stores previous row versions as a linked list. A SELECT walks this chain to find the version visible to its read view.

Read views by isolation level:

Isolation LevelSnapshot TakenPhantom Protection
REPEATABLE READ (default)At first read in the transactionYes (via next-key locking)
READ COMMITTEDFresh snapshot per statementNo

Long-running transactions cause undo log bloat

A transaction that stays open for hours prevents InnoDB from purging old row versions. The “history list length” in SHOW ENGINE INNODB STATUS grows, undo tablespace balloons, and read performance degrades as queries must walk longer version chains. Monitor trx_rows_locked and trx_rows_modified in information_schema.innodb_trx.

Both MySQL and Postgres use MVCC

MVCC is not a Postgres-only feature. Both engines implement snapshot isolation through versioned rows. The difference is where old versions live:

  • Postgres stores old row versions in the heap itself (dead tuples), which is why it needs VACUUM to reclaim space.
  • InnoDB stores old versions in a separate undo tablespace, so the main tablespace stays clean — but undo purge lag is the equivalent problem.

Neither approach is better; they shift the maintenance burden. InnoDB’s REPEATABLE READ provides true snapshot isolation via MVCC (readers never block writers, writers never block readers), just like Postgres. The difference is that InnoDB also uses next-key locking to prevent phantom reads at REPEATABLE READ, while Postgres requires SERIALIZABLE isolation for that guarantee.

Doublewrite Buffer

Protection against partial page writes (torn pages).

InnoDB pages are 16KB, but filesystem blocks are typically 4KB. A crash mid-write could leave a page half-old, half-new — corrupted beyond redo log repair (the redo log stores physiological diffs — hybrid change records that describe a logical operation (e.g. “set byte 42 of page 7 to value X”) rather than copying the entire 16KB page — not full page images).

The doublewrite buffer solves this: pages are written to a sequential doublewrite area first, then to their actual locations. If a page is torn, crash recovery reads the clean copy from the doublewrite buffer.

Why Postgres doesn't need doublewrite

Postgres WAL includes full page images on the first modification after each checkpoint (full_page_writes = on). This makes torn page recovery possible from WAL alone, at the cost of larger WAL volume.

Change Buffer

Deferred updates to secondary index pages that are not in the buffer pool.

When an INSERT, UPDATE, or DELETE affects a secondary index page that is not currently cached, InnoDB buffers the change in the change buffer instead of performing a random disk read. The changes are merged later when the page is eventually loaded or during background merge operations.

Only non-unique secondary indexes

Unique indexes require a disk read to check for duplicates, so they cannot use the change buffer. This is why non-unique secondary indexes are cheaper to maintain on write-heavy tables — inserts avoid random disk I/O entirely until the page is needed for a read.

innodb_change_buffer_max_size controls the percentage of the buffer pool allocated (default 25%). For read-heavy workloads, reducing this frees memory for data caching.

Tablespaces

How InnoDB organizes data on disk:

TablespaceContents
System (ibdata1)Undo logs (pre-8.0), doublewrite buffer, change buffer, data dictionary (pre-8.0)
File-per-table (.ibd files)Default since MySQL 5.6. Each table gets its own file. Enables OPTIMIZE TABLE to reclaim space after bulk deletes.
GeneralUser-created tablespaces that can hold multiple tables. Useful for grouping related tables on specific storage.
TemporaryImplicit temp tables created during sorting, grouping, or complex joins.

File-per-table and space reclamation

With innodb_file_per_table = ON (the default), dropping a table or running OPTIMIZE TABLE returns space to the OS. Without it, space inside ibdata1 is never returned — it can only be reused internally. This is one of the most common “disk is full” surprises in legacy MySQL installations.

Locking

InnoDB uses row-level locking (unlike MyISAM’s table-level locks). The locking model is sophisticated:

Lock TypeWhat It Locks
Record lockA single index record
Gap lockThe gap between index records (prevents inserts into the gap)
Next-key lockRecord + gap before it. The default in REPEATABLE READ.

Next-key locking is how InnoDB prevents phantom reads in REPEATABLE READ without requiring SERIALIZABLE isolation. It locks not just the rows matching a predicate but also the gaps between them, preventing concurrent inserts that would create phantom rows.

Gap locking in practice

Given a table tasks with a composite index on (project, state):

SELECT * FROM tasks
WHERE project = 'backend' AND state = 'IN_PROGRESS'
FOR UPDATE;

This locks not just matching rows but also the gaps around them in the index, preventing another transaction from inserting a new ('backend', 'IN_PROGRESS') row until the lock is released.

Deadlock detection. InnoDB maintains a wait-for graph — a directed graph where each node is a transaction and each directed edge means “transaction A is waiting for transaction B to release a lock”; a cycle in this graph indicates a deadlock — and automatically detects deadlocks. The transaction with the fewest undo log records is rolled back as the victim. innodb_deadlock_detect = ON (default) enables this. For very high concurrency, some teams disable detection and rely on innodb_lock_wait_timeout instead (the graph traversal itself becomes a bottleneck).

Comparison with Postgres locking

Postgres uses SSI (Serializable Snapshot Isolation) for true serializability — it detects dangerous patterns in read/write dependencies and aborts transactions. InnoDB uses gap locking, which is more conservative (blocks more concurrent operations) but simpler to reason about. Postgres does not have gap locks; its REPEATABLE READ does not prevent phantoms (you need SERIALIZABLE for that). See Row-Level Security for Postgres’s complementary row-filtering mechanism.

Replication Architecture

InnoDB’s redo log and MySQL’s binlog serve different purposes:

LogScopePurpose
Redo logInnoDB engineCrash recovery
BinlogMySQL serverReplication + point-in-time recovery

The binlog is written at the server level (above the storage engine) and supports three formats:

  • Row-based (default in 8.0+): logs the actual row changes. Deterministic, safe, larger volume.
  • Statement-based: logs the SQL statements. Compact but non-deterministic for some functions (NOW(), RAND()).
  • Mixed: statement-based by default, switches to row-based when needed.

GTID (Global Transaction Identifier) assigns a unique ID to every transaction, making multi-source replication and failover reliable. Multi-source replication means a single replica pulling changes from more than one primary — useful for consolidating writes from multiple shards or data centres. With GTID, a replica knows exactly which transactions it has applied, regardless of binlog file positions.

Two-phase commit between redo log and binlog

InnoDB uses an internal two-phase commit (2PC) — a protocol that coordinates two separate write targets so they either both commit or both abort — to keep the redo log and binlog consistent. Phase 1 (prepare): InnoDB writes a “prepare” record to the redo log. Phase 2 (commit): MySQL writes the event to the binlog, then InnoDB writes “commit” to the redo log. If MySQL crashes between prepare and binlog write, the transaction is rolled back on recovery. If it crashes after binlog write, the transaction is committed. This is why sync_binlog = 1 paired with innodb_flush_log_at_trx_commit = 1 gives full durability.

See also