BTree in Postgres

PostgreSQL uses a B+Tree as the default index structure for primary keys, unique constraints, and user-defined indexes using the B+Tree type. The B+Tree is optimized for read-heavy workloads and provides O(log N) time complexity for search, insertion, and deletion operations. Unlike a binary tree, internal nodes store only keys and pointers, while leaf nodes contain the actual data or pointers to the tuples in the heap.

The B+Tree’s order determines how many keys fit in a node. Internal nodes direct queries to the correct leaf node by comparing keys. Each leaf node contains sorted keys and references to the corresponding rows in the heap, with sibling leaf nodes connected via a doubly-linked list to optimize range queries.

B-Tree Page Splitting and Structure

When a leaf node is full and a new entry must be inserted, PostgreSQL performs a page split. The database allocates a new page, moves approximately half of the keys from the full page to the new one, and updates the parent node with a new key that differentiates the two pages. This operation involves modifying multiple pages but remains efficient due to PostgreSQL’s fixed-size pages and the use of the Write-Ahead Log (WAL) for durability.

Structure of a B-Tree Page

A B-Tree page in PostgreSQL contains the following components:

Page Header: Stores metadata such as the page size, the number of tuples, and transaction visibility information.
Line Pointer Array: An array of pointers that reference the data items within the page. These pointers maintain the logical ordering of keys but the data itself can be placed anywhere within the page.
Tuple Data: The actual key-value pairs, typically arranged to maximize space utilization.
Free Space: Reserved space within the page to reduce the frequency of page splits, as controlled by the fill factor.

How Page Splitting Works

Page splits happen only when necessary, as each page reserves some free space (controlled by the fill factor) to reduce the frequency of splits. When a split occurs:

The B+Tree is traversed to locate the appropriate leaf page.
If the leaf page is full, a new page is allocated.
Approximately half of the tuples are moved to the new page.
The parent node is updated with a new key to distinguish between the old and new pages.
The sibling pointers between the affected leaf pages are adjusted to maintain the doubly-linked list.

This structure ensures efficient range scans and maintains the sorted order of keys. The use of fixed-size pages allows PostgreSQL to write the entire page in a single disk I/O operation, minimizing performance overhead.

Important

Page splits happen only when necessary, as each page reserves some free space (controlled by the fill factor) to reduce the frequency of splits. During the split, PostgreSQL maintains the integrity of the B+Tree structure by updating pointers in the parent node to reflect the new page.

The B+Tree indexes also adhere to the segmentation policy. The hierarchical structure of the FSM ensures efficient insert operations even when dealing with multi-segment tables.

VACUUM Process

VACUUM is essential in PostgreSQL due to its MVCC (Multi-Version Concurrency Control) model. Updates and deletes create dead tuples that occupy space until reclaimed by VACUUM. The process identifies dead tuples, removes them, and updates the FSM with the reclaimed space.

VACUUM operates in two modes:

Regular VACUUM: Removes dead tuples and updates FSM.
VACUUM FULL: Rewrites the entire table to reclaim space and defragment it.

When VACUUM runs, it scans pages, marks free space in the FSM, and updates the Visibility Map (VM), which tracks which pages are entirely visible to all transactions and thus do not need to be vacuumed frequently.

Edmondo's Vault

Explorer

BTree in Postgres

B-Tree Page Splitting and Structure

Structure of a B-Tree Page

How Page Splitting Works

VACUUM Process

Graph View

Table of Contents

Backlinks