Partition Across Attributes

** The PAX paper is motivated by how CPUs actually access data. The CPU never reads a single byte from RAM — it reads in cache lines, typically 64 bytes. Accessing address X causes the hardware to fetch the entire 64-byte-aligned block containing X into L1 cache. A subsequent access to X+4 is free (~1ns), while accessing an unrelated address Y on a different cache line costs ~100ns (a round trip to RAM) or ~4-12ns if it happens to be in L2/L3.

┌──────────────────────────────────────────────────────────┐
│  L1 Cache     ~32 KB,   ~1ns   (per core)               │
│  L2 Cache     ~256 KB,  ~4ns   (per core)               │
│  L3 Cache     ~8-30 MB, ~12ns  (shared across cores)    │
│  RAM          ~64 GB,   ~100ns                           │
│  Disk (SSD)   TB,       ~10,000-100,000ns                │
└──────────────────────────────────────────────────────────┘

Sequential access through contiguous memory is fast because the hardware prefetcher detects the linear pattern and preloads upcoming cache lines. Scattered access defeats the prefetcher and causes cache misses, each stalling the CPU.

Databases store data in pages (typically 8KB). The buffer pool is a region of RAM caching recently-used disk pages. An 8KB page occupies 128 cache lines and fits comfortably in L2 (256KB). This fact is central to PAX’s design.

Storage Models

NSM (N-ary Storage Model)

NSM — named because all N attributes of a tuple are stored together as a single unit — stores complete tuples contiguously within a page. This is the model used by PostgreSQL, MySQL/InnoDB, and most OLTP systems.

8KB NSM Page
┌─────────────────────────────────────────────────────┐
│ Page Header                                         │
├─────────────────────────────────────────────────────┤
│ Line Pointers: [→T1, →T2, →T3, →T4, ...]          │
├─────────────────────────────────────────────────────┤
│                  ↑ free space ↓                     │
├─────────────────────────────────────────────────────┤
│ T4: [hdr][col_A=7 ][col_B="foo"][col_C=3.14]       │
│ T3: [hdr][col_A=42][col_B="bar"][col_C=2.71]       │
│ T2: [hdr][col_A=99][col_B="baz"][col_C=1.41]       │
│ T1: [hdr][col_A=1 ][col_B="qux"][col_C=0.57]       │
└─────────────────────────────────────────────────────┘
  (tuples grow from bottom up, line pointers from top down)

Each line pointer is a small offset (4 bytes) recording where a tuple starts on the page. A TID (Tuple ID) in PostgreSQL is (page_number, line_pointer_index) — this is what index entries point to. The tuple header in PostgreSQL contains xmin (creating transaction), xmax (deleting/updating transaction), ctid (pointer to next version in an update chain), infomask (visibility hint bits, null bitmap), all explained further below.

An index lookup like SELECT * FROM users WHERE id = 42 traverses the B-tree → gets a TID → fetches the page → follows the line pointer → reads a contiguous ~200-byte tuple. That’s 3-4 sequential cache lines, automatically prefetched. Zero waste.

An analytical scan like SELECT AVG(col_A) FROM big_table must walk every tuple on every page, loading the entire tuple (~200 bytes) into cache to read 4 bytes of col_A. That’s ~98% cache pollution — L1/L2 fills with irrelevant column data, evicting useful values.

DSM (Decomposition Storage Model)

DSM, from a 1985 paper by Copeland and Khoshafian, stores each column in a separate file. Each file’s pages contain only values for that column, packed contiguously.

File for col_A:        File for col_B:        File for col_C:
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ 1, 42, 99,   │      │ "qux","bar"  │      │ 0.57, 2.71   │
│ 7, 13, 56,   │      │ "baz","foo"  │      │ 1.41, 3.14   │
│ ... (thousands│      │ ...          │      │ ...           │
│  of values)   │      │              │      │               │
└──────────────┘      └──────────────┘      └──────────────┘

A pure column scan is optimal: SELECT AVG(col_A) reads only the col_A file, every byte loaded is useful, and the prefetcher thrives on the sequential pattern.

Tuple reconstruction is the problem. Fetching all columns for row 42 requires reading a page from each column file. These pages are at different locations in RAM (different buffer pool frames, potentially megabytes apart), so cross-column access means cache misses at every step. A realistic multi-column query like SELECT col_B, col_C WHERE col_A > 100 must scan col_A pages, identify qualifying positions, then access col_B and col_C pages at those positions — each in a different RAM region, with cache lines from earlier files likely evicted by the time you reach later ones.

PAX (Partition Attributes Across)

PAX, introduced by Ailamaki et al. in 2001, keeps the page as the I/O unit (like NSM) but reorganizes data within each page into minipages — exactly one minipage per column. A table with 12 columns produces 12 minipages within each 8KB page. Each minipage holds all values for its column across every row stored on that page:

8KB PAX Page (holds ~200 rows, table has 3 columns → 3 minipages)
┌──────────────────────────────────────────────────────┐
│ Page Header + Minipage Directory                     │
├──────────────────────────────────────────────────────┤
│ Minipage col_A (200 × 4B = 800 bytes):               │
│   [1, 42, 99, 7, 13, 56, ...]                       │
├──────────────────────────────────────────────────────┤
│ Minipage col_B (variable-length, ~2KB):              │
│   ["qux", "bar", "baz", "foo", ...]                 │
├──────────────────────────────────────────────────────┤
│ Minipage col_C (200 × 8B = 1600 bytes):              │
│   [0.57, 2.71, 1.41, 3.14, ...]                     │
└──────────────────────────────────────────────────────┘

Row k on this page is reconstructed by reading position k from each minipage. No join, no cross-file stitching — just offset arithmetic within the same 8KB region.

At first glance, scanning col_A’s minipage inside a PAX page looks identical to scanning col_A’s page in a DSM file — both are contiguous runs of col_A values, both waste the same fraction of each cache line. The difference is about what else is in cache at the same time.

Why co-location within a page matters

Consider SELECT col_B, col_C FROM big_table WHERE col_A > 100.

In DSM, after scanning col_A’s page and finding qualifying row positions, you access col_B’s page and col_C’s page — each at a different RAM address in the buffer pool. Three 8KB pages across three separate memory regions means a 24KB working set scattered across RAM. Cache lines from col_A’s region may be evicted before you finish reading col_C, and the prefetcher cannot anticipate jumps between unrelated buffer pool frames.

In PAX, the entire 8KB page — containing all three minipages — sits in a single contiguous region of RAM. When you jump from col_A’s minipage to col_B’s minipage, you move ~800 bytes forward within the same cached 8KB region. That’s an L2 hit (~4ns), not a RAM fetch (~100ns). The working set is 8KB in one place, not 24KB scattered across three.

               DSM                              PAX
               ───                              ───
Pages in RAM   3 × 8KB at different             1 × 8KB at one
               buffer pool addresses            buffer pool address

Cache impact   24KB across 3 regions;           8KB in one region;
               L2 conflict misses likely        fits in L2, no conflicts

Cross-column   Jump to different RAM region     Jump ~hundreds of bytes
access cost    per column (~100ns if L2 miss)   within same page (~4ns)

The effect compounds during scans. Scanning col_A across 200 rows loads ~13 cache lines (800 bytes / 64). In DSM, these cache lines share L2 space with col_B’s and col_C’s cache lines from different buffer pool frames, causing conflict evictions. In PAX, everything is within one 8KB page — pure spatial locality with no conflicts.

Why OLTP Systems Do Not Use PAX

Full-tuple access is the dominant pattern

The typical OLTP query fetches one or a few complete rows via an index:

SELECT * FROM orders WHERE order_id = 12345

In NSM, the index returns a TID, you follow the line pointer, and the full tuple is a contiguous 200-byte region — 3-4 sequential cache lines, automatically prefetched. In PAX, the same operation requires reading position k from each of the N minipages. For a 20-column table, that’s 20 scattered reads within the page. Each is an L2 hit (~4ns), but 20 × 4ns = ~80ns, versus ~4ns for NSM’s single sequential read. PAX is strictly worse for the dominant OLTP access pattern.

MVCC and tuple versioning

PostgreSQL implements MVCC (Multi-Version Concurrency Control) by storing multiple physical copies of each tuple on the heap. There is no separate version store — old and new versions coexist as distinct tuples on heap pages. Each tuple carries its own version metadata in its header: xmin (the transaction that created this version), xmax (the transaction that deleted or updated it, 0 if still live), and infomask (bit flags with visibility hints).

When any backend reads a tuple, it performs a visibility check before returning it: is xmin committed and visible to my snapshot? Is xmax absent or not yet committed in my snapshot? The infomask bits provide a fast path — once a tuple has been checked and found definitely visible or definitely dead, hint bits are set so future readers skip the full transaction-log lookup.

In NSM, this works cleanly because each tuple is a self-contained unit: header + data, contiguous in memory. The visibility check reads the first ~23 bytes (the header), and if the tuple is invisible, the backend skips the rest entirely — the data columns are never loaded into cache. If the tuple is visible, the data follows immediately in the same cache lines.

In PAX, the tuple is not a unit — it’s scattered across N minipages. The version metadata (xmin, xmax, infomask) must be stored somewhere, and every option creates problems.

The natural PAX approach would be to create a dedicated header minipage that stores all tuple headers contiguously, just as each data column gets its own minipage. Position k in the header minipage corresponds to position k in every data minipage. A visibility check then reads position k from the header minipage, and only if the tuple is visible does it proceed to read data from the other minipages.

For scans, this is actually fine — you’d scan the header minipage first, build a bitmask of visible rows, then scan only the relevant data minipages using that bitmask.

The problems emerge with point lookups and writes. For a point lookup via index, an NSM backend does one contiguous read: header check + data retrieval happen within the same few cache lines. A PAX backend must first read from the header minipage (one location on the page), then jump to N different minipages (N other locations) to gather the data. These are all L2 hits (same page, fits in cache), but the access pattern is scattered rather than sequential, defeating the prefetcher and requiring more cache line loads.

More subtly, PostgreSQL lazily sets hint bits on tuples during reads. The first backend to see that a tuple’s inserting transaction has committed will set HEAP_XMIN_COMMITTED in the infomask. This is a write during a read path: the page gets dirtied, which means it must eventually be flushed to disk, and the modification must be WAL-logged (if wal_log_hints is on or checksums are enabled). In NSM, this dirty is a small write to a byte within the tuple — the tuple’s cache line was already loaded. In PAX, the hint bit lives in the header minipage, which might be on a different cache line than the data minipages the backend just read. The page is still 8KB and in L2, so this isn’t catastrophic, but it adds scattered dirty writes to the page on a path that’s supposed to be read-mostly.

The deeper problem is updates. When PostgreSQL updates a tuple, it doesn’t modify the existing tuple in place. It writes an entirely new tuple version (with a new xmin = current transaction, xmax = 0) and sets the old tuple’s xmax to the current transaction. In NSM, the new tuple is a single contiguous write, placed in free space on the same page if possible. The old tuple gets one field changed (xmax), which is a write to its header bytes.

In PAX, writing a new tuple version means appending a value to each of the N minipages: the new col_A value goes to position k_new in col_A’s minipage, the new col_B value to position k_new in col_B’s minipage, and so on, plus a new header entry at position k_new in the header minipage. Meanwhile, the old version’s header at position k_old in the header minipage must have its xmax set. This is N+1 scattered writes within the page for what NSM accomplishes with 2 contiguous writes (one new tuple, one xmax update on the old tuple).

An alternative design would store version information per-minipage — each minipage entry could carry its own xmin/xmax. But this means a single logical tuple has N independent version markers. During a crash or rollback, some minipages might reflect the new version while others still hold the old. Ensuring that all N minipages agree on which version is current requires either an atomic multi-minipage write protocol (complex, latch-intensive) or a reconciliation step during recovery (fragile, slow). No production system has found this trade-off worthwhile.

HOT (Heap-Only Tuples) in PostgreSQL

HOT is one of PostgreSQL’s most important write-path optimizations, and it depends fundamentally on NSM’s contiguous tuple layout.

The problem HOT solves: in PostgreSQL, every index contains pointers (TIDs) to heap tuples. When a tuple is updated, a new version is written, and normally every index on the table must be updated to also point to the new version. For a table with 5 indexes being updated 1000 times/second, that’s 5000 index insertions/second just for bookkeeping.

HOT eliminates these index updates when two conditions are met: the update does not change any indexed column, and the new tuple version fits on the same heap page as the old one. When both hold, PostgreSQL:

Writes the new tuple version on the same page as the old one, in the normal free-space region
Sets the old tuple’s ctid field to point to the new tuple’s line pointer index (forming a HOT chain)
Does not update any index — all indexes still point to the original (root) tuple’s TID

When a later index scan follows a TID to the root of a HOT chain, it finds the old tuple, checks its ctid, follows the chain forward to the newest version, and performs visibility checks along the way to find the version appropriate for its snapshot.

Index entry → TID (page 57, slot 3)

Page 57:
  slot 3 → T3_v1 [xmin=100, xmax=200, ctid=(57,7)]  ← old version
  slot 7 → T3_v2 [xmin=200, xmax=300, ctid=(57,9)]  ← middle version
  slot 9 → T3_v3 [xmin=300, xmax=0,   ctid=(57,9)]  ← current version

Eventually, when old versions are no longer visible to any transaction, pruning (a lightweight mini-VACUUM that happens during normal page access) removes dead versions and collapses the chain so slot 3 redirects directly to the live version.

This entire mechanism relies on tuples being self-contained, contiguous, independently writable units on the page. Writing a new version is one contiguous write in the free-space region. Setting the old version’s ctid is a small modification to the old tuple’s header. The chain is just a linked list of line pointer indices within the page, each pointing to a complete tuple.

In PAX, a tuple is not a unit — it’s N values at position k across N minipages. Writing a new version means appending one value to each minipage. But the HOT chain is a sequence of tuple positions: k_old → k_new → k_newer. Following the chain means: go to position k_old in the header minipage, read ctid, find it points to position k_new, go to position k_new in the header minipage, read its ctid, eventually find the live version at position k_live, then read position k_live from each data minipage.

This is mechanically possible, but the pruning step becomes far more complex. In NSM, pruning a dead tuple means marking a contiguous region as free — the line pointer is set to “dead,” and the space occupied by the tuple bytes is reclaimable. In PAX, pruning position k_old means reclaiming space at position k_old in each minipage independently. If minipages are packed arrays (values stored contiguously for cache efficiency), removing an element from the middle requires either shifting subsequent elements (expensive, and invalidates all positions after k_old, which breaks other HOT chains and index TIDs) or leaving a gap (fragmentation within the minipage that accumulates over time).

NSM avoids this because tuples don’t have a meaningful “position” within the data region — they’re placed wherever there’s free space, and the line pointer indirection decouples their physical location from their logical identity. In PAX, the position k is meaningful because it’s the index into every minipage simultaneously. You can’t move one column’s value without moving all of them, and you can’t leave gaps without degrading the very cache locality that PAX was designed to provide.

WAL (Write-Ahead Logging)

Before modifying any page, the database writes a WAL record describing the change for crash recovery. In NSM, a tuple insert produces one record containing the complete tuple as a contiguous byte sequence:

WAL record: INSERT on page 57
  offset: 7904
  length: 200
  data:   [23 bytes header | 4 bytes col_A | 20 bytes col_B | 8 bytes col_C | ...]

Recovery replays this as a single memcpy of 200 bytes to offset 7904 on page 57.

In PAX, inserting one tuple means appending a value to each of the N minipages, each at a different offset within the page. For a 12-column table, that’s 12 writes at 12 different page offsets. The WAL must capture all of them, and the options are:

Option 1 — one WAL record per minipage write. This produces N WAL records per tuple insert. For a 12-column table with 1000 inserts/second, that’s 12,000 WAL records/second instead of 1,000. WAL volume grows proportionally, increasing I/O pressure on the WAL segment and replication lag on standbys. Worse, atomicity within a page becomes a concern: if the system crashes after writing WAL records for 7 of 12 minipages, recovery replays those 7 and the page is left in an inconsistent state where some minipages have the new row at position k and others don’t. You’d need a mechanism to detect and roll back partial tuple inserts at the page level — something NSM never has to worry about, because a tuple insert is a single atomic write.

Option 2 — one compound WAL record. A single WAL record describes all N minipage modifications:

WAL record: PAX_INSERT on page 57
  minipage_A offset 800, write 4 bytes: [col_A value]
  minipage_B offset 2048, write 20 bytes: [col_B value]
  minipage_C offset 4864, write 8 bytes: [col_C value]
  minipage_hdr offset 64, write 23 bytes: [header]
  ... (8 more entries for remaining columns)

This preserves atomicity — replay either applies all minipage writes or none. But the WAL record format is more complex (a variable-length list of offset/length/data triples instead of a single offset/length/data), the replay logic must iterate over entries and apply each one, and the record itself is larger due to repeated offset metadata. Every downstream system that reads WAL — streaming replication, logical decoding, pg_rewind, backup tools — must understand this new record format.

Neither option is unworkable in isolation. The problem is that this added complexity permeates the most correctness-critical subsystem in the database, where bugs cause data loss, and the benefit for OLTP workloads (where PAX’s cache advantage barely materializes) doesn’t justify the risk.

Variable-length column management

NSM pages have a simple space allocation model: tuples grow from the bottom of the page, line pointers grow from the top, they meet in the middle. Variable-length fields within a tuple are handled by an offset array at the start of the tuple’s data region. Dead tuples leave gaps that are compacted by VACUUM or reused by future inserts.

PAX pages need per-minipage space management. A fixed-length minipage (e.g., INT values) is straightforward — a packed array where position k is at base + k × sizeof(int). But a variable-length minipage (e.g., VARCHAR, TEXT) needs its own offset array and free-space tracking, because values have different sizes.

The fragmentation problem is unique to PAX: when the VARCHAR minipage fills up but the INT minipages still have room, the page cannot accept new tuples even though the page has free bytes overall. In NSM, free space is fungible — any free byte on the page can be used by any new tuple. In PAX, free space is partitioned across minipages and not interchangeable.

This means either over-provisioning each minipage with slack space (wasting capacity), dynamically resizing minipages at insert time (expensive, requires moving data), or accepting lower page fill factors (more pages, more I/O). Under OLTP write loads with variable-length data — which is most real-world schemas — this creates fragmentation that NSM avoids entirely.

Latch contention

OLTP systems protect pages with lightweight latches (not transactional locks — latches ensure physical consistency during concurrent access). In NSM, a writer latches the page, writes a contiguous ~200-byte tuple at one location, and releases. The critical section is small: one memcpy plus a line pointer update.

In PAX, a writer latches the same page but modifies N different offsets — one per minipage. Each minipage write dirties a different cache line, and the bookkeeping (updating N minipage free-space pointers, a minipage directory, etc.) takes more instructions. The critical section under the latch is longer, which means other backends waiting to access the same page are blocked for longer. Under high-concurrency OLTP workloads where hot pages are contended (e.g., the tail of a time-series table), this increased latch hold time directly reduces throughput.

Why PAX Succeeded in Analytical Formats

Apache Parquet and ORC are PAX’s idea applied to immutable batch-written files for distributed analytics.

PAX Concept	Parquet / ORC Equivalent
Page (I/O unit containing all columns for a group of rows)	Row Group (Parquet) / Stripe (ORC)
Minipages (one per column within the page)	Column chunks within a row group/stripe
Cache-friendly columnar scans within the page	Predicate pushdown, column pruning
Tuple reconstruction without cross-file joins	Reconstruction within row group/stripe by position

Every hard OLTP problem disappears in this context. There is no MVCC — files are immutable once written. There is no WAL — files are written atomically, no crash recovery within a page. There is no latch contention — reads are concurrent, writes produce new files. There is no HOT chain maintenance — there are no updates at all. There is no variable-length fragmentation — the writer knows all the data upfront and can lay out column chunks with exact sizing.

On top of PAX’s co-location insight, these formats add compression per column chunk (same-type data compresses far better than mixed-type tuples), encoding schemes (dictionary, run-length, delta), rich per-column statistics (min/max, bloom filters) for query planning, and nested data support (Parquet uses Dremel’s repetition/definition levels).

PAX’s benefit is proportional to how many rows you scan, how few columns you project, and how little you write. Analytical workloads maximize all three. OLTP workloads are the exact inverse — few rows, all columns, heavy writes — which is why production OLTP systems universally use NSM.

Edmondo's Vault

Explorer