OLAP compresison techniques

Run-Length Encoding (RLE)

Run-Length Encoding compresses repeating values by storing the value once and tracking its frequency. This is particularly effective in sorted or low-cardinality datasets where long sequences of identical values appear consecutively.

A naive implementation:

from itertools import groupby
 
def rle_encode(data):
    return [(key, len(list(group))) for key, group in groupby(data)]
 
print(rle_encode([1, 1, 1, 2, 2, 3, 3, 3, 3]))
# [(1, 3), (2, 2), (3, 4)]

While efficient for query execution (fewer comparisons during scans), RLE struggles when values change frequently, leading to fragmentation and reduced effectiveness.

Dictionary Encoding (DICT)

Dictionary encoding replaces unique values with small integer identifiers, reducing storage overhead and allowing fast lookups. This is effective in high-cardinality datasets with repeated values.

def dict_encode(data):
    dictionary = {val: idx for idx, val in enumerate(sorted(set(data)))}
    encoded = [dictionary[val] for val in data]
    return dictionary, encoded
 
values = ['apple', 'banana', 'apple', 'orange', 'banana']
print(dict_encode(values))
# ({'apple': 0, 'banana': 1, 'orange': 2}, [0, 1, 0, 2, 1])

For fast scans, query engines translate the encoded values back using the dictionary, leveraging efficient integer-based operations instead of string comparisons.

RLE-DICTIONARY Encoding (RLE_DICT)

This combines RLE and dictionary encoding, where dictionary-encoded values are further compressed using RLE. This is beneficial when highly repetitive values appear in long sequences but are not purely sorted.

Consider the sequence:

"apple", "apple", "apple", "banana", "banana", "apple", "apple"

With dictionary encoding:

[0, 0, 0, 1, 1, 0, 0]

With RLE applied to the dictionary-encoded values:

[(0, 3), (1, 2), (0, 2)]

This provides the benefits of both dictionary lookup efficiency and RLE compression.

Delta Encoding

Delta encoding stores differences between consecutive values instead of absolute values, significantly reducing storage when values have a small range of differences.

def delta_encode(data):
    return [data[i] - data[i - 1] for i in range(1, len(data))]
 
values = [100, 105, 110, 120]
print(delta_encode(values))
# [5, 5, 10]

For queries, prefix sums restore the original values efficiently. This technique is used in timestamps, sorted numeric columns, or monotonic sequences.

Bitpacking

Bitpacking encodes integer values using the minimal number of bits necessary instead of standard 32-bit or 64-bit representations. This is particularly useful for storing dense numeric data where values fall within a known range.

Instead of storing each integer using a full 32-bit or 64-bit type, an OLAP DBMS determines the minimum bit-width needed and stores data accordingly. This significantly reduces memory footprint while maintaining performance.

For example, if values range from 0 to 31, they only require 5 bits instead of 32:

import numpy as np
 
def bitpack(data, bit_width):
    packed = np.packbits(np.array(data, dtype=np.uint8))
    return packed
 
values = [1, 3, 7, 15]
print(bitpack(values, 4))

To reconstruct the values, a shift and mask operation is applied during query execution. Some OLAP databases may also store an additional metadata field that records the bit width, min value, or shift factor, allowing efficient decompression while keeping storage requirements low.

This method is frequently combined with Frame-of-Reference (FOR) encoding for even greater efficiency.

Frame-of-Reference (FOR) Encoding

Frame-of-Reference (FOR) encoding is a variation of delta encoding, but instead of storing differences, it stores values relative to a reference point.

Instead of storing:

100, 105, 110, 120

FOR stores:

(100, [0, 5, 10, 20])

This allows fixed-width storage of offsets, reducing space while enabling fast access. FOR works well in numeric datasets where values are clustered around a central range.

Edmondo's Vault

Explorer