Column Pruning

Parquet and ORC handle deeply nested structures differently, affecting how efficiently specific fields can be queried. Parquet eliminates the need for offset arrays by encoding repetition and definition levels alongside each value. This allows a query engine to scan only the necessary columns, skipping unrelated hierarchy levels without extra lookups.

ORC, in contrast, requires explicit offset arrays at every nesting level to define boundaries for repeated fields. Accessing a deeply nested field like items.product inside orders means first scanning orders_offsets to locate the correct order, then items_offsets to find the products. Even if the query only requests product, all parent offset arrays must be read first.

Example: Querying a Deeply Nested Field

Consider a dataset where each customer has multiple orders, and each order contains multiple items:

{
  "customer_id": 1,
  "orders": [
    {
      "timestamp": "2024-01-01",
      "items": [
        {"product": "Laptop", "price": 1000},
        {"product": "Mouse", "price": 20}
      ]
    },
    {
      "timestamp": "2024-02-01",
      "items": [
        {"product": "Keyboard", "price": 50}
      ]
    }
  ]
}

In ORC, to retrieve items.product, the system must first read:

orders_offsets to find order boundaries.
items_offsets to locate the products within each order.
Finally, extract the product values.

In Parquet, product is stored as a separate column, with repetition levels marking which order and which row they belong to. The query engine can scan only the product column, skipping all offsets and hierarchy tracking.

Toy Implementation

ORC (Using Offsets)

def orc_encode(data):
    products, prices, items_offsets, orders_offsets = [], [], [0], [0]
    item_count, order_count = 0, 0
    
    for customer in data:
        for order in customer.get("orders", []):
            for item in order.get("items", []):
                products.append(item["product"])
                prices.append(item["price"])
                item_count += 1
            items_offsets.append(item_count)
            order_count += 1
        orders_offsets.append(order_count)
    
    return products, prices, items_offsets, orders_offsets

Parquet (Using Repetition/Definition Levels)

def parquet_encode(data):
    products, prices, rep_levels, def_levels = [], [], [], []
    
    for customer in data:
        for order in customer.get("orders", []):
            for idx, item in enumerate(order.get("items", [])):
                products.append(item["product"])
                prices.append(item["price"])
                rep_levels.append(0 if idx == 0 else 1)  # 0 = new order, 1 = continued
                def_levels.append(1)
    
    return products, prices, rep_levels, def_levels

For workloads where only a subset of a nested structure is needed, Parquet’s shredding allows direct column access, reducing I/O and metadata scanning. ORC provides O(1) row lookup using offsets but incurs overhead when reading selective nested fields.

Handling Sparse Data

Sparse columns, where many values are NULL, are encoded differently. Parquet avoids dedicated null bitmaps, embedding nullability information directly inside the definition level stream. A column with 90% missing values only stores definition levels instead of allocating a separate bitmap.

ORC relies on explicit null bitmaps, meaning even a sparsely populated field must have a corresponding fixed-size null indicator. For highly nullable fields, this overhead can be significant, especially in deeply nested structures where each level requires a separate null bitmap.

Example: Sparse Data Encoding

Consider an items table with an optional discount field:

Row	Product	Price	Discount
1	Laptop	1000	50
2	Mouse	20	NULL
3	Keyboard	50	5

Unlike a bitmap, where each row has an explicit 0 or 1, definition levels are an enum that encodes presence inline with the data stream and can represent multiple levels of nullability.

ORC (Using Null Bitmaps)

def orc_encode_sparse(data):
    products, prices, discounts, null_bitmap = [], [], [], []
    
    for item in data:
        products.append(item["product"])
        prices.append(item["price"])
        if "discount" in item:
            discounts.append(item["discount"])
            null_bitmap.append(1)
        else:
            null_bitmap.append(0)
    
    return products, prices, discounts, null_bitmap

Parquet (Using Definition Levels)

def parquet_encode_sparse(data):
    products, prices, discounts, def_levels = [], [], [], []
    
    for item in data:
        products.append(item["product"])
        prices.append(item["price"])
        if "discount" in item:
            discounts.append(item["discount"])
            def_levels.append(1)  # Present value
        else:
            def_levels.append(0)  # NULL value
    
    return products, prices, discounts, def_levels

Definition Levels in Nested Structs

For deeply nested fields, Parquet’s definition levels avoid storing separate null bitmaps for each level. Instead of tracking NULL for each nested field independently, Parquet can store a single definition level per row that indicates if an entire struct is missing, saving space and processing overhead.

Example:

[
  { "user": null },
  { "user": { "age": null } },
  { "user": { "age": 30 } }
]

Definition Levels for user.age:

[0, 1, 2]
Values: [30]

0: The user struct is missing, so age is implicitly NULL.
1: The user exists, but age itself is NULL.
2: The age field exists and has a value (30).

Queries scanning mostly non-null rows may benefit from ORC’s structured bitmap approach, but for workloads with frequent NULLs and nested structures, Parquet’s definition-level encoding avoids redundant storage and lookup overhead, making it more compact and efficient.

Edmondo's Vault

Explorer