Parquet and ORC handle deeply nested structures differently, affecting how efficiently specific fields can be queried. Parquet eliminates the need for offset arrays by encoding repetition and definition levels alongside each value. This allows a query engine to scan only the necessary columns, skipping unrelated hierarchy levels without extra lookups.
ORC, in contrast, requires explicit offset arrays at every nesting level to define boundaries for repeated fields. Accessing a deeply nested field like items.product inside orders means first scanning orders_offsets to locate the correct order, then items_offsets to find the products. Even if the query only requests product, all parent offset arrays must be read first.
Example: Querying a Deeply Nested Field
Consider a dataset where each customer has multiple orders, and each order contains multiple items:
{
"customer_id": 1,
"orders": [
{
"timestamp": "2024-01-01",
"items": [
{"product": "Laptop", "price": 1000},
{"product": "Mouse", "price": 20}
]
},
{
"timestamp": "2024-02-01",
"items": [
{"product": "Keyboard", "price": 50}
]
}
]
}In ORC, to retrieve items.product, the system must first read:
orders_offsetsto find order boundaries.items_offsetsto locate the products within each order.- Finally, extract the
productvalues.
In Parquet, product is stored as a separate column, with repetition levels marking which order and which row they belong to. The query engine can scan only the product column, skipping all offsets and hierarchy tracking.
Toy Implementation
ORC (Using Offsets)
def orc_encode(data):
products, prices, items_offsets, orders_offsets = [], [], [0], [0]
item_count, order_count = 0, 0
for customer in data:
for order in customer.get("orders", []):
for item in order.get("items", []):
products.append(item["product"])
prices.append(item["price"])
item_count += 1
items_offsets.append(item_count)
order_count += 1
orders_offsets.append(order_count)
return products, prices, items_offsets, orders_offsetsParquet (Using Repetition/Definition Levels)
def parquet_encode(data):
products, prices, rep_levels, def_levels = [], [], [], []
for customer in data:
for order in customer.get("orders", []):
for idx, item in enumerate(order.get("items", [])):
products.append(item["product"])
prices.append(item["price"])
rep_levels.append(0 if idx == 0 else 1) # 0 = new order, 1 = continued
def_levels.append(1)
return products, prices, rep_levels, def_levelsFor workloads where only a subset of a nested structure is needed, Parquet’s shredding allows direct column access, reducing I/O and metadata scanning. ORC provides O(1) row lookup using offsets but incurs overhead when reading selective nested fields.
Handling Sparse Data
Sparse columns, where many values are NULL, are encoded differently. Parquet avoids dedicated null bitmaps, embedding nullability information directly inside the definition level stream. A column with 90% missing values only stores definition levels instead of allocating a separate bitmap.
ORC relies on explicit null bitmaps, meaning even a sparsely populated field must have a corresponding fixed-size null indicator. For highly nullable fields, this overhead can be significant, especially in deeply nested structures where each level requires a separate null bitmap.
Example: Sparse Data Encoding
Consider an items table with an optional discount field:
| Row | Product | Price | Discount |
|---|---|---|---|
| 1 | Laptop | 1000 | 50 |
| 2 | Mouse | 20 | NULL |
| 3 | Keyboard | 50 | 5 |
Unlike a bitmap, where each row has an explicit 0 or 1, definition levels are an enum that encodes presence inline with the data stream and can represent multiple levels of nullability.
ORC (Using Null Bitmaps)
def orc_encode_sparse(data):
products, prices, discounts, null_bitmap = [], [], [], []
for item in data:
products.append(item["product"])
prices.append(item["price"])
if "discount" in item:
discounts.append(item["discount"])
null_bitmap.append(1)
else:
null_bitmap.append(0)
return products, prices, discounts, null_bitmapParquet (Using Definition Levels)
def parquet_encode_sparse(data):
products, prices, discounts, def_levels = [], [], [], []
for item in data:
products.append(item["product"])
prices.append(item["price"])
if "discount" in item:
discounts.append(item["discount"])
def_levels.append(1) # Present value
else:
def_levels.append(0) # NULL value
return products, prices, discounts, def_levelsDefinition Levels in Nested Structs
For deeply nested fields, Parquet’s definition levels avoid storing separate null bitmaps for each level. Instead of tracking NULL for each nested field independently, Parquet can store a single definition level per row that indicates if an entire struct is missing, saving space and processing overhead.
Example:
[
{ "user": null },
{ "user": { "age": null } },
{ "user": { "age": 30 } }
]Definition Levels for user.age:
[0, 1, 2]
Values: [30]
0: Theuserstruct is missing, soageis implicitly NULL.1: Theuserexists, butageitself is NULL.2: Theagefield exists and has a value (30).
Queries scanning mostly non-null rows may benefit from ORC’s structured bitmap approach, but for workloads with frequent NULLs and nested structures, Parquet’s definition-level encoding avoids redundant storage and lookup overhead, making it more compact and efficient.