Parquet Types

Parquet data is stored on disk using a small set of physical types. The valid physical types are:

BOOLEAN
INT32
INT64
INT96 (deprecated, kept for legacy timestamps)
FLOAT
DOUBLE
BYTE_ARRAY
FIXED_LEN_BYTE_ARRAY

These types form the low-level representation, chosen for efficiency and compactness. For instance, INT32 and INT64 offer fixed-size integer storage, while BYTE_ARRAY is used for variable-length data.
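
The difference between fixed-size and variable-length storage can be sketched with Python's struct module. The snippet assumes Parquet's PLAIN encoding, in which INT32 and INT64 values are written as little-endian integers and each BYTE_ARRAY value is preceded by a 4-byte little-endian length. The helper names are hypothetical and not part of any Parquet library.

```python
import struct


def plain_int32(value: int) -> bytes:
    """INT32 under PLAIN encoding: always 4 bytes, little-endian, signed."""
    return struct.pack("<i", value)


def plain_int64(value: int) -> bytes:
    """INT64 under PLAIN encoding: always 8 bytes, little-endian, signed."""
    return struct.pack("<q", value)


def plain_byte_array(data: bytes) -> bytes:
    """BYTE_ARRAY under PLAIN encoding: 4-byte length prefix, then the bytes."""
    return struct.pack("<I", len(data)) + data


if __name__ == "__main__":
    # Integer widths never change; BYTE_ARRAY grows with its payload.
    print(len(plain_int32(7)), len(plain_int64(7)))
    print(len(plain_byte_array(b"hi")), len(plain_byte_array(b"parquet")))
```

Real files usually layer other encodings and compression on top of this, so the sketch only illustrates the fixed versus variable width distinction.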

Logical Types and Their Role

To capture the semantic meaning of data, Parquet uses logical types (also called type annotations). Each logical type maps onto a physical type, so although the on-disk storage uses one of the few physical types, readers can interpret the stored value with richer context. For example, a BYTE_ARRAY annotated as STRING (UTF8 in the older ConvertedType naming) represents a UTF-8 encoded string. The LogicalType structure supersedes the older ConvertedType enum and, because it can carry parameters such as a timestamp's unit, makes this mapping more extensible.
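
As a rough illustration of how an annotation changes interpretation, the sketch below decodes raw physical values under a few logical types. It assumes the conventions of the Parquet format specification: DATE counts days since the Unix epoch, and DECIMAL stores an unscaled integer (big-endian two's complement when held in a byte array). The function names are hypothetical.

```python
from datetime import date, timedelta
from decimal import Decimal

UNIX_EPOCH = date(1970, 1, 1)


def decode_string(raw: bytes) -> str:
    """BYTE_ARRAY annotated as STRING: the bytes are UTF-8 text."""
    return raw.decode("utf-8")


def decode_date(days: int) -> date:
    """INT32 annotated as DATE: days since 1970-01-01."""
    return UNIX_EPOCH + timedelta(days=days)


def _scaled(unscaled: int, scale: int) -> Decimal:
    # Build the Decimal from its digits so no context rounding is applied.
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


def decode_decimal_int(unscaled: int, scale: int) -> Decimal:
    """INT32/INT64 annotated as DECIMAL: value = unscaled * 10**-scale."""
    return _scaled(unscaled, scale)


def decode_decimal_bytes(raw: bytes, scale: int) -> Decimal:
    """BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY annotated as DECIMAL."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return _scaled(unscaled, scale)


if __name__ == "__main__":
    print(decode_string(b"caf\xc3\xa9"))
    print(decode_date(18262))
    print(decode_decimal_int(12345, 2))
    print(decode_decimal_bytes(b"\xff\x85", 1))
```

The same INT32 value 18262 is just a number without an annotation; only the DATE annotation tells a reader to treat it as a calendar day.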

Mapping Logical Types to Physical Types

A logical type restricts which physical types are valid for a given semantic interpretation. Below is a mapping that summarizes the common logical types and their corresponding valid physical types:

Logical Type            | Underlying Physical Type(s)
────────────────────────┼────────────────────────────────────────────
UTF8 (STRING)           | BYTE_ARRAY
ENUM                    | BYTE_ARRAY
DECIMAL                 | INT32, INT64, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
DATE                    | INT32
TIME (MILLIS)           | INT32
TIME (MICROS)           | INT64
TIMESTAMP (MILLIS)      | INT64
TIMESTAMP (MICROS)      | INT64
INTERVAL                | FIXED_LEN_BYTE_ARRAY (length = 12)

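For example, if a field is annotated as DATE, its physical type must be INT32. A DECIMAL, by contrast, can be stored in several physical forms, provided the chosen type is wide enough for the declared precision: up to 9 digits fit in INT32 and up to 18 in INT64, while the byte-array forms can hold larger precisions.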
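
One way to make this mapping concrete is a small lookup that checks whether an annotation is allowed on a physical type. This is only a sketch: the dictionary and function names are hypothetical, and the entries follow the Parquet format specification, under which STRING and ENUM annotate BYTE_ARRAY only.

```python
from typing import Optional

# Allowed physical types per logical type (names chosen for this sketch).
VALID_PHYSICAL_TYPES = {
    "STRING": {"BYTE_ARRAY"},
    "ENUM": {"BYTE_ARRAY"},
    "DECIMAL": {"INT32", "INT64", "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"},
    "DATE": {"INT32"},
    "TIME_MILLIS": {"INT32"},
    "TIME_MICROS": {"INT64"},
    "TIMESTAMP_MILLIS": {"INT64"},
    "TIMESTAMP_MICROS": {"INT64"},
    "INTERVAL": {"FIXED_LEN_BYTE_ARRAY"},
}

# Largest decimal precision that fits in each integer physical type.
MAX_DECIMAL_PRECISION = {"INT32": 9, "INT64": 18}


def is_valid_annotation(
    logical: str,
    physical: str,
    precision: Optional[int] = None,
    type_length: Optional[int] = None,
) -> bool:
    """Return True if `logical` may annotate a column of type `physical`."""
    allowed = VALID_PHYSICAL_TYPES.get(logical)
    if allowed is None or physical not in allowed:
        return False
    if logical == "INTERVAL":
        # INTERVAL requires a fixed length of exactly 12 bytes.
        return type_length == 12
    if logical == "DECIMAL" and precision is not None:
        limit = MAX_DECIMAL_PRECISION.get(physical)
        if limit is not None:
            return 1 <= precision <= limit
    return True


if __name__ == "__main__":
    print(is_valid_annotation("DATE", "INT32"))
    print(is_valid_annotation("DECIMAL", "INT32", precision=10))
```

A real reader would take these values from the SchemaElement metadata instead of string arguments; the strings here just keep the example self-contained.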

Schema Storage in the Footer

A Parquet file is self-describing. Its footer, located at the end of the file, contains the complete schema and metadata, serialized with Thrift's compact protocol. The last 8 bytes of a Parquet file are critical: the final 4 bytes are the magic string “PAR1”, and the 4 bytes immediately before that encode the footer length as a little-endian 32-bit integer. This allows a reader to seek to the start of the metadata, which includes an array of SchemaElement objects representing each field’s name, physical type, repetition type, and any logical type annotations.

A Python example using the struct module illustrates how to read these final bytes:

import os
import struct

with open("example.parquet", "rb") as f:
    # The file ends with: <4-byte little-endian footer length> <4-byte magic "PAR1">
    f.seek(-8, os.SEEK_END)
    footer_length = struct.unpack("<I", f.read(4))[0]
    magic = f.read(4)
    if magic != b"PAR1":
        raise ValueError("Not a valid Parquet file!")
    # The footer metadata sits immediately before those 8 trailing bytes
    f.seek(-8 - footer_length, os.SEEK_END)
    footer = f.read(footer_length)

print("Footer length:", footer_length)

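This snippet checks the trailing magic bytes and extracts the raw footer, where the schema is stored. The extracted bytes are still Thrift-encoded, so turning them into a schema requires a Thrift decoder or a Parquet library.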
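
To exercise the same tail-parsing logic without a real file, the layout can be mimicked in memory. The sketch below is a toy under stated assumptions: `build_fake_parquet` and `split_footer` are hypothetical helpers, and the footer payload is placeholder bytes, not real Thrift-encoded metadata. It also relies on the fact that a Parquet file begins with the same 4-byte magic it ends with.

```python
import struct

MAGIC = b"PAR1"


def build_fake_parquet(body: bytes, footer: bytes) -> bytes:
    """Assemble bytes laid out like a Parquet file (placeholder contents)."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def split_footer(data: bytes) -> bytes:
    """Return the footer bytes, validating magic and length along the way."""
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_length
    if footer_start < 4:
        # The declared footer would overlap the leading magic or run off the file.
        raise ValueError("footer length exceeds file size")
    return data[footer_start:-8]


if __name__ == "__main__":
    fake = build_fake_parquet(b"row groups", b"metadata")
    print(split_footer(fake))
```

Checking the declared length against the file size is worth keeping in real code too, since a corrupt length field would otherwise cause a seek before the start of the file.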

Reading the Schema with PyArrow

Using PyArrow, one can easily inspect a Parquet file’s schema. For example:

import pyarrow.parquet as pq
 
parquet_file = "example.parquet"
pf = pq.ParquetFile(parquet_file)
print(pf.schema)

The output shows each field’s physical type and any logical type annotation (e.g., a BYTE_ARRAY annotated as STRING).

Variant Shredding and Value Representations

To handle variant fields (columns that may hold heterogeneous types, such as a JSON field that might be a string, an int64, or a boolean), Parquet proposes a mechanism called Variant Shredding. The proposal is still being debated.
