GeoParquet

What GeoParquet Is

GeoParquet is a metadata convention for storing geospatial vector data in Apache Parquet files. The specification standardizes how to encode geometries as binary data and what metadata structure to write in the Parquet file footer to enable tool interoperability. GeoParquet does not define new Parquet column types: geometries are stored as standard BYTE_ARRAY columns containing Well-Known Binary (WKB) blobs, with a JSON structure in the file metadata describing the spatial properties.

The specification emerged to solve a practical problem: Parquet had no native support for geospatial data, so tools needed a common convention for storing geometries and their associated metadata like coordinate reference systems, bounding boxes, and geometry type information. GeoParquet became that standard, adopted by GDAL, GeoPandas, DuckDB, Sedona, and other geospatial tools.

WKB Encoding and Geometry Types

Well-Known Binary (WKB) is an ISO 19125-1 standard that serializes geometries as byte arrays. A Point(1.0, 2.0) in WKB has this structure:

Byte 0    | Bytes 1-4  | Bytes 5-12        | Bytes 13-20
01        | 01000000   | 000000000000F03F  | 0000000000000040
(endian)  | (type=1)   | (x=1.0)           | (y=2.0)

The first byte indicates endianness (01 = little-endian). Bytes 1-4 contain the geometry type as a uint32. The remaining bytes encode coordinate data as doubles.
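As a minimal sketch of this layout, the 21-byte blob can be packed and unpacked with Python's standard struct module. The helper names encode_wkb_point and decode_wkb_point are illustrative, not part of any library, and only 2D points are handled.

```python
import struct

WKB_POINT = 1  # geometry type code for Point


def encode_wkb_point(x: float, y: float) -> bytes:
    """Pack a 2D point as little-endian WKB (21 bytes)."""
    # "<" = little-endian, no padding; B = byte-order flag, I = uint32 type, d = double
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(blob: bytes) -> tuple[float, float]:
    """Unpack a 2D WKB point, honoring the byte-order flag in byte 0."""
    order = "<" if blob[0] == 1 else ">"
    geom_type, x, y = struct.unpack(order + "Idd", blob[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"expected Point (1), got type code {geom_type}")
    return x, y


blob = encode_wkb_point(1.0, 2.0)
print(blob.hex())
print(decode_wkb_point(blob))
```

The printed hex string can be compared byte by byte against the layout table above.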

WKB defines seven fundamental geometry types, each with a numeric code:

Point (1): Single coordinate pair representing a location. Example: a restaurant at latitude 40.7128, longitude -74.0060 (stored in WKB as x = -74.0060, y = 40.7128).

LineString (2): Ordered sequence of points forming a path. Example: a road segment connecting multiple intersections.

Polygon (3): Closed area with an exterior ring and optional interior rings (holes). Example: a building footprint or lake boundary.

MultiPoint (4): Collection of separate points. Example: multiple sensor locations for a network.

MultiLineString (5): Collection of separate line paths. Example: a river system with multiple branches.

MultiPolygon (6): Collection of separate polygons. Example: Hawaii’s islands or a country with non-contiguous territories.

GeometryCollection (7): Arbitrary mix of any geometry types. Example: a city infrastructure dataset with mixed features.

A critical aspect of WKB encoding is that a single Parquet column can contain multiple geometry types in different rows. This differs from typical Parquet columns where every value has the same type (all INT32 or all DOUBLE). With WKB, the physical type is BYTE_ARRAY for all rows, but the logical geometry type varies:

row_id	geometry (BYTE_ARRAY)	Actual type
1	`\x01\x01\x00\x00\x00...`	Point
2	`\x01\x03\x00\x00\x00...`	Polygon
3	`\x01\x06\x00\x00\x00...`	MultiPolygon

This flexibility makes WKB suitable for heterogeneous geospatial datasets like OpenStreetMap, where a single table contains points (nodes), linestrings (ways), and polygons (areas).
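A rough sketch of how a reader could discover the types present in such a column by inspecting only the 5-byte header of each blob. The function names are illustrative, and the code assumes ISO WKB, where Z/M variants offset the base type code by multiples of 1000.

```python
import struct
from collections import Counter

WKB_TYPE_NAMES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}


def wkb_geometry_type(blob: bytes) -> str:
    """Return the geometry type name from the 5-byte WKB header."""
    order = "<" if blob[0] == 1 else ">"
    (code,) = struct.unpack(order + "I", blob[1:5])
    # Assumption: ISO WKB, where Z/M/ZM variants add 1000/2000/3000 to the base code.
    return WKB_TYPE_NAMES[code % 1000]


def column_geometry_types(blobs: list[bytes]) -> Counter:
    """Count geometry types across the rows of one column."""
    return Counter(wkb_geometry_type(b) for b in blobs)


rows = [
    bytes.fromhex("0101000000"),  # Point header (coordinates omitted)
    bytes.fromhex("0103000000"),  # Polygon header
    bytes.fromhex("0106000000"),  # MultiPolygon header
]
print(column_geometry_types(rows))
```

Doing this for every row is a full column scan, so it only makes sense when no summary of the types is available elsewhere.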

Metadata Structure and Location

Parquet files have a file-level key-value metadata map stored in the file footer, distinct from row group statistics and page-level statistics. GeoParquet adds a single entry with key "geo" containing a JSON string:

FileMetaData {
    schema: SchemaElement[]
    row_groups: RowGroup[]
    key_value_metadata: [
        KeyValue { key: "geo", value: "{...}" }
    ]
}

The JSON structure specifies schema-level properties for all geometry columns:

{
  "version": "1.1.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "crs": {
        "type": "GeographicCRS",
        "name": "WGS 84",
        "id": { "authority": "EPSG", "code": 4326 }
      },
      "orientation": "counterclockwise",
      "edges": "planar",
      "bbox": [-180.0, -90.0, 180.0, 90.0]
    }
  }
}

The geometry_types array lists all types present in that column. Without this metadata, discovering what types exist requires scanning every row and parsing the WKB type code. With the metadata, readers know upfront “this column contains Polygons and MultiPolygons” which enables parser optimizations. A homogeneous column would list a single type like ["Point"], allowing specialized fast parsers instead of generic ones handling all seven types.

The bbox field represents the spatial extent of all geometries in the entire file as [xmin, ymin, xmax, ymax]. This enables file-level filtering without reading any data pages.

The orientation field, when set to "counterclockwise", asserts that every polygon's exterior ring is wound counterclockwise and its interior rings (holes) clockwise. When the field is absent, readers cannot assume any winding order. The edges field distinguishes between planar (flat) and spherical (geodesic) edge interpolation, affecting distance and area calculations.
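A small sketch of how a reader might pull these fields out of the "geo" value once it has the footer. The function name summarize_primary_column is an assumption for illustration; the JSON is an abbreviated copy of the example above.

```python
import json

# Abbreviated copy of the "geo" value shown above (crs omitted for brevity).
geo_value = """
{
  "version": "1.1.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "edges": "planar",
      "bbox": [-180.0, -90.0, 180.0, 90.0]
    }
  }
}
"""


def summarize_primary_column(geo_json: str) -> dict:
    """Extract the fields a reader typically checks before touching data pages."""
    geo = json.loads(geo_json)
    column = geo["columns"][geo["primary_column"]]
    return {
        "name": geo["primary_column"],
        "encoding": column["encoding"],
        # An empty or missing list means the types are not known up front.
        "geometry_types": column.get("geometry_types", []),
        "bbox": column.get("bbox"),  # optional in the metadata
        "edges": column.get("edges", "planar"),  # planar is the default
    }


print(summarize_primary_column(geo_value))
```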

Coordinate Reference Systems

A Coordinate Reference System (CRS) defines how coordinate numbers map to actual positions on Earth. The coordinates (500000, 4649776) have completely different meanings depending on the CRS:

EPSG:4326 (WGS 84): These coordinates are invalid—latitude/longitude ranges are ±90° and ±180°.

EPSG:3857 (Web Mercator): (500000, 4649776) represents a point at roughly 4.5°E, 38.5°N in the western Mediterranean, measured in meters east of the prime meridian and north of the equator.

EPSG:32610 (UTM Zone 10N): (500000, 4649776) represents a point on the zone's central meridian (123°W) at roughly 42°N, near the California-Oregon border, in meters using a regional projection optimized for accuracy in that zone.

WKB encoding stores only coordinate numbers without semantic meaning. Without CRS information, you cannot correctly render geometries on a map, compute distances, or transform coordinates between systems.
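To make the ambiguity concrete, the sketch below checks the same pair against longitude/latitude ranges and then inverts the spherical Web Mercator formula to see where the pair lands under EPSG:3857. The function names are illustrative; real code would normally delegate to a projection library rather than hand-rolled formulas.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by Web Mercator


def is_valid_lonlat(x: float, y: float) -> bool:
    """True if (x, y) fits the longitude/latitude ranges of a geographic CRS."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


def web_mercator_to_lonlat(x: float, y: float) -> tuple[float, float]:
    """Invert the spherical Web Mercator projection (EPSG:3857) to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


x, y = 500000.0, 4649776.0
print(is_valid_lonlat(x, y))
lon, lat = web_mercator_to_lonlat(x, y)
print(round(lon, 2), round(lat, 2))
```

The numbers in the blob never change; only the CRS decides which of these interpretations applies.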

GeoParquet stores CRS using PROJJSON format (PROJ v6+), a JSON encoding of the WKT2:2019 standard:

"crs": {
  "type": "GeographicCRS",
  "name": "WGS 84",
  "datum": {
    "type": "GeodeticReferenceFrame",
    "name": "World Geodetic System 1984"
  },
  "coordinate_system": {
    "subtype": "ellipsoidal",
    "axis": [...]
  },
  "id": { "authority": "EPSG", "code": 4326 }
}

For common coordinate systems, the minimal form with just the EPSG identifier is sufficient. If omitted, the default is OGC:CRS84 (WGS84 in longitude-latitude order).
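A sketch of how a reader might resolve a column's CRS from its metadata entry, including the default. The function name crs_label is hypothetical. The handling of an explicit null follows the GeoParquet specification, where null marks the CRS as unknown or undefined.

```python
DEFAULT_CRS = "OGC:CRS84"  # longitude-latitude WGS 84, used when "crs" is absent


def crs_label(column_meta: dict) -> str:
    """Return a short label for a column's CRS from its GeoParquet metadata."""
    if "crs" not in column_meta:
        return DEFAULT_CRS
    crs = column_meta["crs"]
    if crs is None:
        # An explicit null means the CRS is unknown or undefined.
        return "unknown"
    ident = crs.get("id")
    if ident:
        return f'{ident["authority"]}:{ident["code"]}'
    # Fall back to the human-readable name if no identifier is present.
    return crs.get("name", "unnamed CRS")


print(crs_label({"crs": {"name": "WGS 84", "id": {"authority": "EPSG", "code": 4326}}}))
print(crs_label({}))
```

Note the difference between a missing key (default applies) and a null value (no CRS can be assumed), which is easy to conflate when using dict.get.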

Query Optimization Mechanisms

The metadata enables spatial filtering at two levels without requiring modifications to Parquet itself.

File-level filtering occurs during query planning before reading data pages. For a query SELECT * FROM dataset WHERE ST_Intersects(geometry, 'POLYGON(...)'), the query engine:

1. Reads the Parquet file footer (always read first for schema and metadata)
2. Extracts the "bbox": [-180, -90, 180, 90] from the geo JSON
3. Computes the query polygon’s bounding box, e.g., [10, 10, 20, 20]
4. Tests geometric intersection: query_bbox.intersects(file_bbox)
5. Skips the entire file if no intersection

This is simple bounding box intersection testing, not Parquet-specific predicate pushdown. The optimization relies on the file-level bbox being a conservative bound for all geometries inside.
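The file-skipping steps above can be sketched in a few lines. The file names and extents are made up for illustration, and the test assumes boxes that do not cross the antimeridian (xmin <= xmax).

```python
BBox = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes overlap or touch."""
    axmin, aymin, axmax, aymax = a
    bxmin, bymin, bxmax, bymax = b
    return axmin <= bxmax and bxmin <= axmax and aymin <= bymax and bymin <= aymax


def files_to_scan(file_bboxes: dict[str, BBox], query_bbox: BBox) -> list[str]:
    """Keep only files whose footer bbox overlaps the query's bbox."""
    return [name for name, box in file_bboxes.items() if bbox_intersects(box, query_bbox)]


# Hypothetical file names and extents, standing in for values read from each footer.
footers = {
    "africa.parquet": (-20.0, -35.0, 52.0, 38.0),
    "oceania.parquet": (110.0, -50.0, 180.0, 0.0),
}
print(files_to_scan(footers, (10.0, 10.0, 20.0, 20.0)))
```

A file that passes this test may still contain no matching geometry; the exact ST_Intersects predicate has to run on the rows that survive.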

Row group filtering requires per-row-group spatial statistics. Since WKB blobs in BYTE_ARRAY columns have no meaningful min/max ordering, standard Parquet statistics don’t work. GeoParquet v1.1 introduced covering columns as a workaround:

"covering": {
  "bbox": {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"]
  }
}

The Parquet file contains actual DOUBLE columns for each bbox component:

geometry (BYTE_ARRAY)	bbox.xmin	bbox.ymin	bbox.xmax	bbox.ymax
<WKB blob 1>	10.0	20.0	15.0	25.0
<WKB blob 2>	50.0	60.0	55.0	65.0

These bbox columns have standard Parquet min/max statistics per row group. A query engine reads these statistics and skips row groups where the query polygon doesn’t intersect the bbox bounds. This approach requires storing redundant data—the bbox is derivable from the geometry but materialized separately for performance.
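A minimal sketch of the pruning decision, with the per-row-group statistics modeled as a plain dataclass instead of being read from a real footer. BBoxColumnStats and row_groups_to_read are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BBoxColumnStats:
    """Min/max statistics of the four bbox columns for one row group (hypothetical holder)."""
    xmin_min: float
    ymin_min: float
    xmax_max: float
    ymax_max: float


def row_groups_to_read(stats: list[BBoxColumnStats],
                       query: tuple[float, float, float, float]) -> list[int]:
    """Return indices of row groups whose combined extent may overlap the query box."""
    qxmin, qymin, qxmax, qymax = query
    keep = []
    for index, s in enumerate(stats):
        # min(xmin) .. max(xmax) bounds every geometry in the row group (same for y).
        overlaps = (s.xmin_min <= qxmax and qxmin <= s.xmax_max
                    and s.ymin_min <= qymax and qymin <= s.ymax_max)
        if overlaps:
            keep.append(index)
    return keep


# One row group holding the two example rows from the table above.
stats = [BBoxColumnStats(xmin_min=10.0, ymin_min=20.0, xmax_max=55.0, ymax_max=65.0)]
print(row_groups_to_read(stats, (0.0, 0.0, 5.0, 5.0)))
print(row_groups_to_read(stats, (12.0, 22.0, 13.0, 23.0)))
```

The bound is conservative: a query box at (30, 40, 31, 41) would keep this row group even though neither geometry lies there, which is why spatially sorting rows before writing makes the statistics far more selective.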

Warning

Many implementations skip row group filtering entirely and rely only on file-level filtering, especially when files are spatially partitioned or appropriately sized for query patterns.

Parquet Native GEOMETRY and GEOGRAPHY Types

Version 2.11.0 of the Apache Parquet format specification (March 2025) introduced native GEOMETRY and GEOGRAPHY logical types. These types provide built-in support for geometries with integrated statistics, eliminating the need for covering columns.

The GEOMETRY logical type annotates BYTE_ARRAY columns and includes a CRS parameter:

struct GeometryType {
  1: optional string crs;
}

The CRS is stored as a simple string in the schema, typically an EPSG code like "EPSG:4326" or "EPSG:32620". If omitted, it defaults to "OGC:CRS84" (WGS84 in longitude-latitude order). This CRS information lives in the column schema annotation, not in file metadata.

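The GEOGRAPHY type also carries an optional CRS string (which must name a geographic CRS and defaults to "OGC:CRS84"), plus an optional edge-interpolation algorithm that defaults to spherical. Its edges follow the shortest path over the sphere or ellipsoid instead of straight lines in coordinate space, which gives accurate distance and area calculations on the globe.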

The critical advancement is built-in bounding box statistics at the row group level:

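struct BoundingBox {
  1: required double xmin;
  2: required double xmax;
  3: required double ymin;
  4: required double ymax;
  // optional zmin/zmax and mmin/mmax fields follow
}
 
struct GeospatialStatistics {
  1: optional BoundingBox bbox;
  2: optional list<i32> geospatial_types;
}
 
struct ColumnMetaData {
  ...
  17: optional GeospatialStatistics geospatial_statistics;
}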

These statistics are written directly into ColumnMetaData for each row group, enabling row group skipping without covering columns. Query engines read the bbox statistics and filter row groups using standard Parquet mechanisms.
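On the writer side, producing such a statistic amounts to tracking a running extent while values are appended to a column chunk. The sketch below shows the idea with a hypothetical ChunkBBox class; it is not the API of any Parquet library.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkBBox:
    """Running 2D extent for the geometries written to one column chunk (hypothetical name)."""
    xmin: float = math.inf
    ymin: float = math.inf
    xmax: float = -math.inf
    ymax: float = -math.inf

    def update(self, x: float, y: float) -> None:
        """Grow the box to include one coordinate."""
        self.xmin, self.xmax = min(self.xmin, x), max(self.xmax, x)
        self.ymin, self.ymax = min(self.ymin, y), max(self.ymax, y)

    def is_empty(self) -> bool:
        """True until at least one coordinate has been seen."""
        return self.xmin > self.xmax


box = ChunkBBox()
for x, y in [(10.0, 20.0), (15.0, 25.0), (50.0, 60.0), (55.0, 65.0)]:
    box.update(x, y)
print(box)
```

Because the extent is derived while writing, readers get the same pruning information as with covering columns without any extra columns in the schema.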

Aspect	GeoParquet 1.x	Parquet GEOMETRY Type
CRS Storage	PROJJSON in file metadata	CRS string in the column's schema annotation
CRS Format	Complete definition with datum, coordinate system	Simple identifier string
Row Group Stats	Requires covering columns (v1.1+)	Built-in bbox statistics
Geometry Types	Listed in metadata	Type codes in per-column-chunk statistics
File-level bbox	In geo metadata	Not standardized
Compatibility	Any Parquet version	Requires 2.11.0+

The native GEOMETRY type solves the statistics problem elegantly but provides minimal CRS information—just an EPSG code that tools must look up externally. GeoParquet’s PROJJSON format is self-contained with complete CRS definitions including datum, coordinate system, and transformation parameters.

Important

A file using native GEOMETRY types is not automatically GeoParquet-compliant. GeoParquet requires the specific "geo" metadata structure in the file footer. Conversely, a GeoParquet file can use GEOMETRY types with the geo metadata providing additional CRS detail.

The Transition to GeoParquet 2.0

In February 2025, the GeoParquet community announced GeoParquet 2.0: Going Native, a transition to use Parquet’s native GEOMETRY and GEOGRAPHY types. This development reflects the reality that native types solve most of GeoParquet v1.0’s problems:

What native types eliminate:

The need for covering columns (bbox statistics are built-in)
Complex encoding specifications (type system handles it)
Geometry type discovery (type codes are recorded in the column statistics)

What remains valuable from GeoParquet:

Rich PROJJSON CRS definitions for complex coordinate systems
File-level bbox for fast file skipping
Standardized metadata keys for tool interoperability
Additional semantics: orientation, edges, epoch

The practical reality is that for most use cases, an EPSG code string is sufficient. If 90% of geospatial data uses WGS84 or Web Mercator, the complexity of full PROJJSON definitions becomes unnecessary. Modern data catalogs (Unity Catalog, Iceberg) can store CRS as table-level metadata rather than duplicating it in every file’s footer.

GeoParquet v1.0 will remain relevant for WKB-based files and backward compatibility, but new implementations are adopting native types with minimal metadata overlays. The transition period means tools must support both approaches—legacy GeoParquet v1.0 files and newer native-type files.
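During the transition, a reader has to decide which convention a given file follows. The sketch below shows one possible classification, using simplified, hypothetical views of the footer: a dict of key-value metadata and a dict mapping column names to logical type names.

```python
def classify_geospatial_file(key_value_metadata: dict[str, str],
                             logical_types: dict[str, str]) -> str:
    """Classify a file by which geospatial convention(s) it carries.

    Both arguments are hypothetical, simplified views of a footer:
    key_value_metadata maps footer keys to string values, and logical_types
    maps column names to the name of their Parquet logical type.
    """
    has_geo_key = "geo" in key_value_metadata
    has_native = any(t in ("GEOMETRY", "GEOGRAPHY") for t in logical_types.values())
    if has_geo_key and has_native:
        return "geo metadata + native types"
    if has_geo_key:
        return "GeoParquet 1.x (WKB + geo metadata)"
    if has_native:
        return "native types only"
    return "no geospatial metadata"


print(classify_geospatial_file({"geo": "{...}"}, {"geometry": "NONE", "id": "INT"}))
print(classify_geospatial_file({}, {"geometry": "GEOMETRY"}))
```

A real reader would then pick its CRS and statistics sources accordingly: the geo JSON and covering columns in the first case, the schema annotation and built-in statistics in the second.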

Writing GeoParquet from Python

GeoPandas writes GeoParquet through GeoDataFrame.to_parquet; recent releases (0.14 and later) emit version 1.0.0 metadata by default:

import geopandas as gpd
import pyarrow.parquet as pq
 
# Read any vector format GeoPandas supports and write it as GeoParquet
gdf = gpd.read_file("input.geojson")
gdf.to_parquet("output.parquet", compression="snappy")
 
# Verify: the footer's key-value metadata stores the "geo" JSON under a bytes key
metadata = pq.read_metadata("output.parquet")
geo_metadata = metadata.metadata[b"geo"].decode("utf-8")
print(geo_metadata)

For manual control over the metadata structure:

import json
import pyarrow as pa
import pyarrow.parquet as pq
 
geo_meta = {
    "version": "1.1.0",
    "primary_column": "geom",
    "columns": {
        "geom": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "crs": {
                "id": { "authority": "EPSG", "code": 4326 }
            },
            "bbox": [0, 0, 100, 100]
        }
    }
}
 
# Create table with WKB-encoded geometry (the blob below is truncated; supply a complete WKB value)
wkb_data = [b'\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?...']
table = pa.table({"geom": wkb_data, "id": [1]})
 
# Attach geo metadata to schema
metadata = table.schema.metadata or {}
metadata[b'geo'] = json.dumps(geo_meta).encode('utf-8')
table = table.replace_schema_metadata(metadata)
 
pq.write_table(table, "output.parquet")

For writing files with native GEOMETRY types, support arrived only in recent PyArrow releases (the Arrow 21.0 line, used together with GeoArrow extension types), and tooling is still maturing as of December 2025.
