I now have enough information to write a comprehensive note. Here it is:
Overview
Substrait is a cross-language serialization specification for relational algebra. It defines a language-agnostic format for representing query plans, enabling different systems to exchange plans without building point-to-point integrations. The core idea is to decouple plan producers (query frontends, optimizers) from plan consumers (execution engines).
Before Substrait, exchanging query plans between systems required custom translation layers. A system like Ibis producing plans for both DuckDB and DataFusion needed separate code paths for each backend. Substrait introduces a common intermediate representation (IR) that any producer can emit and any consumer can execute.
Serialization Format
Substrait uses Protocol Buffers as its primary serialization format. The protobuf schema defines messages for relations, expressions, types, and extensions. A plan is serialized as a tree of Rel (relation) messages, where each node represents a relational operation.
The core relation types map directly to relational algebra:
| Relation | Purpose |
|---|---|
ReadRel | Scan data from a source (table, file, virtual table) |
FilterRel | Apply a predicate, emitting rows that satisfy the condition |
ProjectRel | Compute new expressions, select/reorder columns |
JoinRel | Combine two inputs based on a join condition |
AggregateRel | Group and aggregate |
SortRel | Order rows by specified expressions |
FetchRel | Limit and offset (equivalent to SQL LIMIT/OFFSET) |
SetRel | Union, intersection, difference |
A simple query like SELECT name, age FROM users WHERE age > 30 would serialize as:
ProjectRel
└── FilterRel (condition: age > 30)
└── ReadRel (source: users, schema: [id, name, age, ...])
The protobuf representation:
message ProjectRel {
RelCommon common = 1;
Rel input = 2;
repeated Expression expressions = 3;
}
message FilterRel {
RelCommon common = 1;
Rel input = 2;
Expression condition = 3;
}
message ReadRel {
RelCommon common = 1;
NamedStruct base_schema = 2;
Expression filter = 3;
oneof read_type {
VirtualTable virtual_table = 4;
LocalFiles local_files = 5;
NamedTable named_table = 6;
ExtensionTable extension_table = 7;
}
}Extension Mechanism
Substrait cannot define every possible function across all systems — different engines support different scalar functions, aggregates, and window functions with varying semantics. The extension system addresses this by allowing functions to be defined externally and referenced by URI.
Extensions are specified in YAML files that define function signatures:
%YAML 1.2
---
scalar_functions:
- name: "add"
description: "Add two values"
impls:
- args:
- name: x
value: i32
- name: y
value: i32
return: i32
- args:
- name: x
value: fp64
- name: y
value: fp64
return: fp64A Substrait plan references extensions via URIs. The plan includes an extension declaration section that maps short identifiers to full URIs:
Plan {
extension_uris: [
{ extension_uri_anchor: 1, uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml" }
]
extensions: [
{ extension_function: { extension_uri_reference: 1, function_anchor: 100, name: "add:i32_i32" } }
]
relations: [ ... ]
}Within expressions, functions are referenced by their anchor (the integer 100 above), keeping the plan compact while maintaining full traceability to the function definition.
Tip
Substrait ships with a standard extension library covering common functions (arithmetic, string, datetime, comparison). Custom extensions follow the same YAML format and can be hosted anywhere.
Producer and Consumer Architecture
The power of Substrait emerges when multiple producers and consumers exist in an ecosystem:
Producers (generate Substrait plans):
- Ibis: Python dataframe API that can emit Substrait from table expressions
- Calcite: Java query planning framework with Substrait serialization
- Isthmus: SQL-to-Substrait parser (converts SQL text directly to Substrait plans)
Consumers (execute Substrait plans):
- DuckDB: In-process OLAP database with a Substrait extension
- DataFusion: Rust query engine with native Substrait support
- Velox: C++ execution engine (Meta) with Substrait consumer interface
- Acero: C++ execution engine (Arrow project)
This architecture enables composition. Ibis can serve as a user-facing API, emit Substrait, and delegate execution to whichever backend is most appropriate — DuckDB for local analytics, DataFusion for Rust-native pipelines, or Velox for integration with Meta’s infrastructure.
┌─────────────────┐ ┌─────────────────┐
│ Ibis (Python) │ │ SQL Parser │
│ │ │ (Isthmus) │
└────────┬────────┘ └────────┬────────┘
│ │
│ Substrait Plan │
▼ ▼
┌─────────────────────────────────────────┐
│ Substrait (Protobuf) │
└─────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ DuckDB │ │ DataFusion │
└─────────────────┘ └─────────────────┘
Practical Example: Ibis to DuckDB
Ibis can generate a Substrait plan from a Python expression and pass it to DuckDB for execution:
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler
# Define the expression in Ibis
con = ibis.duckdb.connect()
t = con.table("users")
expr = t.filter(t.age > 30).select("name", "age")
# Compile to Substrait
compiler = SubstraitCompiler()
plan = compiler.compile(expr)
# Execute via DuckDB's Substrait consumer
import duckdb
duckdb_con = duckdb.connect()
result = duckdb_con.from_substrait(plan.SerializeToString())
print(result.fetchall())The same plan bytes could be sent to DataFusion, Velox, or any other Substrait consumer without modification.
Current Status and Limitations
As of early 2026, Substrait is approaching version 1.0 but has not yet reached full stability. The specification is largely complete, but some areas remain under active development.
Current limitations:
The specification focuses primarily on logical plans. Physical plan details (partitioning, distribution, parallelism hints) are less mature, which limits Substrait’s applicability for distributed execution planning. Systems that need to exchange physical plans still rely on custom formats.
Not all SQL features have standardized representations. Complex constructs like lateral joins with LIMIT, recursive CTEs, and certain window function frames may require engine-specific extensions or may not be expressible in the current spec.
Consumer implementations vary in completeness. A Substrait plan that works in DuckDB may fail in Velox if it uses relations or functions that Velox hasn’t implemented yet. Testing against multiple backends is essential.
Warning
Substrait does not guarantee semantic equivalence across consumers. Two engines may interpret edge cases (NULL handling, overflow behavior, floating-point precision) differently. The plan format is standardized; the execution semantics depend on the consumer.
When Substrait is Useful
Substrait provides value in scenarios where plan portability matters:
Building a query frontend (like Ibis) that should work with multiple backends without maintaining separate code paths for each engine.
Creating optimizer-as-a-service architectures where optimization happens in one system and execution in another. Recent research (UQO*, VLDB 2025) demonstrates using Substrait to decouple query optimization from execution engines like Spark.
Implementing pushdown to heterogeneous storage systems. The Presto-OCS connector uses Substrait to communicate pushed-down predicates and projections to object storage services.
For systems that only need to talk to a single execution engine, Substrait adds complexity without clear benefit. The value scales with the number of integration points in the architecture.