IndexTables for Spark
IndexTables is an open-table format for Apache Spark that stores data in Tantivy search index files instead of Parquet. Where Iceberg and Delta Lake wrap columnar files with a transaction log, IndexTables wraps full-text search indexes with a transaction log — every column is indexed by default, and SQL predicates translate to native search engine operations rather than full scans.
Prerequisites
- Apache Spark Query Planning and Execution — logical/physical plans, Catalyst optimizer
- Spark SQL Extension Framework — how Spark extensions inject custom parsers and rules
- Arrow C data interface — the C Data Interface used for zero-copy JVM ↔ Rust exchange
Architecture overview
IndexTables has three layers. The Spark driver compiles SQL through a custom parser and Catalyst rules; executors use Arrow FFI to exchange data zero-copy with the Rust-based Tantivy engine; and splits and transaction log versions live on cloud or local storage.
Why it exists
For log observability, cybersecurity investigations, and similar use cases, analysts need sub-second query response over billions of rows. Traditional columnar formats either require full scans or offer only limited predicate pushdown. IndexTables embeds a search engine directly in Spark executors, which provides:
- Full-text search via a custom indexquery SQL operator (Tantivy query syntax)
- Predicate pushdown that converts WHERE clauses to native Tantivy queries
- Aggregate pushdown (COUNT, SUM, AVG, MIN, MAX) executed inside the search engine
- Min/max data skipping — same idea as Iceberg/Delta, but evaluated in a single JNI call
- No external infrastructure — runs entirely within Spark, no Elasticsearch or Solr cluster
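
The min/max data skipping item in the list above can be illustrated with a minimal sketch. The `SplitStats` type, its field names, and the integer column are hypothetical and chosen only for illustration; the real engine evaluates this natively, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    """Hypothetical per-split statistics for a single numeric column."""
    path: str
    min_value: int
    max_value: int


def may_contain(split: SplitStats, lower: int, upper: int) -> bool:
    """True if the split's [min, max] range overlaps the query range [lower, upper]."""
    return split.min_value <= upper and split.max_value >= lower


def prune(splits, lower, upper):
    """Keep only the splits that could hold a row matching the range predicate."""
    return [s for s in splits if may_contain(s, lower, upper)]


splits = [
    SplitStats("a.split", 0, 99),
    SplitStats("b.split", 100, 199),
    SplitStats("c.split", 200, 299),
]

# WHERE value BETWEEN 150 AND 220 never needs to open a.split.
survivors = [s.path for s in prune(splits, 150, 220)]
```

Skipping is conservative: a surviving split may still contain no matching rows, but a pruned split cannot contain any.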
Core concepts
Splits
The unit of storage is a split — a self-contained Tantivy index archive in the Quickwit .split format. Each split contains:
- A Tantivy inverted index (term dictionaries, postings lists, positions)
- Fast fields (columnar storage for aggregation-eligible fields)
- A footer with byte offsets that enables lazy loading — only the footer needs to be fetched initially before deciding whether to read the rest
Splits are analogous to Parquet files in Delta Lake: the transaction log tracks which splits are live, and operations like MERGE SPLITS compact small splits into larger ones (default target: 5 GB).
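
To make the compaction idea concrete, here is a rough sketch of grouping small splits toward the 5 GB target. The greedy smallest-first strategy, the function name, and the reading of "5 GB" as 5 × 1024³ bytes are all assumptions; the source does not describe how MERGE SPLITS actually chooses its groups.

```python
# Assumption: the 5 GB default target is interpreted as binary gigabytes.
TARGET_BYTES = 5 * 1024**3


def plan_merges(split_sizes, target=TARGET_BYTES):
    """Greedily group splits (name -> size in bytes) into merge groups.

    Splits at or above the target are left alone, each group's total size
    stays at or below the target, and single-member groups are dropped
    because merging one split with nothing gains nothing.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(split_sizes.items(), key=lambda kv: kv[1]):
        if size >= target:
            continue  # already large enough
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]
```

With splits of 1, 2, 3, and 6 GB, this plan merges only the 1 GB and 2 GB splits: adding the 3 GB split would exceed the target, and the 6 GB split is already above it.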
Transaction log
The transaction log is modeled after Delta Lake’s protocol. Each version is a file containing newline-delimited JSON actions:
| Action | Purpose |
|---|---|
| MetadataAction | Schema, table properties, partitioning |
| ProtocolAction | Format version (V1–V4) |
| AddAction | Register a new split (carries min/max stats, footer offsets, doc mapping, merge count) |
| RemoveAction | Tombstone a deleted split |
| SkipAction | Mark a split as skipped during listing |
Protocol V4 introduces an Avro state format with shared manifest files for faster log reads. The log implementation delegates to a native Rust module via JNI (tantivy4java) for optimistic concurrency, automatic retry, and LRU caching with configurable TTL.
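
The following sketch shows how replaying newline-delimited JSON actions yields the set of live splits. The JSON keys (`add`, `remove`, `path`, `protocol`) are assumptions borrowed from Delta Lake's conventions, not the confirmed IndexTables wire format, and the Avro state format, checkpoints, and SkipAction handling are left out.

```python
import json


def live_splits(versions):
    """Replay log versions in order and return the set of live split paths.

    `versions` is an iterable of strings, one per log version, each holding
    newline-delimited JSON actions.
    """
    live = set()
    for body in versions:
        for line in body.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Metadata, protocol, and skip actions do not change the
            # live set in this simplified sketch.
    return live


# Version 0 writes two splits; version 1 records a merge that replaces them.
v0 = "\n".join([
    '{"protocol": {"version": 4}}',
    '{"add": {"path": "a.split"}}',
    '{"add": {"path": "b.split"}}',
])
v1 = "\n".join([
    '{"remove": {"path": "a.split"}}',
    '{"remove": {"path": "b.split"}}',
    '{"add": {"path": "ab.split"}}',
])
```

Replaying only `v0` gives both original splits, while replaying `v0` then `v1` leaves just the merged split, which is how a reader can see a consistent snapshot at any version.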
Field type system
Every Spark column maps to a Tantivy field type. The mapping is configured per-field:
| Tantivy type | Spark config value | Behaviour |
|---|---|---|
| Text | text | Full-text analyzed — tokenized, lowercased, searchable via indexquery |
| String | string (default for StringType) | Exact match — raw tokenizer, no analysis |
| JSON | json | Nested JSON fields, searchable by path |
| IP | ip | IPv4/IPv6 addresses with range query support |
Configuration uses spark.indextables.indexing.typemap.<field> = text|string|json|ip. Tokenizer and index record options (basic, freq, position) are configurable per-field.
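
As an illustration of the per-field mapping, the helper below builds typemap configuration entries from a plain dictionary and rejects values outside the four documented types. The helper name and the sample field names are hypothetical; only the key prefix and the allowed values come from the text above.

```python
VALID_TYPES = {"text", "string", "json", "ip"}
PREFIX = "spark.indextables.indexing.typemap."


def typemap_options(field_types):
    """Turn {field: type} into spark.indextables.indexing.typemap.* entries."""
    options = {}
    for field, kind in field_types.items():
        if kind not in VALID_TYPES:
            raise ValueError(f"unsupported type {kind!r} for field {field!r}")
        options[PREFIX + field] = kind
    return options


opts = typemap_options({"message": "text", "src_ip": "ip", "payload": "json"})
```

A dictionary like `opts` could then be supplied as write options or session configuration, in line with the configuration hierarchy described later in this document.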
SQL extensions
IndexTables registers a custom SQL parser and Catalyst rules via SparkSessionExtensions. Beyond standard SELECT/INSERT, it adds:
| Command | What it does |
|---|---|
| WHERE col indexquery 'term1 AND term2' | Full-text search: translates to Tantivy query syntax |
| MERGE SPLITS table | Compact small splits into larger ones (like Delta’s OPTIMIZE) |
| PURGE INDEXTABLE table | Remove tombstoned splits from storage |
| PREWARM CACHE table | Pre-download splits to local NVMe cache |
| CHECKPOINT table | Force a transaction log checkpoint |
| BUILD INDEXTABLES COMPANION FOR DELTA table | Build a search index over an existing Delta/Iceberg/Parquet table |
| DESCRIBE INDEXTABLE table | Show table metadata, split count, storage stats |
The indexquery operator is preprocessed by the parser into a function call (tantivy4spark_indexquery), which a Catalyst rule then intercepts and converts into a MixedBooleanFilter tree for pushdown to the scan builder.
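
The preprocessing step can be approximated with a regular expression, shown here purely as an illustration. The argument order of `tantivy4spark_indexquery` is an assumption, and a real parser would tokenize the statement instead of pattern-matching the raw text, so this sketch does not handle quoted identifiers or the operator appearing inside string literals.

```python
import re

# <column> indexquery '<query>', where the query may contain '' as an escaped quote.
_INDEXQUERY = re.compile(
    r"(?P<col>[A-Za-z_][\w.]*)\s+indexquery\s+'(?P<query>(?:[^']|'')*)'",
    re.IGNORECASE,
)


def rewrite_indexquery(sql):
    """Rewrite each `col indexquery 'q'` into a function-call form."""
    def to_call(match):
        # Assumed signature: (column, query string).
        return f"tantivy4spark_indexquery({match.group('col')}, '{match.group('query')}')"

    return _INDEXQUERY.sub(to_call, sql)


rewritten = rewrite_indexquery(
    "SELECT * FROM logs WHERE message indexquery 'error AND timeout'"
)
# rewritten == "SELECT * FROM logs WHERE tantivy4spark_indexquery(message, 'error AND timeout')"
```

Once the operator is an ordinary function call, the standard Spark SQL grammar can parse the statement, and a Catalyst rule can recognize the call by name.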
Companion mode
IndexTables can operate as a search sidecar for an existing Delta Lake, Iceberg, or Parquet table. The BUILD INDEXTABLES COMPANION command reads data files from the source table, builds Tantivy indexes for selected columns, and stores companion splits alongside the original data. At read time:
- Indexed columns are fetched from the Tantivy split (fast field access)
- Non-indexed columns are fetched from the original Parquet files
- Results are merged by the native reader
This lets teams add full-text search to existing lakehouse tables without migrating data.
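
The read-time behaviour described above can be pictured as a projection split followed by a row-wise merge. This is a conceptual sketch only: the function names are hypothetical, and the assumption that both sources return rows in the same positional order is made for illustration, since the source does not say how the native reader aligns rows.

```python
def split_projection(requested, indexed):
    """Partition requested columns into (from_split, from_parquet), keeping order."""
    from_split = [c for c in requested if c in indexed]
    from_parquet = [c for c in requested if c not in indexed]
    return from_split, from_parquet


def merge_rows(split_rows, parquet_rows, requested):
    """Combine positionally aligned partial rows into full rows."""
    if len(split_rows) != len(parquet_rows):
        raise ValueError("row counts differ; the two sources are not aligned")
    merged = []
    for from_index, from_file in zip(split_rows, parquet_rows):
        combined = {**from_file, **from_index}
        merged.append({col: combined[col] for col in requested})
    return merged


requested = ["ts", "message", "bytes"]
from_split, from_parquet = split_projection(requested, indexed={"message"})
rows = merge_rows(
    [{"message": "disk full"}],
    [{"ts": 1700000000, "bytes": 512}],
    requested,
)
```

Here only `message` is served from the companion split, while `ts` and `bytes` come from the original Parquet data.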
Configuration hierarchy
Configuration follows a precedence chain (lowest → highest):
- Hadoop configuration: cluster defaults
- Spark session config: spark.indextables.* keys
- Read/write options: per-query overrides via .option()
Key read-time settings:
| Config | Default | Effect |
|---|---|---|
| spark.indextables.read.mode | fast | fast limits results (interactive queries); complete returns all matches |
| spark.indextables.read.defaultLimit | 1000 | Row limit in fast mode |
| L2 disk cache | auto-enabled on Databricks/EMR | Persistent NVMe caching with LZ4/ZSTD compression |
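
The precedence chain and the fast-mode limit can be modelled with a small lookup sketch. It assumes that per-query options use the same full key names as session config and that built-in defaults sit below the Hadoop layer; neither detail is stated in the source, and the function names are hypothetical.

```python
from collections import ChainMap

# Defaults taken from the table above.
DEFAULTS = {
    "spark.indextables.read.mode": "fast",
    "spark.indextables.read.defaultLimit": "1000",
}


def resolve(hadoop_conf, session_conf, options):
    """Layer the sources so that the highest-precedence one wins.

    ChainMap searches its maps left to right, so the per-query options come
    first and the built-in defaults come last.
    """
    return ChainMap(options, session_conf, hadoop_conf, DEFAULTS)


def effective_limit(conf):
    """Row limit applied in fast mode, or None when complete mode returns everything."""
    if conf["spark.indextables.read.mode"] == "complete":
        return None
    return int(conf["spark.indextables.read.defaultLimit"])


conf = resolve(
    hadoop_conf={"spark.indextables.read.defaultLimit": "500"},
    session_conf={"spark.indextables.read.mode": "fast"},
    options={"spark.indextables.read.defaultLimit": "50"},
)
limit = effective_limit(conf)  # 50: the per-query option overrides the cluster default
```

Layering the lookups this way keeps each source untouched, so the same cluster and session settings can be combined with different per-query overrides.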
See also
- IndexTables Internals — DataSource V2 implementation, read/write paths, filter conversion, Arrow FFI bridge
- Apache Spark Query Planning and Execution — how Spark plans and executes queries
- Spark SQL Extension Framework — the extension points IndexTables hooks into
- Arrow C data interface — the zero-copy interface used for JVM ↔ Rust data exchange
- Introduction to Apache Iceberg — comparable open-table format for columnar data