IndexTables for Spark

IndexTables is an open-table format for Apache Spark that stores data in Tantivy search index files instead of Parquet. Where Iceberg and Delta Lake wrap columnar files with a transaction log, IndexTables wraps full-text search indexes with a transaction log — every column is indexed by default, and SQL predicates translate to native search engine operations rather than full scans.

Prerequisites

Apache Spark Query Planning and Execution — logical/physical plans, Catalyst optimizer

Spark SQL Extension Framework — how Spark extensions inject custom parsers and rules

Arrow C Data Interface: the stable ABI used for zero-copy JVM ↔ Rust exchange

Architecture overview

IndexTables has three layers. The Spark driver compiles SQL through a custom parser and Catalyst rules; executors use Arrow FFI to exchange data zero-copy with the Rust-based Tantivy engine; and splits and transaction log versions live on cloud or local storage.

Why it exists

For log observability, cybersecurity investigations, and similar use cases, analysts need sub-second query response over billions of rows. Traditional columnar formats answer such queries with full scans or, at best, limited predicate pushdown. IndexTables instead embeds a search engine directly in Spark executors:

Full-text search via a custom indexquery SQL operator (Tantivy query syntax)
Predicate pushdown that converts WHERE clauses to native Tantivy queries
Aggregate pushdown (COUNT, SUM, AVG, MIN, MAX) executed inside the search engine
Min/max data skipping — same idea as Iceberg/Delta, but evaluated in a single JNI call
No external infrastructure — runs entirely within Spark, no Elasticsearch or Solr cluster
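As a rough illustration of the min/max data skipping idea, the sketch below prunes splits whose value range cannot overlap a range predicate. The `SplitStats` class and the `prune` helper are hypothetical names used only for this example; the actual evaluation happens natively, behind JNI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    """Hypothetical per-split min/max statistics for a single column."""
    path: str
    min_value: int
    max_value: int


def may_contain(stats: SplitStats, lo: int, hi: int) -> bool:
    """True if the split's [min, max] range overlaps the query range [lo, hi]."""
    return stats.min_value <= hi and stats.max_value >= lo


def prune(splits, lo, hi):
    """Keep only splits that could hold rows matching lo <= col <= hi."""
    return [s for s in splits if may_contain(s, lo, hi)]


splits = [
    SplitStats("a.split", 0, 99),
    SplitStats("b.split", 100, 199),
    SplitStats("c.split", 200, 299),
]
survivors = prune(splits, 150, 220)
print([s.path for s in survivors])  # ['b.split', 'c.split']
```

Splits that fail the overlap test are never opened, which is where the savings come from.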

Core concepts

Splits

The unit of storage is a split — a self-contained Tantivy index archive in the Quickwit .split format. Each split contains:

A Tantivy inverted index (term dictionaries, postings lists, positions)
Fast fields (columnar storage for aggregation-eligible fields)
A footer with byte offsets that enables lazy loading — only the footer needs to be fetched initially before deciding whether to read the rest
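To make the lazy-loading idea concrete, here is a minimal sketch of a footer-indexed archive. The layout (a JSON footer followed by an 8-byte little-endian footer length) is a toy format invented for this example and is not the Quickwit .split layout; only the principle of reading the footer first and then fetching byte ranges on demand carries over.

```python
import io
import json
import struct


def build_archive(sections):
    """Concatenate sections, then append a JSON footer and its length (toy layout)."""
    body = b""
    offsets = {}
    for name, data in sections.items():
        offsets[name] = [len(body), len(body) + len(data)]
        body += data
    footer = json.dumps(offsets).encode()
    return body + footer + struct.pack("<Q", len(footer))


def read_footer(f):
    """Read only the trailing bytes: the footer length, then the footer itself."""
    f.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<Q", f.read(8))
    f.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(f.read(footer_len))


def read_section(f, footer, name):
    """Fetch a single section by the byte range recorded in the footer."""
    start, end = footer[name]
    f.seek(start)
    return f.read(end - start)


archive = io.BytesIO(build_archive({"postings": b"PPPP", "fast_fields": b"FFFFFF"}))
footer = read_footer(archive)
print(footer)  # {'postings': [0, 4], 'fast_fields': [4, 10]}
print(read_section(archive, footer, "fast_fields"))  # b'FFFFFF'
```

On object storage the same pattern would translate into ranged reads: one small request for the footer, followed by requests for only the sections a query needs.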

Splits are analogous to Parquet files in Delta Lake: the transaction log tracks which splits are live, and operations like MERGE SPLITS compact small splits into larger ones (default target: 5 GB).

Transaction log

The transaction log is modeled after Delta Lake's protocol. Each version is a file containing newline-delimited JSON actions:

Action	Purpose
`MetadataAction`	Schema, table properties, partitioning
`ProtocolAction`	Format version (V1–V4)
`AddAction`	Register a new split (carries min/max stats, footer offsets, doc mapping, merge count)
`RemoveAction`	Tombstone a deleted split
`SkipAction`	Mark a split as skipped during listing

Protocol V4 introduces an Avro state format with shared manifest files for faster log reads. The log implementation delegates to a native Rust module via JNI (tantivy4java) for optimistic concurrency, automatic retry, and LRU caching with configurable TTL.
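The sketch below shows how replaying add and remove actions across versions yields the set of live splits. The JSON keys (`add`, `remove`, `path`, `numRecords`) are assumptions borrowed from the general shape of Delta Lake's log, not the confirmed IndexTables wire format, and the real implementation runs natively rather than in Python.

```python
import json


def live_splits(versions):
    """Replay versions in order; each version is newline-delimited JSON actions."""
    live = {}
    for text in versions:
        for line in text.splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            action = json.loads(line)
            if "add" in action:
                # An add registers (or re-registers) a split under its path.
                live[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                # A remove tombstones a split; it no longer counts as live.
                live.pop(action["remove"]["path"], None)
    return live


v0 = (
    '{"add": {"path": "s1.split", "numRecords": 10}}\n'
    '{"add": {"path": "s2.split", "numRecords": 20}}'
)
v1 = (
    '{"remove": {"path": "s1.split"}}\n'
    '{"add": {"path": "s3.split", "numRecords": 30}}'
)
state = live_splits([v0, v1])
print(sorted(state))  # ['s2.split', 's3.split']
```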

Field type system

Every Spark column maps to a Tantivy field type. The mapping is configured per-field:

Tantivy type	Spark config value	Behaviour
Text	`text`	Full-text analyzed — tokenized, lowercased, searchable via `indexquery`
String	`string` (default for `StringType`)	Exact match — raw tokenizer, no analysis
JSON	`json`	Nested JSON fields, searchable by path
IP	`ip`	IPv4/IPv6 addresses with range query support

To set a field's type, use `spark.indextables.indexing.typemap.<field>` with one of `text`, `string`, `json`, or `ip`. The tokenizer and the index record option (basic, freq, position) are also configurable per field.
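As a small sketch of how the per-field keys compose, the helper below turns a field-to-type mapping into configuration entries and rejects values outside the four types listed above. `typemap_conf` is a hypothetical helper written for this note, not part of IndexTables.

```python
ALLOWED_TYPES = {"text", "string", "json", "ip"}
PREFIX = "spark.indextables.indexing.typemap."


def typemap_conf(mapping):
    """Build typemap config entries from a {field: type} mapping."""
    conf = {}
    for field, kind in mapping.items():
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {kind!r} for field {field!r}")
        conf[PREFIX + field] = kind
    return conf


conf = typemap_conf({"message": "text", "src_ip": "ip"})
for key, value in conf.items():
    print(key, "=", value)
```

In a PySpark session, each pair could then be applied with `spark.conf.set(key, value)` before writing.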

SQL extensions

IndexTables registers a custom SQL parser and Catalyst rules via SparkSessionExtensions. Beyond standard SELECT/INSERT, it adds:

Command	What it does
`WHERE col indexquery 'term1 AND term2'`	Full-text search — translates to Tantivy query syntax
`MERGE SPLITS table`	Compact small splits into larger ones (like Delta’s OPTIMIZE)
`PURGE INDEXTABLE table`	Remove tombstoned splits from storage
`PREWARM CACHE table`	Pre-download splits to local NVMe cache
`CHECKPOINT table`	Force a transaction log checkpoint
`BUILD INDEXTABLES COMPANION FOR DELTA table`	Build a search index over an existing Delta/Iceberg/Parquet table
`DESCRIBE INDEXTABLE table`	Show table metadata, split count, storage stats

The indexquery operator is preprocessed by the parser into a function call (tantivy4spark_indexquery), which a Catalyst rule then intercepts and converts into a MixedBooleanFilter tree for pushdown to the scan builder.
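The preprocessing step can be pictured with the toy rewrite below. A regular expression stands in for the real parser, and the argument order of `tantivy4spark_indexquery` (column first, query string second) is an assumption made for illustration; edge cases such as escaped quotes or string literals that themselves contain the word "indexquery" are not handled.

```python
import re

# Matches: <column> indexquery '<query string>'
_INDEXQUERY = re.compile(
    r"(\b[\w.]+)\s+indexquery\s+('(?:[^']|'')*')",
    re.IGNORECASE,
)


def preprocess(sql: str) -> str:
    """Rewrite the infix operator into a function call that a later rule can match."""
    return _INDEXQUERY.sub(r"tantivy4spark_indexquery(\1, \2)", sql)


rewritten = preprocess(
    "SELECT * FROM logs WHERE message indexquery 'error AND timeout'"
)
print(rewritten)
# SELECT * FROM logs WHERE tantivy4spark_indexquery(message, 'error AND timeout')
```

Rewriting to an ordinary function call means the rest of the statement can go through the standard Spark SQL parser unchanged.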

Companion mode

IndexTables can operate as a search sidecar for an existing Delta Lake, Iceberg, or Parquet table. The BUILD INDEXTABLES COMPANION command reads data files from the source table, builds Tantivy indexes for selected columns, and stores companion splits alongside the original data. At read time:

Indexed columns are fetched from the Tantivy split (fast field access)
Non-indexed columns are fetched from the original Parquet files
Results are merged by the native reader

This lets teams add full-text search to existing lakehouse tables without migrating data.
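The read-time merge can be sketched as a lookup keyed by row position. How the native reader actually locates source rows is not described here, so the `(file, row)` locator and the dictionary shapes below are purely hypothetical.

```python
def merge_hit(hit, parquet_rows):
    """Combine a hit's indexed columns with the other columns of its source row."""
    source = parquet_rows[(hit["file"], hit["row"])]
    return {**source, **hit["indexed"]}


# Non-indexed columns, keyed by a hypothetical (file, row index) locator.
parquet_rows = {
    ("part-0.parquet", 0): {"id": 1, "payload": "aaa"},
    ("part-0.parquet", 1): {"id": 2, "payload": "bbb"},
}

# A search hit carrying the indexed column values and a pointer to its source row.
hits = [{"file": "part-0.parquet", "row": 1, "indexed": {"message": "disk error"}}]

rows = [merge_hit(h, parquet_rows) for h in hits]
print(rows)  # [{'id': 2, 'payload': 'bbb', 'message': 'disk error'}]
```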

Configuration hierarchy

Configuration follows a precedence chain (lowest → highest):

Hadoop configuration — cluster defaults
Spark session config — spark.indextables.* keys
Read/write options — per-query overrides via .option()
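A minimal sketch of this precedence chain, using plain dictionaries in place of the real Hadoop, session, and option objects (`effective_conf` is a hypothetical helper):

```python
from collections import ChainMap


def effective_conf(hadoop, session, options):
    """Resolve keys with options overriding session config, which overrides Hadoop."""
    # ChainMap searches its maps left to right, so the highest-precedence layer goes first.
    return ChainMap(options, session, hadoop)


hadoop = {
    "spark.indextables.read.mode": "complete",
    "spark.indextables.read.defaultLimit": "1000",
}
session = {"spark.indextables.read.mode": "fast"}
options = {"spark.indextables.read.defaultLimit": "50"}

conf = effective_conf(hadoop, session, options)
print(conf["spark.indextables.read.mode"])          # fast
print(conf["spark.indextables.read.defaultLimit"])  # 50
```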

Key read-time settings:

Config	Default	Effect
`spark.indextables.read.mode`	`fast`	`fast` limits results (interactive queries); `complete` returns all matches
`spark.indextables.read.defaultLimit`	1000	Row limit in fast mode
L2 disk cache	auto-enabled on Databricks/EMR	Persistent NVMe caching with LZ4/ZSTD compression
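The difference between the two read modes can be sketched as follows. This models only the behaviour described in the table; `apply_read_mode` is a hypothetical function, and how an explicit SQL LIMIT interacts with the default is not covered.

```python
from itertools import islice


def apply_read_mode(matches, mode="fast", default_limit=1000):
    """Return at most default_limit rows in fast mode, or every match in complete mode."""
    if mode == "fast":
        return list(islice(matches, default_limit))
    if mode == "complete":
        return list(matches)
    raise ValueError(f"unknown read mode: {mode!r}")


matches = range(5000)
print(len(apply_read_mode(matches)))                   # 1000
print(len(apply_read_mode(matches, mode="complete")))  # 5000
```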
