IndexTables for Spark

IndexTables is an open-table format for Apache Spark that stores data in Tantivy search index files instead of Parquet. Where Iceberg and Delta Lake wrap columnar files with a transaction log, IndexTables wraps full-text search indexes with a transaction log — every column is indexed by default, and SQL predicates translate to native search engine operations rather than full scans.

Prerequisites

Architecture overview

IndexTables has three layers: the Spark driver compiles SQL through a custom parser and Catalyst rules; executors use Arrow FFI to exchange data zero-copy with the Rust-based Tantivy engine; and splits and transaction log versions live on cloud or local storage.

Why it exists

For log observability, cybersecurity investigations, and similar use cases, analysts need sub-second query response over billions of rows. Traditional columnar formats either require full scans or offer only limited predicate pushdown. IndexTables embeds a search engine directly in Spark executors:

  • Full-text search via a custom indexquery SQL operator (Tantivy query syntax)
  • Predicate pushdown that converts WHERE clauses to native Tantivy queries
  • Aggregate pushdown (COUNT, SUM, AVG, MIN, MAX) executed inside the search engine
  • Min/max data skipping — same idea as Iceberg/Delta, but evaluated in a single JNI call
  • No external infrastructure — runs entirely within Spark, no Elasticsearch or Solr cluster
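
The min/max data skipping item in the list above can be illustrated with a small, engine-independent sketch. `SplitStats` and `prune_splits` are hypothetical names chosen for illustration and are not part of IndexTables; in the real system the evaluation happens natively, behind JNI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    """Hypothetical per-split statistics for a single column."""
    path: str
    min_value: int
    max_value: int


def prune_splits(splits, lower, upper):
    """Keep only splits whose [min, max] range overlaps [lower, upper]."""
    return [s for s in splits if s.max_value >= lower and s.min_value <= upper]


splits = [
    SplitStats("a.split", 0, 99),
    SplitStats("b.split", 100, 199),
    SplitStats("c.split", 200, 299),
]
# A predicate such as `WHERE ts BETWEEN 150 AND 220` can only match b and c,
# so a.split never needs to be opened.
survivors = prune_splits(splits, 150, 220)
```

The point of the sketch is that pruning needs only the statistics recorded in the log, not the split contents.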

Core concepts

Splits

The unit of storage is a split — a self-contained Tantivy index archive in the Quickwit .split format. Each split contains:

  • A Tantivy inverted index (term dictionaries, postings lists, positions)
  • Fast fields (columnar storage for aggregation-eligible fields)
  • A footer with byte offsets that enables lazy loading — only the footer needs to be fetched initially before deciding whether to read the rest

Splits are analogous to Parquet files in Delta Lake: the transaction log tracks which splits are live, and operations like MERGE SPLITS compact small splits into larger ones (default target: 5 GB).
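
To make the compaction idea concrete, here is a minimal sketch of grouping small splits under a size target. The greedy strategy and the `plan_merge_groups` name are assumptions for illustration only; the source does not describe how MERGE SPLITS actually selects its inputs, and "5 GB" is read here as 5 GiB.

```python
TARGET_BYTES = 5 * 1024**3  # the 5 GB default target, read here as 5 GiB


def plan_merge_groups(split_sizes, target=TARGET_BYTES):
    """Greedily group split sizes (in bytes) so each group stays within target.

    Splits already at or above the target are left alone, and groups with a
    single member are dropped because merging one split changes nothing.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(s for s in split_sizes if s < target):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]


GB = 1024**3
groups = plan_merge_groups([1 * GB, 2 * GB, 2 * GB, 3 * GB, 6 * GB])
```

With these inputs the 1, 2, and 2 GB splits form one group of exactly 5 GB, while the 3 GB and 6 GB splits are left untouched.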

Transaction log

The transaction log is modeled after Delta Lake’s protocol. Each version is a file containing newline-delimited JSON actions:

| Action | Purpose |
|---|---|
| MetadataAction | Schema, table properties, partitioning |
| ProtocolAction | Format version (V1–V4) |
| AddAction | Register a new split (carries min/max stats, footer offsets, doc mapping, merge count) |
| RemoveAction | Tombstone a deleted split |
| SkipAction | Mark a split as skipped during listing |

Protocol V4 introduces an Avro state format with shared manifest files for faster log reads. The log implementation delegates to a native Rust module via JNI (tantivy4java) for optimistic concurrency, automatic retry, and LRU caching with configurable TTL.
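
The following sketch shows how replaying newline-delimited JSON actions yields the set of live splits. The JSON shape (`"add"` and `"remove"` keys with a `"path"` field) is an assumption borrowed from Delta Lake's log layout, not the confirmed IndexTables schema, and `live_splits` is a hypothetical helper.

```python
import json


def live_splits(version_files):
    """Replay versions in order and return the set of live split paths.

    Each element of version_files is the text of one version file:
    newline-delimited JSON, one action per line. The "add"/"remove" keys
    and the "path" field are assumed names, not the confirmed schema.
    """
    live = set()
    for text in version_files:
        for line in text.splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other action kinds (metadata, protocol, skip) do not change
            # the live set in this simplified model.
    return live


v0 = '{"add": {"path": "s1.split"}}\n{"add": {"path": "s2.split"}}'
v1 = '{"remove": {"path": "s1.split"}}\n{"add": {"path": "s3.split"}}'
result = live_splits([v0, v1])
```

After both versions are replayed, `s1.split` has been tombstoned and only `s2.split` and `s3.split` remain live.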

Field type system

Every Spark column maps to a Tantivy field type. The mapping is configured per-field:

| Tantivy type | Spark config value | Behaviour |
|---|---|---|
| Text | text | Full-text analyzed: tokenized, lowercased, searchable via indexquery |
| String | string (default for StringType) | Exact match: raw tokenizer, no analysis |
| JSON | json | Nested JSON fields, searchable by path |
| IP | ip | IPv4/IPv6 addresses with range query support |

Configuration uses spark.indextables.indexing.typemap.<field> = text|string|json|ip. Tokenizer and index record options (basic, freq, position) are configurable per-field.
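
As a small illustration of the per-field key pattern, the sketch below expands a field-to-type mapping into `spark.indextables.indexing.typemap.<field>` entries and rejects values outside the four documented types. `typemap_options` is a hypothetical helper, not an IndexTables API.

```python
ALLOWED_TYPES = {"text", "string", "json", "ip"}
PREFIX = "spark.indextables.indexing.typemap."


def typemap_options(field_types):
    """Turn {field: type} into the per-field config keys described above."""
    options = {}
    for field, type_name in field_types.items():
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {type_name!r} for field {field!r}")
        options[PREFIX + field] = type_name
    return options


options = typemap_options({"message": "text", "src_ip": "ip", "payload": "json"})
```

The resulting dictionary holds plain key/value strings, so it could presumably be applied at whichever configuration level is appropriate.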

SQL extensions

IndexTables registers a custom SQL parser and Catalyst rules via SparkSessionExtensions. Beyond standard SELECT/INSERT, it adds:

| Command | What it does |
|---|---|
| WHERE col indexquery 'term1 AND term2' | Full-text search: translates to Tantivy query syntax |
| MERGE SPLITS table | Compact small splits into larger ones (like Delta’s OPTIMIZE) |
| PURGE INDEXTABLE table | Remove tombstoned splits from storage |
| PREWARM CACHE table | Pre-download splits to local NVMe cache |
| CHECKPOINT table | Force a transaction log checkpoint |
| BUILD INDEXTABLES COMPANION FOR DELTA table | Build a search index over an existing Delta/Iceberg/Parquet table |
| DESCRIBE INDEXTABLE table | Show table metadata, split count, storage stats |

The indexquery operator is preprocessed by the parser into a function call (tantivy4spark_indexquery), which a Catalyst rule then intercepts and converts into a MixedBooleanFilter tree for pushdown to the scan builder.
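
The preprocessing step can be approximated with a textual rewrite. This is only a sketch: the real parser is not regex-based as far as the source says, and the argument order `(column, 'query')` of `tantivy4spark_indexquery` is an assumption. `preprocess` is a hypothetical name.

```python
import re

# Matches: <column> indexquery '<query string>' (keyword is case-insensitive).
# A doubled quote ('') inside the query string is treated as an escaped quote.
_INDEXQUERY = re.compile(
    r"(\w+(?:\.\w+)*)\s+indexquery\s+'((?:[^']|'')*)'",
    re.IGNORECASE,
)


def preprocess(sql):
    """Rewrite the infix indexquery operator into a function call."""
    return _INDEXQUERY.sub(r"tantivy4spark_indexquery(\1, '\2')", sql)


rewritten = preprocess(
    "SELECT * FROM logs WHERE message indexquery 'error AND timeout'"
)
```

Once the operator is in function-call form, an ordinary SQL parser can handle the statement, which is what allows a later Catalyst rule to find and convert the call.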

Companion mode

IndexTables can operate as a search sidecar for an existing Delta Lake, Iceberg, or Parquet table. The BUILD COMPANION command reads data files from the source table, builds Tantivy indexes for selected columns, and stores companion splits alongside the original data. At read time:

  1. Indexed columns are fetched from the Tantivy split (fast field access)
  2. Non-indexed columns are fetched from the original Parquet files
  3. Results are merged by the native reader

This lets teams add full-text search to existing lakehouse tables without migrating data.
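
The three read-time steps can be modeled as a join between two column sources. The sketch assumes rows are aligned by a shared row id, which the source does not state; `merge_rows` and the sample data are hypothetical.

```python
def merge_rows(indexed, non_indexed):
    """Join two per-row column dicts on a shared row id.

    indexed: {row_id: {column: value}} read from the companion split.
    non_indexed: {row_id: {column: value}} read from the source Parquet files.
    Only row ids present in `indexed` (the search hits) are returned.
    """
    return {
        row_id: {**cols, **non_indexed.get(row_id, {})}
        for row_id, cols in indexed.items()
    }


hits = {7: {"message": "disk full"}, 9: {"message": "disk error"}}
parquet = {7: {"bytes": 512}, 8: {"bytes": 64}, 9: {"bytes": 2048}}
merged = merge_rows(hits, parquet)
```

Row 8 is never touched because it was not a search hit, which mirrors the idea that the index decides which rows are fetched from the original files.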

Configuration hierarchy

Configuration follows a precedence chain (lowest → highest):

  1. Hadoop configuration — cluster defaults
  2. Spark session config: spark.indextables.* keys
  3. Read/write options — per-query overrides via .option()
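
The precedence chain behaves like a layered lookup in which the highest layer that defines a key wins. The sketch below models that with the standard library's `ChainMap`; the `resolve` helper and the sample values are illustrative assumptions, not IndexTables code.

```python
from collections import ChainMap


def resolve(hadoop_conf, session_conf, options):
    """Return a view in which higher-precedence layers win."""
    # ChainMap searches its maps left to right, so list the highest first.
    return ChainMap(options, session_conf, hadoop_conf)


hadoop = {
    "spark.indextables.read.mode": "fast",
    "spark.indextables.read.defaultLimit": "1000",
}
session = {"spark.indextables.read.defaultLimit": "5000"}
per_query = {"spark.indextables.read.mode": "complete"}

conf = resolve(hadoop, session, per_query)
```

Here the per-query option overrides the read mode, the session config overrides the limit, and any key set only at the Hadoop level would still be visible as a cluster default.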

Key read-time settings:

| Config | Default | Effect |
|---|---|---|
| spark.indextables.read.mode | fast | fast limits results (interactive queries); complete returns all matches |
| spark.indextables.read.defaultLimit | 1000 | Row limit in fast mode |
| L2 disk cache | auto-enabled on Databricks/EMR | Persistent NVMe caching with LZ4/ZSTD compression |

See also