Fjall

Fjall is an embedded LSM-tree key-value store written in pure Rust, designed as a lightweight alternative to RocksDB. Current version 2.11.2. It’s a library you link into your application, not a separate server process. The primary differentiator is being native Rust rather than C++ with FFI bindings, which matters for type safety integration and static linking simplicity.

LSM Tree Architecture

The Log-Structured Merge tree optimizes for write throughput over read latency by treating writes as append-only operations. Data flows through three stages: in-memory MemTable implemented as a concurrent skip list, immutable SSTables (Sorted String Tables) on disk, and background compaction that merges SSTables across levels. This architecture achieves high write throughput because writes are sequential disk operations, which are 10-100x faster than random writes on SSDs.

The write path starts with appending to a Write-Ahead Log for crash recovery, then inserting into the active MemTable. When the MemTable reaches its size threshold (default 64MB), it becomes immutable and flushes to disk as a Level 0 SSTable. L0 files may have overlapping key ranges because they come directly from MemTable flushes. Background compaction merges L0 files into L1, where files have non-overlapping key ranges. Each subsequent level is roughly 10x larger than the previous one.

// Write path with fjall 2.11
let keyspace = Config::new(folder).open()?;
let partition = keyspace.open_partition("data", PartitionCreateOptions::default())?;
 
partition.insert(b"key", b"value")?;
// 1. Appends to WAL (fsync)
// 2. Inserts into MemTable (in-memory)
// 3. Returns immediately

The read path must check multiple locations because recent data may not have reached lower levels yet. It searches the active MemTable first, then immutable MemTables waiting to flush, then all L0 SSTables (since they overlap), then performs binary search by key range in L1+ levels. Each SSTable contains a Bloom filter (probabilistic data structure with ~1% false positive rate) and a block index. The Bloom filter quickly rejects keys that don’t exist without disk I/O.

Amplification Properties

LSM trees exhibit three forms of amplification that fundamentally limit their performance. Write amplification occurs because the same logical write physically writes data multiple times. A key-value pair written to the MemTable gets written once to L0, then again when L0 compacts to L1, again when L1 compacts to L2, and so on. The amplification factor is typically 10-30x, meaning writing 1GB of user data results in 10-30GB of actual disk writes over time. This is inherent to maintaining sorted order across levels.

Space amplification happens because deleted or updated data isn’t immediately removed. When you delete a key, LSM trees write a tombstone marker rather than finding and removing the original entry. The actual space is reclaimed only when compaction merges the tombstone with the original data. Similarly, updates create new versions while old versions persist until compaction. Space amplification is typically 1.1-2x, varying based on compaction aggressiveness.

Read amplification refers to checking multiple SSTables for a single key. In the worst case, a read checks all L0 files (which may overlap) plus one file per level. If L0 has 4 files and there are 6 levels, that’s 10 potential SSTable reads. Bloom filters dramatically reduce this by eliminating SSTables that definitely don’t contain the key, but they can’t eliminate all checks due to false positives.

Compaction Strategies

Fjall supports Leveled Compaction (default) and Size-Tiered Compaction. The choice fundamentally trades write amplification against read amplification.

Leveled compaction maintains non-overlapping SSTables at L1 and below. When L0 grows too large, it selects one L0 file and all L1 files that overlap its key range, merges them, and outputs new L1 files. This process repeats for each level. The key property is that reading from L1+ requires checking at most one file per level because ranges don’t overlap. Write amplification is high because data gets rewritten at each level transition.

Size-tiered compaction groups SSTables of similar size into tiers and merges entire tiers when a threshold is reached. Multiple tiers can have overlapping key ranges, so reads must check all of them. However, data moves between tiers less frequently than in leveled compaction, reducing write amplification. This strategy suits write-heavy workloads where read performance is less critical, particularly time-series data where queries mostly access recent data.

// Configure compaction strategy
let partition = keyspace.open_partition(
    "timeseries",
    PartitionCreateOptions::default()
        .compaction_strategy(CompactionStrategy::SizeTiered)
        .level_ratio(4)  // More aggressive than default 10
)?;

Partitions and Column Families

Fjall’s partitions are identical to RocksDB’s column families. Both terms describe independent LSM trees that share a single Write-Ahead Log for atomic cross-partition writes. Each partition has its own MemTable, SSTables, and compaction schedule, but they coordinate through a shared WAL to provide atomicity.

let users = keyspace.open_partition("users", PartitionCreateOptions::default())?;
let posts = keyspace.open_partition("posts", PartitionCreateOptions::default())?;
 
// Atomic write across partitions
let tx = keyspace.write_tx();
tx.insert(&users, b"user:1", b"alice");
tx.insert(&posts, b"post:1", b"content");
tx.commit()?;  // Both writes or neither

The atomicity mechanism writes all transaction entries to the WAL with a commit marker, then applies them to their respective partition MemTables. On crash recovery, the WAL replay either sees the commit marker (applies all writes) or doesn’t (discards all writes).

Performance Comparison with RocksDB

Fjall achieves roughly 50% of RocksDB’s throughput in write-heavy workloads and 60% in read-heavy workloads. These differences stem from specific optimizations rather than the Rust vs C++ language choice. RocksDB has adaptive MemTable implementations (skip list, hash table, vector) selected based on workload, while Fjall uses only skip lists. RocksDB’s block cache implements more sophisticated eviction policies (LRU with multiple levels) compared to Fjall’s simpler LRU. RocksDB’s compaction scheduler uses write buffer size, L0 file count, and per-level write amplification to prioritize compactions, while Fjall uses simpler heuristics.

The gap narrows in scenarios where the bottleneck is I/O rather than CPU. For large value workloads (>4KB values) where serialization and disk I/O dominate, both systems perform within 10-20% of each other. Fjall’s advantage is deployment simplicity: a Rust binary statically linked with Fjall is self-contained, while RocksDB requires C++ shared libraries and careful version management.

Tip

Choose Fjall when you’re building a Rust application and don’t need RocksDB’s 100+ configuration knobs. Choose RocksDB when you need maximum performance or production-critical stability.

Key-Value Separation

For values larger than 1KB, storing them directly in SSTables causes excessive compaction overhead. Value log separation (vLog) stores keys and small values in SSTables, but puts large values in a separate append-only log. The SSTable contains a pointer (offset into vLog) instead of the actual value.

let keyspace = Config::new(folder)
    .enable_value_log()
    .value_log_threshold(1024)  // Values > 1KB go to vLog
    .open()?;

This optimization reduces write amplification for large values because compaction only rewrites pointers, not the values themselves. The trade-off is read latency: fetching a value requires reading the SSTable (for the pointer) then seeking to the vLog location. Additionally, the vLog requires garbage collection. When a key is deleted or updated, its old value in the vLog becomes garbage but isn’t immediately reclaimed. Fjall periodically scans the vLog, identifies garbage, and rewrites still-live values to compact the log.

MVCC Implementation

Fjall implements Multi-Version Concurrency Control through versioned internal keys. Each write receives a monotonically increasing sequence number from an atomic counter. The internal key format is (user_key, sequence, op_type) where op_type is either Put or Delete. Keys are sorted by user_key ascending, then sequence descending, placing the newest version first.

// Internal representation
struct InternalKey {
    user_key: Vec<u8>,
    sequence: u64,        // From atomic counter
    op_type: OpType,      // Put or Delete (tombstone)
}
 
impl Ord for InternalKey {
    fn cmp(&self, other: &Self) -> Ordering {
        self.user_key.cmp(&other.user_key)
            .then(other.sequence.cmp(&self.sequence))  // Reversed: newest first
    }
}

A snapshot captures the current sequence number. Reads through that snapshot only see entries with sequence <= snapshot_sequence. This provides consistent reads without blocking writes. Compaction removes old versions when no active snapshot can see them, which is determined by tracking the oldest snapshot’s sequence number.

The implementation is single-writer per partition despite using MVCC. Only one thread can acquire the write lock to insert into the MemTable and advance the sequence counter. MVCC enables concurrent readers, not concurrent writers.

Memory Footprint

A Fjall instance consumes memory through several components. The active MemTable defaults to 64MB and holds recent writes. Up to 3 immutable MemTables may exist simultaneously during flush operations (old MemTable flushing to disk while new writes continue), adding up to 192MB. The block cache (default 64MB) holds frequently accessed SSTable blocks. Index blocks and Bloom filters for all SSTables in L0-L2 stay in memory, consuming approximately 1% of the dataset size. Metadata overhead (partition info, file handles, compaction state) adds roughly 0.5%.

Total memory usage is approximately 150MB baseline plus 1.5% of the dataset size. A database with 100GB of data would use ~1.65GB of memory. This is tunable: increasing MemTable size reduces flush frequency and improves write batching, while increasing block cache size improves read performance.

Skip List Internals

LSM MemTables use concurrent skip lists because they provide sorted order with lock-free concurrent reads and writes. A skip list is a multi-level linked list where each node has a random height. Level 0 contains all elements. Level 1 contains roughly 50% of elements. Level 2 contains roughly 25%, and so on. This probabilistic structure provides expected O(log n) search by creating “express lanes” that skip over elements.

L2:  1 ----------------→ 56 ----------------→ 99
L1:  1 ----→ 23 -------→ 56 ----→ 77 -------→ 99
L0:  1 → 12 → 23 → 34 → 56 → 67 → 77 → 89 → 99

Each node stores forward pointers for all its levels using atomic pointers. Inserting a node involves traversing to find the insertion point (reading atomic pointers), allocating the new node with random height, then using compare-and-swap to atomically link it into each level from bottom to top.

struct Node<K, V> {
    key: K,
    value: V,
    forward: [AtomicPtr<Node<K, V>>; MAX_HEIGHT],
}
 
// Insert at specific level using CAS
fn link_at_level(pred: &Node, new: *mut Node, succ: *mut Node, level: usize) -> bool {
    pred.forward[level].compare_exchange(
        succ,                    // Expected current value
        new,                     // New value to set
        Ordering::SeqCst,
        Ordering::SeqCst
    ).is_ok()
}

The lock-free property means no thread holds locks that could block others. If a compare-and-swap fails because another thread modified the pointer, the inserting thread simply retries. This allows multiple concurrent insertions without serialization. Searches never modify pointers, so they never block or retry.

Memory Ordering in Concurrent Structures

The choice of memory ordering in atomic operations determines what guarantees threads have about seeing each other’s modifications. When a thread inserts a node, it must ensure that all the node’s forward pointers are set before making the node visible to other threads. This requires Release ordering on the final pointer update that links the node in.

// Step 1: Set new node's forward pointers (private, no synchronization needed)
(*new_node).forward[level].store(successor, Ordering::Relaxed);
 
// Step 2: Make node visible (Release ensures all prior writes are visible)
predecessor.forward[level].store(new_node, Ordering::Release);
 
// Search path: Acquire ensures we see all writes before the Release
let next = node.forward[level].load(Ordering::Acquire);

Without proper ordering, a thread might see the new node pointer before the node’s forward pointers are initialized, leading to dereferencing null or garbage values. SeqCst (Sequential Consistency) provides the strongest guarantees but has performance cost. Fjall uses Acquire/Release for performance while maintaining correctness.

Deletion in Skip Lists

Deleting from a lock-free skip list is more complex than insertion because a node being deleted might have concurrent threads traversing through it. The standard solution is logical deletion: marking the node as deleted without physically removing it. In Fjall’s MemTable, this is acceptable because the MemTable is short-lived (typically flushed within minutes), so logical deletions don’t accumulate significantly.

The implementation stores the value as an atomic pointer. Deletion sets this pointer to null atomically. Searches that encounter a null value skip that node. The node remains physically linked in the structure until the entire MemTable flushes to disk, at which point the MemTable is discarded and the memory reclaimed.

An alternative approach marks nodes with a flag in the pointer’s lower bits (exploiting alignment), then physically removes marked nodes in a subsequent pass. This is more complex but reduces memory usage for long-lived structures.

Why Skip Lists Over Alternatives

The alternatives for a concurrent sorted structure are lock-free B-trees and synchronized balanced trees. Lock-free B-trees are substantially more complex to implement correctly due to node splitting, merging, and rebalancing operations that must maintain consistency without locks. The implementation complexity makes them error-prone and difficult to verify.

Synchronized alternatives like RwLock<BTreeMap> serialize all writes through a single lock, eliminating concurrency benefits. While readers can proceed concurrently, writes block all other writes and wait for readers to finish. In write-heavy LSM workloads, this serialization becomes a bottleneck.

Skip lists provide a pragmatic middle ground: simpler than lock-free B-trees, more concurrent than synchronized trees, with acceptable performance characteristics. The O(log n) expected complexity is sufficient for MemTable operations, and the probabilistic nature rarely degrades to worst-case O(n) in practice.

Edmondo's Vault

Explorer

Fjall

LSM Tree Architecture

Amplification Properties

Compaction Strategies

Partitions and Column Families

Performance Comparison with RocksDB

Key-Value Separation

MVCC Implementation

Memory Footprint

Skip List Internals

Memory Ordering in Concurrent Structures

Deletion in Skip Lists

Why Skip Lists Over Alternatives

Graph View

Table of Contents

Backlinks