Hash Index

Hash indexes are designed to map keys to storage locations quickly, making them ideal for applications requiring fast lookups. Key use cases include key-value stores, such as Redis or DynamoDB, where data retrieval by key is critical for performance. They are also useful in scenarios like in-memory caching, where fast access to recently used data is necessary, and in deduplication systems, where hash indexes can efficiently detect duplicates by storing unique hashes.

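In relational databases, hash indexes optimize equality lookups (e.g., WHERE id = 123). Unlike tree-based indexes (e.g., B-trees), hash indexes are not suitable for range queries (e.g., WHERE id BETWEEN 100 AND 200): the hash function scatters neighboring keys across unrelated buckets, so the index keeps no key order to scan.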
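The difference between equality and range lookups can be illustrated with a small sketch. It assumes a Python dict as a stand-in for a hash index and a sorted list searched with bisect as a stand-in for a B-tree; the sample rows and the helper names range_scan_hash and range_scan_sorted are made up for illustration.

```python
import bisect

# Hypothetical sample rows: id -> name.
rows = {17: "alice", 42: "bob", 123: "carol", 256: "dave"}

# A Python dict is itself a hash table, so it plays the role of the hash index.
hash_index = dict(rows)
print(hash_index[123])  # equality lookup: one hash computation, one bucket

def range_scan_hash(index, low, high):
    """Range query on a hash index: every key has to be examined."""
    return sorted(k for k in index if low <= k <= high)

# A sorted list of keys stands in for an ordered (tree-based) index.
sorted_keys = sorted(rows)

def range_scan_sorted(keys, low, high):
    """Range query on an ordered index: binary search, then a contiguous slice."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return keys[lo:hi]

print(range_scan_hash(hash_index, 40, 200))
print(range_scan_sorted(sorted_keys, 40, 200))
```

Both range functions return the same keys, but the hash version touches every entry while the ordered version touches only the matching slice.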

How a Hash Index Works

A hash index relies on a hash function to compute a numeric value (hash code) from a key. This hash code determines the location of the value in an array of storage locations, known as buckets. For example, a hash function applied to the key "key1" might produce the value 1234. If there are 10 buckets, the bucket number is calculated as 1234 % 10, placing "key1" in bucket 4.
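As a minimal sketch of this step, the snippet below derives a bucket number from a key. The choice of SHA-256 and the function names hash_code and bucket_for are assumptions made for illustration; real systems usually use faster, non-cryptographic hash functions.

```python
import hashlib

def hash_code(key: str) -> int:
    """Return a stable, non-negative integer derived from the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # use the first 64 bits

def bucket_for(key: str, num_buckets: int) -> int:
    """Map the key's hash code onto one of num_buckets buckets."""
    return hash_code(key) % num_buckets

# The arithmetic from the text: hash code 1234 with 10 buckets lands in bucket 4.
assert 1234 % 10 == 4

# The actual bucket for "key1" depends on the hash function chosen above.
print(bucket_for("key1", 10))
```

hashlib is used instead of Python's built-in hash() because the built-in is randomized per process for strings, which would make bucket numbers differ between runs.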

When multiple keys hash to the same bucket (a situation called a collision), the index must resolve this. Common strategies include chaining, where each bucket stores a list of key-value pairs, and open addressing, where the system probes for the next available bucket. Chaining is simpler to implement and degrades gracefully as buckets fill up, while open addressing keeps entries in one contiguous array and tends to be more cache-friendly.
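Chaining can be sketched in a few lines. The class below is an illustrative toy, not any particular database's implementation; the name ChainedHashIndex and the fixed bucket count are assumptions, and resizing is left out.

```python
import hashlib
from typing import Any, List, Optional, Tuple

class ChainedHashIndex:
    """Toy hash index that resolves collisions by chaining."""

    def __init__(self, num_buckets: int = 8) -> None:
        # Each bucket is a list (the "chain") of (key, value) pairs.
        self._buckets: List[List[Tuple[str, Any]]] = [[] for _ in range(num_buckets)]

    def _bucket(self, key: str) -> List[Tuple[str, Any]]:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self._buckets[int.from_bytes(digest[:8], "big") % len(self._buckets)]

    def put(self, key: str, value: Any) -> None:
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: extend the chain

    def get(self, key: str) -> Optional[Any]:
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return None                      # key not in the index

# Five keys in two buckets guarantee at least one collision.
index = ChainedHashIndex(num_buckets=2)
for i in range(5):
    index.put(f"k{i}", i)
print([index.get(f"k{i}") for i in range(5)])
```

Lookups stay correct under collisions because every entry in the chain is compared by key, not just by bucket number.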

Distributed Hash Indexes

In distributed systems, hash indexes are sharded across multiple nodes. Consistent hashing is a common approach, dividing the hash space into segments assigned to nodes. When a node is added or removed, only the keys in the affected segments need to be redistributed: on average about K/N keys for K keys and N nodes, rather than nearly all of them. For fault tolerance, buckets are replicated across nodes.
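To see why this matters, it helps to look at the naive alternative of picking a node with hash % N. The sketch below is only that contrast case, not consistent hashing; the helper names stable_hash and shard_mod and the key names are assumptions for illustration.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Stable 64-bit hash of the key (SHA-256 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

def shard_mod(key: str, num_nodes: int) -> int:
    """Naive sharding: the node index is simply the hash modulo the node count."""
    return stable_hash(key) % num_nodes

keys = [f"key-{i}" for i in range(10_000)]

# Grow the cluster from 4 to 5 nodes and count the keys whose node changes.
moved = sum(1 for k in keys if shard_mod(k, 4) != shard_mod(k, 5))
fraction_moved = moved / len(keys)
print(f"moved when growing from 4 to 5 nodes: {fraction_moved:.0%}")
```

A key keeps its node only when its hash gives the same remainder modulo 4 and modulo 5, which holds for 4 of every 20 hash values, so about 80% of keys are expected to move. Consistent hashing aims to cut that to roughly the new node's fair share.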

At its core, consistent hashing maps both keys and nodes onto the same circular hash space. Keys are then assigned to the first node that appears on the circle when moving clockwise.
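A minimal sketch of such a ring, with one point per node, might look like the following. "Clockwise" is modeled as increasing hash value with wrap-around; the class name HashRing, the use of SHA-256, and the node names are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List

def ring_hash(value: str) -> int:
    """Position on the ring: a stable 64-bit hash of a key or node name."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Basic consistent hash ring with a single point per node."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def remove_node(self, node: str) -> None:
        point = ring_hash(node)
        self._points.remove(point)
        del self._owner[point]

    def node_for(self, key: str) -> str:
        # First node point at or after the key's position; wrap to the start.
        i = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[i % len(self._points)]]

ring = HashRing(["NodeA", "NodeB", "NodeC"])
keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("NodeD")
moved = [k for k in keys if ring.node_for(k) != before[k]]
# Every key that moved now belongs to the new node; the rest are untouched.
print(f"{len(moved)} of {len(keys)} keys moved after adding NodeD")
```

How many keys move depends on where NodeD happens to land on the ring, which is exactly the unevenness that a single point per node can cause.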

Basic Consistent Hashing

In basic consistent hashing, each node is represented by a single point on the hash ring. This creates two main challenges:

Uneven Load Distribution: If hash values aren’t evenly distributed, some nodes may handle significantly more keys than others.
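Limited Scalability: When a node is removed, its entire range moves to a single neighbor, and when a node is added, it takes keys from only one neighbor, so membership changes shift load unevenly between adjacent nodes.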
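The uneven distribution is easy to observe by measuring how much of the ring each node owns when it has a single point. The sketch below assumes a 64-bit ring and SHA-256 for placement; arc_shares and the node names are hypothetical, and the function expects at least two nodes.

```python
import hashlib
from typing import Dict, List

RING_SIZE = 2**64

def ring_hash(value: str) -> int:
    """Position on a 64-bit ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

def arc_shares(nodes: List[str]) -> Dict[str, float]:
    """Fraction of the ring owned by each node when every node has one point.

    A node owns the arc that ends at its own point and starts just after
    the previous point on the ring.
    """
    points = sorted((ring_hash(node), node) for node in nodes)
    shares: Dict[str, float] = {}
    for i, (point, node) in enumerate(points):
        previous = points[i - 1][0]  # for i == 0 this wraps to the last point
        shares[node] = ((point - previous) % RING_SIZE) / RING_SIZE
    return shares

for node, share in arc_shares(["NodeA", "NodeB", "NodeC"]).items():
    print(f"{node}: {share:.1%} of the ring")
```

With only three randomly placed points, the shares are rarely close to one third each.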

Virtual Nodes (vnodes) to the Rescue

To solve these problems, consistent hashing often uses virtual nodes. Instead of each physical node having a single point on the hash ring, each node is assigned multiple virtual points. For example, a single physical node might have 100 virtual nodes scattered around the hash space, typically at pseudo-random positions obtained by hashing the node name together with a virtual-node index.

How Virtual Nodes Work

Mapping: Each physical node corresponds to multiple virtual nodes, which are spread across the hash ring.
Key Assignment: Keys are assigned to virtual nodes, which are then mapped back to their corresponding physical nodes.
Load Balancing: Because each physical node owns many small slices scattered around the ring, the load evens out statistically. Nodes with more capacity can be given proportionally more virtual nodes so that they take a larger share of the keys.
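The three steps above can be sketched as follows. This is an illustrative toy under stated assumptions: the class name VNodeRing, the "#vnode" naming scheme for virtual points, the vnode counts, and the use of SHA-256 are not taken from any specific system.

```python
import bisect
import hashlib
from collections import Counter
from typing import Dict, List

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class VNodeRing:
    """Consistent hash ring where each physical node owns many virtual points."""

    def __init__(self, vnodes_per_node: Dict[str, int]) -> None:
        # Mapping: every virtual point remembers its physical node.
        self._owner: Dict[int, str] = {}
        for node, count in vnodes_per_node.items():
            for i in range(count):
                self._owner[ring_hash(f"{node}#vnode{i}")] = node
        self._points: List[int] = sorted(self._owner)

    def node_for(self, key: str) -> str:
        # Key assignment: find the next virtual point, then its physical node.
        i = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[i % len(self._points)]]

def load(ring: VNodeRing, num_keys: int = 20_000) -> Counter:
    """Count how many sample keys land on each physical node."""
    return Counter(ring.node_for(f"key-{i}") for i in range(num_keys))

# Equal capacity: the same number of virtual nodes for everyone.
equal = load(VNodeRing({"NodeA": 200, "NodeB": 200, "NodeC": 200}))
# Differing capacity: NodeC is given twice as many virtual nodes.
weighted = load(VNodeRing({"NodeA": 100, "NodeB": 100, "NodeC": 200}))
print(equal)
print(weighted)
```

In the equal case each node should end up near one third of the keys, and in the weighted case NodeC should receive roughly twice as many keys as either of the others.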

Example

Assume three physical nodes (NodeA, NodeB, NodeC) and a hash ring divided into 360 degrees.
Without virtual nodes, NodeA might occupy 0–120°, NodeB 120–240°, and NodeC 240–360°. If NodeB is removed, all keys from 120–240° shift to NodeC, creating a hotspot.
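With virtual nodes, NodeA could occupy hash points like 0°, 120°, 240°, and NodeB might occupy 30°, 150°, 270°, with NodeC's points interleaved among them. Removing NodeB only shifts the keys in the three slices that end at its points, and each slice moves to whichever node owns the next point clockwise, so the load can be spread across the remaining nodes instead of landing on one.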
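The example can be replayed in code by treating each whole degree from 0 to 359 as one key position. The points for NodeA and NodeB come from the example; the positions chosen for NodeC in the virtual-node layout (60°, 300°, 330°) are an assumption added for illustration, as are the helper names owner and takeover.

```python
import bisect
from collections import Counter
from typing import Dict

def owner(points: Dict[int, str], position: int) -> str:
    """Node at the first point at or after `position`, wrapping past 360."""
    ordered = sorted(points)
    i = bisect.bisect_left(ordered, position)
    return points[ordered[i % len(ordered)]]

def takeover(points: Dict[int, str], removed: str) -> Counter:
    """Per surviving node, how many degree positions it inherits from `removed`."""
    survivors = {p: n for p, n in points.items() if n != removed}
    return Counter(
        owner(survivors, pos)
        for pos in range(360)
        if owner(points, pos) == removed
    )

# One point per node: NodeA ends its range at 120, NodeB at 240, NodeC at 360 (= 0).
single = {120: "NodeA", 240: "NodeB", 0: "NodeC"}

# Three virtual points per node; NodeC's positions are hypothetical.
virtual = {
    0: "NodeA", 120: "NodeA", 240: "NodeA",
    30: "NodeB", 150: "NodeB", 270: "NodeB",
    60: "NodeC", 300: "NodeC", 330: "NodeC",
}

print(takeover(single, "NodeB"))   # NodeC inherits all 120 positions
print(takeover(virtual, "NodeB"))  # NodeC inherits 60 positions, NodeA 30
```

With these hand-picked positions the split is 60 to 30 rather than perfectly even; with many pseudo-randomly placed virtual nodes, the inherited slices tend to spread more evenly across the survivors.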

How Distributed Systems Use Virtual Nodes

Cassandra

Each node is responsible for multiple ranges in the hash space, thanks to virtual nodes.
This ensures even data distribution and minimizes the impact of adding/removing nodes.

DynamoDB

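Amazon's published Dynamo design, whose principles DynamoDB builds on, assigns each physical node multiple virtual nodes and replicates each key across several distinct physical nodes for fault tolerance and scalability. DynamoDB itself exposes this only indirectly, as partitions selected by hashing an item's partition key.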
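One common way to combine virtual nodes with replication is to walk clockwise from the key's position and collect the first N distinct physical nodes, skipping virtual points that belong to a node already chosen. The sketch below illustrates that idea only; it is not DynamoDB's actual implementation, and build_ring, replicas_for, the vnode count, and the key name are assumptions.

```python
import bisect
import hashlib
from typing import Dict, List

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

def build_ring(nodes: List[str], vnodes: int = 8) -> Dict[int, str]:
    """Map each virtual point on the ring to its physical node."""
    return {ring_hash(f"{node}#vnode{i}"): node for node in nodes for i in range(vnodes)}

def replicas_for(ring: Dict[int, str], key: str, replication_factor: int) -> List[str]:
    """Walk clockwise from the key and pick distinct physical nodes."""
    points = sorted(ring)
    start = bisect.bisect_left(points, ring_hash(key))
    chosen: List[str] = []
    for step in range(len(points)):
        node = ring[points[(start + step) % len(points)]]
        if node not in chosen:  # skip further virtual points of a chosen node
            chosen.append(node)
            if len(chosen) == replication_factor:
                break
    return chosen

ring = build_ring(["NodeA", "NodeB", "NodeC", "NodeD"])
replicas = replicas_for(ring, "user:42", replication_factor=3)
print(replicas)  # the first entry is the key's primary owner
```

Skipping repeated physical nodes matters because two adjacent virtual points may belong to the same machine, which would otherwise leave both copies of a key on one host.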

Memcached with Ketama

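Ketama, a consistent hashing scheme implemented in many Memcached client libraries, places multiple points per server on the ring (in effect, virtual nodes) to distribute keys across servers while minimizing rebalancing. The hashing happens on the client side; the Memcached servers themselves are unaware of each other.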
