Hash indexes are designed to map keys to storage locations quickly, making them ideal for applications requiring fast lookups. Key use cases include key-value stores, such as Redis or DynamoDB, where data retrieval by key is critical for performance. They are also useful in scenarios like in-memory caching, where fast access to recently used data is necessary, and in deduplication systems, where hash indexes can efficiently detect duplicates by storing unique hashes.
In relational databases, hash indexes optimize equality lookups (e.g., WHERE id = 123). Unlike tree-based indexes (e.g., B-trees), hash indexes are not suitable for range queries, as they lack an inherent order.
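The difference between equality and range lookups can be illustrated with a small sketch. It uses a Python `dict` as a stand-in for a hash index and a sorted list as a stand-in for an ordered (tree-like) index; the sample rows and variable names are made up for illustration.

```python
import bisect

# Hypothetical rows keyed by id.
rows = {123: "alice", 7: "bob", 456: "carol", 42: "dave"}

# Hash-style index: an equality lookup goes straight to the entry.
hash_index = dict(rows)
match = hash_index[123]

# A range query on the hash index has no order to exploit,
# so every key must be examined.
in_range_scan = sorted(k for k in hash_index if 40 <= k <= 200)

# Ordered index: binary search finds the range boundaries directly.
ordered_keys = sorted(rows)
lo = bisect.bisect_left(ordered_keys, 40)
hi = bisect.bisect_right(ordered_keys, 200)
in_range_ordered = ordered_keys[lo:hi]
```

Both approaches return the same ids here, but the hash index touches every key to answer the range query, while the ordered index touches only the matching slice.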
How a Hash Index Works
A hash index relies on a hash function to compute a numeric value (hash code) from a key. This hash code determines the location of the value in an array of storage locations, known as buckets. For example, a hash function applied to the key "key1" might produce the value 1234. If there are 10 buckets, the bucket number is calculated as 1234 % 10, placing "key1" in bucket 4.
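The bucket calculation above can be sketched as follows. The choice of SHA-256 and the helper name `bucket_for` are assumptions made for illustration; a real index would typically use a faster, non-cryptographic hash function.

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index using a stable hash (illustrative helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer hash code.
    hash_code = int.from_bytes(digest[:8], "big")
    return hash_code % num_buckets

# The arithmetic from the text: a hash code of 1234 with 10 buckets.
example_bucket = 1234 % 10  # 4

# The same key always lands in the same bucket.
first = bucket_for("key1", 10)
second = bucket_for("key1", 10)
```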
When multiple keys hash to the same bucket, a situation called a collision, the index must resolve it. Common strategies include chaining, where each bucket stores a list of key-value pairs, and open addressing, where the system probes for the next available bucket. Chaining is often chosen for its simplicity and because it degrades gracefully as buckets fill up, while open addressing keeps entries in one contiguous array and tends to be more cache-friendly.
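A minimal sketch of chaining, assuming a fixed number of buckets and no resizing. The class name `ChainedHashIndex` and its methods are hypothetical, and Python's built-in `hash` stands in for the index's hash function.

```python
class ChainedHashIndex:
    """Toy hash index that resolves collisions by chaining (no resizing)."""

    def __init__(self, num_buckets=8):
        # Each bucket is a list of (key, value) pairs.
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Built-in hash() is stable within a single process, which is enough here.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))  # new key: append to the chain

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

# With a single bucket every key collides, yet lookups still succeed.
index = ChainedHashIndex(num_buckets=1)
index.put("key1", "a")
index.put("key2", "b")
index.put("key1", "c")
```

Lookups walk the chain in the target bucket, so performance depends on keeping chains short, which is why production implementations grow the bucket array as the index fills.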
Distributed Hash Indexes
In distributed systems, hash indexes are sharded across multiple nodes. Consistent hashing is a common approach, dividing the hash space into segments assigned to nodes. When a node is added or removed, only the keys in the affected segments need to be redistributed, which is on average about 1/N of the keys in a cluster of N nodes. For fault tolerance, buckets are replicated across nodes.
At its core, consistent hashing maps both keys and nodes onto the same circular hash space. Keys are then assigned to the first node that appears on the circle when moving clockwise.
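The clockwise lookup can be sketched with a sorted list of node positions and a binary search. The class name `HashRing`, the helper `ring_hash`, and the use of SHA-256 are assumptions for illustration rather than any particular system's implementation.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Place a string on the ring (illustrative: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    """Basic consistent hashing: one point per node."""

    def __init__(self, nodes):
        self._points = sorted((ring_hash(node), node) for node in nodes)
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # First node position at or after the key's position ("clockwise"),
        # wrapping around to the start of the ring if needed.
        i = bisect.bisect_left(self._hashes, ring_hash(key))
        return self._points[i % len(self._points)][1]

ring = HashRing(["NodeA", "NodeB", "NodeC"])
smaller_ring = HashRing(["NodeA", "NodeC"])  # same ring with NodeB removed
keys = [f"key{i}" for i in range(1000)]
```

Comparing `ring.node_for(k)` with `smaller_ring.node_for(k)` for each key shows the defining property: only keys that were on `NodeB` change owner.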
Basic Consistent Hashing
In basic consistent hashing, each node is represented by a single point on the hash ring. This creates two main challenges:
- Uneven Load Distribution: If hash values aren’t evenly distributed, some nodes may handle significantly more keys than others.
- Limited Scalability: Adding or removing nodes can result in uneven shifts of keys between adjacent nodes.
Virtual Nodes (vnodes) to the Rescue
To solve these problems, consistent hashing often uses virtual nodes. Instead of each physical node having a single point on the hash ring, each node is assigned multiple virtual points. For example, a single physical node might have 100 virtual nodes distributed evenly around the hash space.
How Virtual Nodes Work
- Mapping: Each physical node corresponds to multiple virtual nodes, which are spread across the hash ring.
- Key Assignment: Keys are assigned to virtual nodes, which are then mapped back to their corresponding physical nodes.
- Load Balancing: Spreading many virtual nodes across the ring evens out each physical node's share of keys, and a physical node with more capacity can be given more virtual nodes so that it takes a proportionally larger share.
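The three points above can be sketched as a ring in which each physical node contributes several positions. The class name `VNodeRing`, the `"{node}#vnode{i}"` labelling scheme, and the default of 100 virtual nodes are assumptions chosen for illustration.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    """Place a string on the ring (illustrative: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class VNodeRing:
    """Consistent hashing where each physical node owns many ring positions."""

    def __init__(self, nodes, vnodes_per_node=100):
        points = []
        for node in nodes:
            for i in range(vnodes_per_node):
                # Mapping: each virtual node hashes to its own ring position.
                points.append((ring_hash(f"{node}#vnode{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._owners = [node for _, node in points]

    def node_for(self, key: str) -> str:
        # Key assignment: find the next virtual node clockwise,
        # then map it back to its physical node.
        i = bisect.bisect_left(self._hashes, ring_hash(key))
        return self._owners[i % len(self._owners)]

def load(ring, keys):
    """Count how many keys each physical node receives."""
    return Counter(ring.node_for(k) for k in keys)

keys = [f"key{i}" for i in range(3000)]
counts = load(VNodeRing(["NodeA", "NodeB", "NodeC"]), keys)
```

Running `load` with `vnodes_per_node=1` and again with a larger value is a simple way to see how the per-node counts tend to even out as the number of virtual nodes grows.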
Example
- Assume three physical nodes (NodeA, NodeB, NodeC) and a hash ring divided into 360 degrees.
- Without virtual nodes, NodeA might occupy 0–120°, NodeB 120–240°, and NodeC 240–360°. If NodeB is removed, all keys from 120–240° shift to NodeC, creating a hotspot.
- With virtual nodes, NodeA could occupy hash points like 0°, 120°, 240°, and NodeB might occupy 30°, 150°, 270°. Removing NodeB only shifts keys assigned to its specific virtual nodes.
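The example can be worked through on a 360-degree ring, treating each whole degree as one key. The positions for NodeA and NodeB in the virtual-node case come from the text. NodeC's positions (60°, 130°, 250°) are not given there and are assumed here. In the single-point case, each node is placed at the clockwise end of its range, with NodeC at 0° (equivalently 360°).

```python
import bisect

def build(assignments):
    """Turn {node: [degrees]} into a sorted list of (degree, node) points."""
    return sorted((d, node) for node, degrees in assignments.items() for d in degrees)

def owner(points, angle):
    """Return the node at the first point at or after `angle`, wrapping at 360."""
    degrees = [d for d, _ in points]
    i = bisect.bisect_left(degrees, angle % 360)
    return points[i % len(points)][1]

def moved_after_removal(assignments, removed):
    """Count, per receiving node, how many whole-degree keys change owner."""
    before = build(assignments)
    after = [(d, node) for d, node in before if node != removed]
    moved = {}
    for angle in range(360):
        new_owner = owner(after, angle)
        if owner(before, angle) != new_owner:
            moved[new_owner] = moved.get(new_owner, 0) + 1
    return moved

# One point per node: NodeA ends at 120, NodeB at 240, NodeC at 360 (= 0).
single = {"NodeA": [120], "NodeB": [240], "NodeC": [0]}

# Virtual nodes: NodeA and NodeB as in the text; NodeC's points are assumed.
with_vnodes = {
    "NodeA": [0, 120, 240],
    "NodeB": [30, 150, 270],
    "NodeC": [60, 130, 250],
}

single_moved = moved_after_removal(single, "NodeB")      # {"NodeC": 120}
vnode_moved = moved_after_removal(with_vnodes, "NodeB")  # {"NodeC": 30, "NodeA": 40}
```

With a single point per node, all 120 of NodeB's keys land on NodeC. With the virtual-node layout, NodeB's 70 keys are split between NodeC and NodeA. The split depends on which node follows each of NodeB's points, which is why real systems place virtual nodes pseudo-randomly rather than in a fixed repeating order.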
How Distributed Systems Use Virtual Nodes
Cassandra
- Each node is responsible for multiple ranges in the hash space, thanks to virtual nodes.
- This ensures even data distribution and minimizes the impact of adding/removing nodes.
DynamoDB
- The original Dynamo design, whose principles DynamoDB builds on, uses virtual nodes to assign partitions to physical nodes and replicates data across multiple physical nodes for fault tolerance and scalability.
Memcached with Ketama
- Ketama, a consistent hashing scheme implemented in many Memcached client libraries, places multiple points per server on the hash ring to distribute keys across servers while minimizing rebalancing when servers are added or removed.