Hash Tables

Hash Tables are critical in databases and there are a variety of strategy to implement them. In particular, static hashing schemes assume a number of elements known ahead of time while dynamic hashing schemes do not require to rebuild the entire table if its needs to grow or shrink in size

Static Hashing Schemes

Linear Probing

Linear probing resolves collisions using open addressing, meaning all elements are stored directly in the table. When a collision occurs, the algorithm searches for the next available slot by probing sequentially (i.e., linearly) until it finds an empty slot.

The primary issue is primary clustering: contiguous runs of occupied slots increase the likelihood of further collisions, leading to degraded performance under high load factors. The worst case can degrade to O(N) lookup time. Variants such as quadratic probing and double hashing attempt to reduce clustering by spreading out the probe sequence.

class LinearProbingHashTable:
    def __init__(self, size):
        self.size = size
        self.table = [None] * size
 
    def insert(self, key, value):
        index = hash(key) % self.size
        while self.table[index] is not None:
            index = (index + 1) % self.size
        self.table[index] = (key, value)

Robin Hood Hashing

Robin Hood hashing is an enhanced linear probing scheme that aims to equalize probe sequence lengths across keys. If a new key has a shorter probe sequence than an already present key, it steals the slot, displacing the longer-probed entry.

This ensures that variance in probe lengths is minimized, reducing lookup time for worst-case keys. However, insertion is slightly more expensive as it requires checking probe distances and potentially shifting elements.

class RobinHoodHashTable:
    def __init__(self, size):
        self.size = size
        self.table = [None] * size
 
    def insert(self, key, value):
        index = hash(key) % self.size
        probe_length = 0
 
        while self.table[index] is not None:
            existing_key, existing_value, existing_probe = self.table[index]
            if probe_length > existing_probe:
                self.table[index], (key, value, probe_length) = (key, value, probe_length), self.table[index]
            index = (index + 1) % self.size
            probe_length += 1
 
        self.table[index] = (key, value, probe_length)

Cuckoo Hashing

Cuckoo hashing uses two or more hash functions to provide multiple possible locations for each key. If both locations are occupied during insertion, an existing key is evicted and reinserted into its alternative location, potentially displacing another key and creating a cascading effect.

The advantage is O(1) lookups, but insertion may require multiple relocations, and in the worst case, can result in cycles that require rehashing the entire table with new hash functions.

class CuckooHashTable:
    def __init__(self, size):
        self.size = size
        self.table1 = [None] * size
        self.table2 = [None] * size
 
    def _hash1(self, key): return hash(key) % self.size
    def _hash2(self, key): return (hash(key) // self.size) % self.size
 
    def insert(self, key, value, depth=0):
        if depth > self.size:
            raise RuntimeError("Cycle detected, need rehashing")
 
        index1, index2 = self._hash1(key), self._hash2(key)
        if self.table1[index1] is None:
            self.table1[index1] = (key, value)
        elif self.table2[index2] is None:
            self.table2[index2] = (key, value)
        else:
            displaced = self.table1[index1]
            self.table1[index1] = (key, value)
            self.insert(displaced[0], displaced[1], depth + 1)

Dynamic Hashing SChemes

Chained Hashing

Chained hashing stores multiple values in the same slot using linked lists (or dynamic arrays). Instead of resolving collisions by moving keys, it groups colliding keys together within the same bucket.

This eliminates clustering issues and allows the hash table to grow dynamically. However, performance depends on the length of chains, which grows with the load factor.

class ChainedHashTable:
    def __init__(self, size):
        self.size = size
        self.table = [[] for _ in range(size)]
 
    def insert(self, key, value):
        index = hash(key) % self.size
        self.table[index].append((key, value))

Extendible Hashing

Extendible hashing is a dynamic, resizable hashing scheme based on a directory structure. It maintains a global depth (GD) and local depth (LD) per bucket.

As an example, we start with a GD=3 and a total directory of 8 entries, but a single bucket, therefore LD = 0.

At the beginning all the entries in the directory point to a single bucket
After we add few elements, the bucket will be full and split
Our LD will increase to 1, now we are going to use one bit (most significant or least significant) from the hash to decide where keys will be stored
Since our hash function is good, some elements will be re-arranged among buckets
The directory will be updated so that 4 entries point to bucket A, and 4 entries point to bucket B (depending on the key prefix!)
At a certain point, as our table grows, we will have LD = 3 = GD, and when we need to split, we will need LD = 4, but our directory only uses three bits. This is the moment where we will need to resize the entire directory

Important

When a bucket overflows, it is split, and the directory is updated without rehashing the entire table. This makes extendible hashing particularly useful in database indexing and file systems, where the ability to grow dynamically is crucial.

Linear Hashing

Linear hashing provides incremental growth without requiring full rehashing. Instead of expanding the entire table, it gradually splits buckets using a split pointer (n).

The hash function dynamically adapts: h(key) = key % (2^d), and only the overflowing bucket is split. This allows for efficient resizing, making it well-suited for database systems.

class LinearHashTable:
    def __init__(self, initial_size=4):
        self.size = initial_size
        self.table = [[] for _ in range(self.size)]
        self.split_pointer = 0
 
    def _hash(self, key):
        return key % self.size if key % self.size >= self.split_pointer else key % (2 * self.size)
 
    def insert(self, key, value):
        index = self._hash(key)
        self.table[index].append((key, value))
 
        if len(self.table[self.split_pointer]) > 2:
            self._split()
 
    def _split(self):
        self.table.append([])
        old_bucket = self.table[self.split_pointer]
        self.table[self.split_pointer] = []
        for key, value in old_bucket:
            self.insert(key, value)
        self.split_pointer += 1

Edmondo's Vault

Explorer