HyperLogLog is a probabilistic data structure designed for counting unique elements in a dataset with minimal memory usage. Unlike a traditional hash set, which requires storing every unique element, HyperLogLog only keeps a compressed representation of the data, allowing it to estimate cardinality with a small, fixed memory footprint.
Cardinality Estimation Without Storing Elements
A perfect hash function uniformly distributes input elements across a fixed-bit space. The probability that a randomly generated hash starts with k leading zeros follows a predictable pattern:
- The probability of a hash starting with k zeros is 1 / 2^k.
- If there are N unique elements, the probability that at least one hash has log₂(N) leading zeros is very high.
This means that by observing the longest sequence of leading zeros in a hash, we can approximate the number of unique elements in the dataset. However, relying on a single hash value would introduce large variance, so HyperLogLog splits elements into multiple buckets and averages across them to reduce error.
Algorithm Implementation
Each element is hashed into a fixed-length bitstring. The position of the leftmost 1-bit in the hash determines how “random” the dataset appears. This information is aggregated across multiple buckets to refine the estimate.
import hashlib
import math
class HyperLogLog:
def __init__(self, num_buckets=16):
self.num_buckets = num_buckets
self.buckets = [0] * num_buckets
def _hash(self, item):
hash_value = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
return hash_value & 0xFFFFFFFF
def _leading_zeros(self, n):
return len(bin(n)[2:].zfill(32)) - len(bin(n)[2:].lstrip('0'))
def add(self, item):
h = self._hash(item)
bucket_index = h % self.num_buckets
value = h >> int(math.log2(self.num_buckets))
self.buckets[bucket_index] = max(self.buckets[bucket_index], self._leading_zeros(value) + 1)
def estimate(self):
alpha = 0.7213 / (1 + 1.079 / self.num_buckets)
raw_estimate = alpha * self.num_buckets**2 / sum(2 ** -b for b in self.buckets)
if raw_estimate <= 2.5 * self.num_buckets:
v = self.buckets.count(0)
if v > 0:
return self.num_buckets * math.log(self.num_buckets / v)
return raw_estimate
hll = HyperLogLog(num_buckets=16)
for i in range(1000):
hll.add(f"user_{i}")
print(f"Estimated unique count: {hll.estimate()}")This implementation does not store individual elements. Instead, it tracks the maximum leading zero count per bucket, which allows an approximation of the dataset’s cardinality.
Memory Efficiency and Accuracy Trade-offs
The HyperLogLog algorithm achieves an error rate of approximately 1.6%. The memory requirement is constant, meaning it does not grow with the dataset size.
A hash set storing every unique item requires O(n) memory, whereas HyperLogLog maintains a compact O(1) structure. This trade-off makes it ideal for scenarios where exact accuracy is not required but efficiency is crucial.
Real-World Applications
Redis HyperLogLog (PFADD / PFCOUNT)
Redis provides a built-in HyperLogLog implementation using PFADD (to insert elements) and PFCOUNT (to estimate cardinality). This is useful for tracking large-scale analytics such as unique visitors, distinct search queries, or IP addresses.
import redis
r = redis.Redis(host='localhost', port=6379, decode_responses=True)
r.pfadd("unique_visitors", "user_1", "user_2", "user_3")
print(f"Estimated unique visitors: {r.pfcount('unique_visitors')}")In Redis, HyperLogLog uses only 12 KB of memory per key and supports billions of distinct values.
Use Cases in Large-Scale Systems
HyperLogLog is widely used in large-scale applications where exact counting is infeasible:
- Web Analytics: Counting unique visitors across millions of requests.
- Database Query Logs: Tracking distinct SQL queries issued over time.
- Network Monitoring: Estimating unique IP addresses in high-volume traffic.
- Fraud Detection: Identifying distinct users across multiple services.
Its efficiency makes it a fundamental tool for big data applications, enabling approximate distinct counting at web scale.