HyperLogLog is a probabilistic data structure designed for counting unique elements in a dataset with minimal memory usage. Unlike a traditional hash set, which requires storing every unique element, HyperLogLog only keeps a compressed representation of the data, allowing it to estimate cardinality with a small, fixed memory footprint.

Cardinality Estimation Without Storing Elements

A perfect hash function uniformly distributes input elements across a fixed-bit space. The probability that a randomly generated hash starts with k leading zeros follows a predictable pattern:

  • The probability of a hash starting with k zeros is 1 / 2^k.
  • If there are N unique elements, the probability that at least one hash has log₂(N) leading zeros is very high.

This means that by observing the longest sequence of leading zeros in a hash, we can approximate the number of unique elements in the dataset. However, relying on a single hash value would introduce large variance, so HyperLogLog splits elements into multiple buckets and averages across them to reduce error.

Algorithm Implementation

Each element is hashed into a fixed-length bitstring. The position of the leftmost 1-bit in the hash determines how “random” the dataset appears. This information is aggregated across multiple buckets to refine the estimate.

import hashlib
import math
 
class HyperLogLog:
    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        self.buckets = [0] * num_buckets
 
    def _hash(self, item):
        hash_value = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        return hash_value & 0xFFFFFFFF
    
    def _leading_zeros(self, n):
        return len(bin(n)[2:].zfill(32)) - len(bin(n)[2:].lstrip('0'))
 
    def add(self, item):
        h = self._hash(item)
        bucket_index = h % self.num_buckets
        value = h >> int(math.log2(self.num_buckets))
        self.buckets[bucket_index] = max(self.buckets[bucket_index], self._leading_zeros(value) + 1)
 
    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.num_buckets)
        raw_estimate = alpha * self.num_buckets**2 / sum(2 ** -b for b in self.buckets)
        
        if raw_estimate <= 2.5 * self.num_buckets:
            v = self.buckets.count(0)
            if v > 0:
                return self.num_buckets * math.log(self.num_buckets / v)
        
        return raw_estimate
 
hll = HyperLogLog(num_buckets=16)
for i in range(1000):
    hll.add(f"user_{i}")
print(f"Estimated unique count: {hll.estimate()}")

This implementation does not store individual elements. Instead, it tracks the maximum leading zero count per bucket, which allows an approximation of the dataset’s cardinality.

Memory Efficiency and Accuracy Trade-offs

The HyperLogLog algorithm achieves an error rate of approximately 1.6%. The memory requirement is constant, meaning it does not grow with the dataset size.

A hash set storing every unique item requires O(n) memory, whereas HyperLogLog maintains a compact O(1) structure. This trade-off makes it ideal for scenarios where exact accuracy is not required but efficiency is crucial.

Real-World Applications

Redis HyperLogLog (PFADD / PFCOUNT)

Redis provides a built-in HyperLogLog implementation using PFADD (to insert elements) and PFCOUNT (to estimate cardinality). This is useful for tracking large-scale analytics such as unique visitors, distinct search queries, or IP addresses.

import redis
 
r = redis.Redis(host='localhost', port=6379, decode_responses=True)
 
r.pfadd("unique_visitors", "user_1", "user_2", "user_3")
print(f"Estimated unique visitors: {r.pfcount('unique_visitors')}")

In Redis, HyperLogLog uses only 12 KB of memory per key and supports billions of distinct values.

Use Cases in Large-Scale Systems

HyperLogLog is widely used in large-scale applications where exact counting is infeasible:

  • Web Analytics: Counting unique visitors across millions of requests.
  • Database Query Logs: Tracking distinct SQL queries issued over time.
  • Network Monitoring: Estimating unique IP addresses in high-volume traffic.
  • Fraud Detection: Identifying distinct users across multiple services.

Its efficiency makes it a fundamental tool for big data applications, enabling approximate distinct counting at web scale.