Designing Metrics to Avoid High Cardinality
Cardinality is the number of distinct time series a TSDB stores. Every unique combination of label values creates a new series. A metric with 3 methods, 4 statuses, and 5 handlers produces 60 series. Add a user_id label with 100,000 users and cardinality jumps to 6 million.
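The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `series_count` is made up for illustration, and the product is a worst-case upper bound, since not every combination of label values necessarily occurs in practice.

```python
from math import prod

def series_count(label_value_counts):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())

base = {"method": 3, "status": 4, "handler": 5}
with_user = {**base, "user_id": 100_000}

print(series_count(base))       # 60
print(series_count(with_user))  # 6000000
```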
High cardinality is a problem because of how TSDBs find data at query time. They use an inverted index — a reverse map from (label_name, label_value) to a list of matching series IDs (the posting list). When a label has few unique values, each posting list covers many series and the index acts as a fast shortcut. When a label has as many unique values as there are series, each posting list has exactly one entry — the index is the same size as the data and offers no lookup advantage over a full scan.
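To make the posting-list argument concrete, here is a toy inverted index in Python. It is only a sketch of the idea, not how any particular TSDB lays out its index; the helper names and the 300-series data set are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(series):
    """Map (label_name, label_value) -> list of series IDs (the posting list)."""
    index = defaultdict(list)
    for series_id, labels in enumerate(series):
        for name, value in labels.items():
            index[(name, value)].append(series_id)
    return index

def avg_posting_list_len(index, label_name):
    """Average number of series behind each value of one label."""
    lengths = [len(ids) for (name, _), ids in index.items() if name == label_name]
    return sum(lengths) / len(lengths)

methods = ["GET", "POST", "DELETE"]
series = [{"method": methods[i % 3], "user_id": f"u_{i}"} for i in range(300)]
index = build_inverted_index(series)

print(avg_posting_list_len(index, "method"))   # 100.0
print(avg_posting_list_len(index, "user_id"))  # 1.0
```

Each `method` entry points at 100 series, so one index lookup narrows the search a lot. Each `user_id` entry points at a single series, so the index has as many entries as there are series.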
Prerequisite
The TSDB Internals notebook has an interactive section that lets you see this concretely: drag the user_id cardinality slider and watch the inverted index explode while query performance stays flat.
The rest of this note covers what you can do about it: schema design choices, query-time strategies, and infrastructure-level approaches.
1. The design question: does this dimension belong in metrics?
Before reaching for any mitigation, ask: will I aggregate over this label?
Metrics exist for aggregate signals — error rates, latency percentiles, throughput per endpoint. If the answer to sum(metric) by (some_label) is “one value per label value” with no further reduction, you’re not aggregating. You’re building a database row per entity and paying TSDB prices for it.
| Dimension | Aggregate? | Belongs in |
|---|---|---|
| method (GET, POST, DELETE) | Yes: sum by (method) groups 1000s of series into 3 buckets | Metric label |
| status (200, 404, 500) | Yes: error rate = sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Metric label |
| host (500 machines) | Maybe: depends on fleet size and whether you query per-host | Metric label if <1,000; see partitioning below if more |
| user_id (100k users) | No: each user is one series, no aggregation possible | Logs, OLAP store, or relational DB |
| request_id (every request) | No: unbounded cardinality, each value appears once | Tracing system (Jaeger, Tempo) |
The distinction is not about the number of values — it’s about whether the inverted index can provide leverage. A label with 500 values where queries typically match 50 of them gives 10x leverage. A label with 500 values where every query matches exactly 1 gives no leverage.
2. Schema design: controlling cardinality at the source
Recognize the labels that actually cause problems
High cardinality rarely comes from obviously stupid choices. It comes from labels that seem reasonable at first but grow unboundedly in production. The usual suspects:
| Label | Why it seems fine | Why it explodes |
|---|---|---|
| pod (Kubernetes) | “I need per-pod metrics” | Every deploy creates new pod names. A daily deploy of 50 services × 3 replicas = 150 new series per metric per day. Old pod series go stale but stay in the index until retention expires. |
| container_id | “For debugging OOM kills” | Every container restart generates a new ID. A crashlooping pod creates hundreds of series in hours. |
| url or path | “Per-endpoint latency” | If paths contain IDs (/api/users/12345/orders), every user creates a new series. A REST API with 100k users and 5 endpoints = 500k series from one metric. |
| ip or instance | “Per-host monitoring” | Fine at 100 hosts. At 10,000 hosts with 500 metrics each, that’s 5 million series: all legitimate, all needed, and all straining the index. |
| email | “Track per-customer errors” | Unbounded. Every new customer adds series that never aggregate. |
The pod and container_id cases are especially insidious because they create churn — a steady stream of new series replacing old ones. The TSDB’s inverted index grows with every new series, even if the old ones are no longer receiving samples. Most TSDBs only clean up stale index entries during compaction or after retention expiry.
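For the url/path row in the table above, a common fix is to collapse identifier-like path segments into a placeholder before the path is used as a label value. The sketch below is one way to do that; the `:id` placeholder, the function name, and the rule "all digits or a UUID means identifier" are assumptions, and real routes may need a framework's route template instead.

```python
import re

# Assumption: a segment made only of digits, or shaped like a UUID, is an identifier.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def normalize_path(path):
    """Replace identifier-like path segments with a fixed placeholder."""
    segments = path.split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)

print(normalize_path("/api/users/12345/orders"))  # /api/users/:id/orders
```

With this in place, 100k users hitting the same endpoint produce one label value instead of 100k.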
Drop labels at ingestion that you never query by
Exporters often emit labels that are useful for debugging but never appear in dashboard queries. Every label you keep multiplies cardinality.
VictoriaMetrics’s -relabelConfig and Prometheus’s metric_relabel_configs let you drop or rewrite labels before they hit the index:
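# Drop the 'pod' label from all kube-state-metrics series.
# labeldrop matches label names only and cannot be scoped by metric name,
# so overwrite 'pod' with an empty value instead (an empty label is removed).
- source_labels: [__name__]
  regex: 'kube_.*'
  target_label: pod
  replacement: ''
  action: replace

Audit your labels periodically. If a label hasn’t appeared in a query in 30 days, it’s a candidate for dropping.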
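Before dropping a label, it can help to estimate how much cardinality it actually contributes. The following Python sketch counts distinct label sets with and without a given label; the data set (3 namespaces with 50 pods each) and the function name are hypothetical.

```python
def cardinality(series, drop=()):
    """Count distinct label sets after removing the labels named in `drop`."""
    return len({
        frozenset((k, v) for k, v in labels.items() if k not in drop)
        for labels in series
    })

# Hypothetical data: 3 namespaces x 50 pods, one series per pod.
series = [
    {"namespace": f"ns-{n}", "pod": f"app-{n}-{p}"}
    for n in range(3) for p in range(50)
]

print(cardinality(series))                # 150
print(cardinality(series, drop={"pod"}))  # 3
```

Note that when the dropped label was the only thing distinguishing two series, they collapse into the same label set. How colliding samples are handled depends on the TSDB, so it is worth checking that behavior before rolling out a drop rule.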
Separate high-cardinality dimensions into different metric names
Sometimes you need both an aggregate view and a per-entity view. Instead of one metric with a high-cardinality label, create two metrics:
# Low-cardinality: for dashboards and alerting
http_requests_total{method="GET", handler="/api/users", status="200"}
# High-cardinality: for ad-hoc debugging, short retention
http_requests_by_user_total{user_id="u_1001"}
Apply different retention policies: 1 year for the aggregate, 7 days for the per-user series. The short-lived high-cardinality metric won’t accumulate enough index weight to become a problem.
3. Query-time strategies: recording rules and pre-aggregation
Recording rules
If every dashboard shows sum(http_requests_total) by (handler, status), that query scans thousands of raw series on every refresh. A recording rule materializes the aggregation as a new, low-cardinality series:
groups:
  - name: request_aggregates
    interval: 15s
    rules:
      - record: http_requests:by_handler_status:rate5m
        expr: sum(rate(http_requests_total[5m])) by (handler, status)

The TSDB evaluates this expression once every 15 seconds and stores the result as a new metric with only handler × status series. Dashboards query http_requests:by_handler_status:rate5m, a handful of series, instead of scanning the raw high-cardinality data.
Recording rules don’t reduce storage cardinality (the raw series still exist), but they reduce query-time cost dramatically. Combine with short retention on the raw data for full effect.
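What a recording rule precomputes can be illustrated with a small Python sketch of `sum ... by (handler, status)`. This is not how a rule evaluator is implemented; the `sum_by` helper, the `pod` label, and the sample values are assumptions chosen to show the reduction in series count.

```python
from collections import defaultdict

def sum_by(samples, by):
    """Sum values of series that share the same values for the `by` labels."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        out[key] += value
    return dict(out)

# Hypothetical per-pod request rates (requests per second).
raw = [
    ({"handler": "/api/users", "status": "200", "pod": f"web-{i}"}, 2.0)
    for i in range(100)
] + [
    ({"handler": "/api/users", "status": "500", "pod": f"web-{i}"}, 0.5)
    for i in range(100)
]

aggregated = sum_by(raw, by=("handler", "status"))
print(len(raw), "->", len(aggregated))  # 200 -> 2
```

A dashboard reading the two aggregated series does the same work no matter how many pods sit behind them; the cost of touching all 200 raw series is paid once per evaluation interval rather than once per dashboard refresh.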
Downsampling
VictoriaMetrics’s -downsampling.period flag (an Enterprise feature) automatically thins out older data: for samples older than a given age, it keeps only the last sample in each interval. For example, -downsampling.period=7d:1m keeps raw 1-second resolution for 7 days, then one sample per minute for long-term storage. This reduces the number of samples (not series), but combined with recording rules it shrinks the query surface significantly.
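The effect of downsampling on sample count can be sketched in Python. This assumes a "keep the last sample per interval" policy and a time-ordered input; the function name and the synthetic 10-minute series are made up for illustration.

```python
def downsample_last(samples, interval_s):
    """Keep the last sample in each interval.

    `samples` is a time-ordered list of (unix_ts, value) pairs.
    """
    last_in_bucket = {}
    for ts, value in samples:
        last_in_bucket[ts // interval_s] = (ts, value)
    return [last_in_bucket[b] for b in sorted(last_in_bucket)]

raw = [(t, float(t)) for t in range(0, 600)]  # 1-second resolution, 10 minutes
coarse = downsample_last(raw, interval_s=60)
print(len(raw), "->", len(coarse))  # 600 -> 10
```

The series itself still exists after downsampling, with the same labels and the same index entries. Only the number of points stored per series shrinks.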
4. Infrastructure: distributing the index when you can’t reduce it
Sometimes cardinality is irreducible. If you monitor 10,000 hosts, you have 10,000 values for the instance label and you genuinely query by instance. You can’t drop it, bucket it, or aggregate it away.
Partitioning by tenant or namespace
Split the data into independent TSDB instances, each handling a subset:
- Per-team instances: Team A’s Prometheus/VictoriaMetrics stores team A’s metrics. Each instance sees only its slice of the cardinality.
- Federated Prometheus: A top-level Prometheus scrapes aggregated /federate endpoints from per-team instances. Dashboards showing team-level metrics hit the small federation; drill-down queries hit the team’s instance.
- VictoriaMetrics multi-tenancy: The cluster version supports native multi-tenancy via URL-path routing (/insert/1/..., /select/1/...). Each tenant’s inverted index is independent.
Partitioning doesn’t reduce total work — it distributes it so no single index becomes unmanageable. The cost is operational complexity (more instances to run) and cross-partition queries become harder or impossible.
Sharding the inverted index
In a clustered TSDB like VictoriaMetrics, series are sharded across vmstorage nodes, and each node keeps the inverted index for its own series only. A query for {instance=~"host-.*"} is scatter-gathered: vmselect sends it to every vmstorage node, each node searches its local index and returns partial results, and vmselect merges them.
This introduces latency — the query now involves network round-trips and a merge step. But it means no single node needs to hold the entire index in memory. The tradeoff:
| Approach | Index memory per node | Query latency | Operational cost |
|---|---|---|---|
| Single node, full index | O(cardinality) | Low (local lookup) | Low |
| Sharded across N nodes | O(cardinality / N) | Higher (scatter-gather + merge) | Higher (N nodes) |
| Partitioned by tenant | O(cardinality / tenants) | Low within partition | Higher (many instances) |
The sharding approach scales the index horizontally but trades single-node lookup speed for distributed coordination. For most deployments, partitioning by tenant or team is simpler and sufficient.
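The scatter-gather pattern described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not VictoriaMetrics's actual routing or storage code: the `Shard` class, the CRC32-modulo routing rule, and the `dc` label are all hypothetical.

```python
from zlib import crc32

def shard_for(series_key, n_shards):
    """Assumed routing rule: stable hash of the series key modulo shard count."""
    return crc32(series_key.encode()) % n_shards

class Shard:
    def __init__(self):
        self.postings = {}  # (label_name, label_value) -> set of series keys

    def add(self, series_key, labels):
        for item in labels.items():
            self.postings.setdefault(item, set()).add(series_key)

    def lookup(self, name, value):
        return self.postings.get((name, value), set())

def scatter_gather(shards, name, value):
    """Ask every shard for its local matches, then merge the partial results."""
    merged = set()
    for shard in shards:
        merged |= shard.lookup(name, value)
    return merged

shards = [Shard() for _ in range(4)]
for i in range(1000):
    key = f"cpu_usage{{instance=host-{i}}}"
    labels = {"instance": f"host-{i}", "dc": "eu" if i % 2 else "us"}
    shards[shard_for(key, len(shards))].add(key, labels)

matches = scatter_gather(shards, "dc", "eu")
print(len(matches))  # 500
```

Each shard only indexes the series routed to it, which is where the O(cardinality / N) memory figure in the table comes from. The loop in `scatter_gather` stands in for the network round-trips and merge step that add latency in a real cluster.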
Caching posting lists
Some TSDBs (including VictoriaMetrics) cache frequently-accessed posting lists in memory. If the same high-cardinality query runs repeatedly (e.g., a dashboard auto-refreshing every 30 seconds), the posting list intersection result is cached and subsequent queries skip the index entirely.
This doesn’t solve the cold-query problem (first query is still slow), but it makes dashboards fast for repeated access patterns. VictoriaMetrics exposes the cache hit rate via vm_cache_requests_total and vm_cache_misses_total — monitor these to know whether caching is helping.
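Turning the two counters into a hit ratio is simple arithmetic, sketched below in Python. The input numbers are hypothetical; in practice both values would be counter increases over the same time window rather than raw totals since process start.

```python
def cache_hit_ratio(requests_total, misses_total):
    """Hit ratio = 1 - misses / requests; None when there were no requests."""
    if requests_total == 0:
        return None
    return 1.0 - misses_total / requests_total

print(cache_hit_ratio(1000, 250))  # 0.75
```

A ratio that stays low for a cache serving repeated dashboard queries suggests that caching is not absorbing the high-cardinality lookups, and that one of the schema-level strategies above is the better lever.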
Summary
| Strategy | When to use | Reduces cardinality? | Reduces query cost? |
|---|---|---|---|
| Remove dimension from labels | Dimension has no aggregate value | Yes | Yes |
| Drop unused labels at ingestion | Labels never queried | Yes | Yes |
| Short retention for high-card metrics | Need per-entity data temporarily | Limits growth | Limits growth |
| Recording rules | Dashboards repeatedly aggregate same data | No | Yes |
| Partition by tenant/team | Irreducible cardinality from org structure | Per-partition | Per-partition |
| Shard the index (cluster mode) | Single-node index exceeds memory | Per-node | Adds latency |
| Posting list caching | Repeated queries on hot data | No | Yes (after first hit) |
The most impactful strategies are at the top of this list. Get the schema right — keep only labels you aggregate over — and most cardinality problems never arise.
See also
- VictoriaMetrics — architecture, compression, and operational details
- TSDB Internals notebook — interactive exploration of compression and the inverted index
- LSM Compaction — write amplification and how partitioning limits it