Designing Metrics to Avoid High Cardinality

Cardinality is the number of distinct time series a TSDB stores. Every unique combination of label values creates a new series. A metric with 3 methods, 4 statuses, and 5 handlers produces 60 series. Add a user_id label with 100,000 users and cardinality jumps to 6 million.
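The arithmetic above can be sketched in a few lines of Python. The function name max_series is illustrative, and the result is an upper bound: not every combination of label values necessarily occurs in practice.

```python
from math import prod

def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

base = {"method": 3, "status": 4, "handler": 5}
print(max_series(base))                          # 60
print(max_series({**base, "user_id": 100_000}))  # 6000000
```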

High cardinality is a problem because of how TSDBs find data at query time. They use an inverted index — a reverse map from (label_name, label_value) to a list of matching series IDs (the posting list). When a label has few unique values, each posting list covers many series and the index acts as a fast shortcut. When a label has as many unique values as there are series, each posting list has exactly one entry — the index is the same size as the data and offers no lookup advantage over a full scan.
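A toy inverted index makes the difference in posting-list size visible. This is a minimal sketch, not how any particular TSDB lays out its index; build_index and the sample labels are made up for illustration.

```python
from collections import defaultdict

def build_index(series: list[dict[str, str]]) -> dict[tuple[str, str], list[int]]:
    """Map each (label_name, label_value) pair to the IDs of the series carrying it."""
    index = defaultdict(list)
    for series_id, labels in enumerate(series):
        for name, value in labels.items():
            index[(name, value)].append(series_id)
    return dict(index)

# Six series: two methods, and a distinct user_id on every series.
series = [{"method": m, "user_id": f"u{i}"} for i, m in enumerate(["GET", "POST"] * 3)]
index = build_index(series)

print(index[("method", "GET")])   # [0, 2, 4]  one entry covers half the series
print(index[("user_id", "u3")])   # [3]        one entry per series
```

The method label needs two index entries to cover six series, while user_id needs six entries, one per series.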

Prerequisite

The TSDB Internals notebook has an interactive section that lets you see this concretely — drag the user_id cardinality slider and watch the inverted index explode while query performance stays flat.

The rest of this note covers what you can do about it: schema design choices, query-time strategies, and infrastructure-level approaches.

1. The design question: does this dimension belong in metrics?

Before reaching for any mitigation, ask: will I aggregate over this label?

Metrics exist for aggregate signals — error rates, latency percentiles, throughput per endpoint. If the answer to sum(metric) by (some_label) is “one value per label value” with no further reduction, you’re not aggregating. You’re building a database row per entity and paying TSDB prices for it.

Dimension | Aggregate? | Belongs in
--- | --- | ---
method (GET, POST, DELETE) | Yes: sum by (method) groups 1000s of series into 3 buckets | Metric label
status (200, 404, 500) | Yes: error rate = rate({status=~"5.."}) / rate(total) | Metric label
host (500 machines) | Maybe: depends on fleet size and whether you query per-host | Metric label if <1,000; see partitioning below if more
user_id (100k users) | No: each user is one series, no aggregation possible | Logs, OLAP store, or relational DB
request_id (every request) | No: infinite cardinality, each value appears once | Tracing system (Jaeger, Tempo)

The distinction is not about the number of values — it’s about whether the inverted index can provide leverage. A label with 500 values where queries typically match 50 of them gives 10x leverage. A label with 500 values where every query matches exactly 1 gives no leverage.

2. Schema design: controlling cardinality at the source

Recognize the labels that actually cause problems

High cardinality rarely comes from obviously stupid choices. It comes from labels that seem reasonable at first but grow unboundedly in production. The usual suspects:

Label | Why it seems fine | Why it explodes
--- | --- | ---
pod (Kubernetes) | “I need per-pod metrics” | Every deploy creates new pod names. A daily deploy of 50 services × 3 replicas = 150 new series per metric per day. Old pod series go stale but stay in the index until retention expires.
container_id | “For debugging OOM kills” | Every container restart generates a new ID. A crashlooping pod creates hundreds of series in hours.
url or path | “Per-endpoint latency” | If paths contain IDs (/api/users/12345/orders), every user creates a new series. A REST API with 100k users and 5 endpoints = 500k series from one metric.
ip or instance | “Per-host monitoring” | Fine at 100 hosts. At 10,000 hosts with 500 metrics each, that’s 5 million series: all legitimate, all needed, and all straining the index.
email | “Track per-customer errors” | Unbounded. Every new customer adds series that never aggregate.

The pod and container_id cases are especially insidious because they create churn — a steady stream of new series replacing old ones. The TSDB’s inverted index grows with every new series, even if the old ones are no longer receiving samples. Most TSDBs only clean up stale index entries during compaction or after retention expiry.
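The effect of churn on index size can be estimated with simple arithmetic. The sketch below assumes every deploy replaces all pods and that stale series leave the index only when retention expires; the 30-day retention is a hypothetical value, and series_in_index is an illustrative name.

```python
def series_in_index(pods_per_deploy: int, deploys_per_day: int, retention_days: int) -> int:
    """Series per metric held in the index once churn reaches steady state.

    Assumes each deploy replaces every pod and stale series are dropped
    only after `retention_days`.
    """
    return pods_per_deploy * deploys_per_day * retention_days

live = 50 * 3  # 50 services x 3 replicas
print(live)                           # 150 series receiving samples
print(series_in_index(live, 1, 30))   # 4500 series in the index
```

Under these assumptions the index holds 30 times more series per metric than are actually receiving samples.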

Drop labels at ingestion that you never query by

Exporters often emit labels that are useful for debugging but never appear in dashboard queries. Every label you keep multiplies cardinality.

VictoriaMetrics’s -relabelConfig and Prometheus’s metric_relabel_configs let you drop or rewrite labels before they hit the index:

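# Remove the 'pod' label from all kube-state-metrics series.
# labeldrop matches label names only and cannot be scoped by metric name,
# so blank the label with a conditional replace instead
# (setting a label to an empty value removes it).
- source_labels: [__name__]
  regex: 'kube_.*'
  target_label: pod
  replacement: ''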

Audit your labels periodically. If a label hasn’t appeared in a query in 30 days, it’s a candidate for dropping.
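One rough way to run such an audit is to compare the labels you ingest against the text of the queries you actually run. The sketch below assumes you can export those queries as strings (from dashboards or a query log); unused_labels is a hypothetical helper, and it does plain text matching rather than parsing PromQL.

```python
import re

def unused_labels(known_labels: set[str], queries: list[str]) -> set[str]:
    """Return labels that no query mentions as a whole word.

    A crude text match, not a PromQL parser: a label counts as used even if
    the word only appears as a metric name or inside a string value.
    """
    used = {
        label
        for label in known_labels
        if any(re.search(rf"\b{re.escape(label)}\b", q) for q in queries)
    }
    return known_labels - used

queries = [
    'sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)',
    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method))',
]
labels = {"handler", "method", "status", "pod", "container_id"}
print(sorted(unused_labels(labels, queries)))   # ['container_id', 'pod']
```

Treat the output as a list of candidates to review, not as a list of labels that are safe to drop.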

Separate high-cardinality dimensions into different metric names

Sometimes you need both an aggregate view and a per-entity view. Instead of one metric with a high-cardinality label, create two metrics:

# Low-cardinality: for dashboards and alerting
http_requests_total{method="GET", handler="/api/users", status="200"}

# High-cardinality: for ad-hoc debugging, short retention
http_requests_by_user_total{user_id="u_1001"}

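Apply different retention policies: 1 year for the aggregate, 7 days for the per-user series. Prometheus applies a single retention setting per server, so in practice this usually means routing the per-user metric to a separate instance or tenant. Because the per-user series expire after a week, their index entries are removed before they can pile up.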

3. Query-time strategies: recording rules and pre-aggregation

Recording rules

If every dashboard shows sum(http_requests_total) by (handler, status), that query scans thousands of raw series on every refresh. A recording rule materializes the aggregation as a new, low-cardinality series:

groups:
  - name: request_aggregates
    interval: 15s
    rules:
      - record: http_requests:by_handler_status:rate5m
        expr: sum(rate(http_requests_total[5m])) by (handler, status)

The rule evaluator (Prometheus itself, or vmalert in a VictoriaMetrics setup) runs this expression once every 15 seconds and writes the result back as a new metric with at most handler × status series. Dashboards query http_requests:by_handler_status:rate5m, a handful of series, instead of scanning the raw high-cardinality data.

Recording rules don’t reduce storage cardinality (the raw series still exist), but they reduce query-time cost dramatically. Combine with short retention on the raw data for full effect.
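The reduction a recording rule buys can be illustrated by grouping raw series the way sum(...) by (handler, status) does. The data below is hypothetical (200 pods, 5 handlers, 4 statuses, each series at 1 request per second), and sum_by is an illustrative helper, not a Prometheus API.

```python
from collections import defaultdict
from itertools import product

def sum_by(samples: list[tuple[dict[str, str], float]], keep: tuple[str, ...]) -> dict[tuple, float]:
    """Sum sample values grouped by the labels named in `keep`."""
    totals: dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels[name] for name in keep)] += value
    return dict(totals)

handlers = ["/a", "/b", "/c", "/d", "/e"]
statuses = ["200", "404", "500", "503"]
raw = [
    ({"pod": f"pod-{p}", "handler": h, "status": s}, 1.0)
    for p, h, s in product(range(200), handlers, statuses)
]
recorded = sum_by(raw, keep=("handler", "status"))

print(len(raw))                  # 4000 raw series scanned per dashboard refresh
print(len(recorded))             # 20 series after the rule
print(recorded[("/a", "200")])   # 200.0
```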

Downsampling

VictoriaMetrics’s -downsampling.period flag (an Enterprise feature) automatically thins out samples older than a given age, keeping one sample per interval. For example, keep 1-second resolution for 7 days, then downsample to 1-minute resolution for long-term storage (-downsampling.period=7d:1m). This reduces the number of samples (not series), but combined with recording rules it shrinks the query surface significantly.
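The effect on sample counts can be sketched as interval thinning. This example assumes a keep-the-last-sample policy; other systems may instead store averages or min/max per interval. The function name downsample_last is illustrative.

```python
def downsample_last(samples: list[tuple[int, float]], interval_s: int) -> list[tuple[int, float]]:
    """Keep the last sample in each `interval_s`-second bucket.

    `samples` is a list of (unix_timestamp_seconds, value) sorted by timestamp.
    """
    last_per_bucket: dict[int, tuple[int, float]] = {}
    for ts, value in samples:
        last_per_bucket[ts // interval_s] = (ts, value)  # later samples overwrite earlier ones
    return [last_per_bucket[bucket] for bucket in sorted(last_per_bucket)]

raw = [(t, float(t)) for t in range(180)]   # 1-second resolution for 3 minutes
thinned = downsample_last(raw, interval_s=60)
print(len(raw), len(thinned))   # 180 3
print(thinned)                  # [(59, 59.0), (119, 119.0), (179, 179.0)]
```

The series count is unchanged: one series goes in and one comes out, with 60 times fewer samples.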

4. Infrastructure: distributing the index when you can’t reduce it

Sometimes cardinality is irreducible. If you monitor 10,000 hosts, you have 10,000 values for the instance label and you genuinely query by instance. You can’t drop it, bucket it, or aggregate it away.

Partitioning by tenant or namespace

Split the data into independent TSDB instances, each handling a subset:

  • Per-team instances: Team A’s Prometheus/VictoriaMetrics stores team A’s metrics. Each instance sees only its slice of the cardinality.
  • Federated Prometheus: A top-level Prometheus scrapes aggregated /federate endpoints from per-team instances. Dashboards showing team-level metrics hit the small federation; drill-down queries hit the team’s instance.
  • VictoriaMetrics multi-tenancy: The cluster version supports native multi-tenancy via URL-path routing (/insert/1/..., /select/1/...). Each tenant’s inverted index is independent.

Partitioning doesn’t reduce total work — it distributes it so no single index becomes unmanageable. The cost is operational complexity (more instances to run) and cross-partition queries become harder or impossible.

Sharding the inverted index

In a clustered TSDB like VictoriaMetrics, the inverted index is sharded across vmstorage nodes, and the stateless vmselect nodes fan queries out to them. A query for {instance=~"host-.*"} is scatter-gathered: each shard searches its local index and returns partial results, which vmselect merges.

This introduces latency — the query now involves network round-trips and a merge step. But it means no single node needs to hold the entire index in memory. The tradeoff:

Approach | Index memory per node | Query latency | Operational cost
--- | --- | --- | ---
Single node, full index | O(cardinality) | Low (local lookup) | Low
Sharded across N nodes | O(cardinality / N) | Higher (scatter-gather + merge) | Higher (N nodes)
Partitioned by tenant | O(cardinality / tenants) | Low within partition | Higher (many instances)

The sharding approach scales the index horizontally but trades single-node lookup speed for distributed coordination. For most deployments, partitioning by tenant or team is simpler and sufficient.
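The scatter-gather pattern can be sketched as follows. This is a simplified in-process model, not VictoriaMetrics's actual routing: the crc32-based shard_for function, the shard count, and the host names are all assumptions for illustration.

```python
import re
import zlib

def shard_for(series_key: str, num_shards: int) -> int:
    """Pick a shard with a stable hash (crc32 here; real systems use their own scheme)."""
    return zlib.crc32(series_key.encode()) % num_shards

def scatter_gather(shards: list[list[str]], pattern: str) -> list[str]:
    """Search every shard's local data, then merge the partial results."""
    regex = re.compile(pattern)
    # fullmatch mirrors PromQL regex matchers, which are anchored at both ends.
    partials = [[inst for inst in shard if regex.fullmatch(inst)] for shard in shards]
    return sorted(inst for partial in partials for inst in partial)

NUM_SHARDS = 4
instances = [f"host-{i}" for i in range(10_000)] + ["db-1", "db-2"]
shards: list[list[str]] = [[] for _ in range(NUM_SHARDS)]
for inst in instances:
    shards[shard_for(inst, NUM_SHARDS)].append(inst)

matches = scatter_gather(shards, r"host-.*")
print(len(matches))   # 10000
```

Each shard holds only part of the instance list, but every query has to touch all of them and wait for the slowest one before merging.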

Caching posting lists

Some TSDBs (including VictoriaMetrics) cache frequently-accessed posting lists in memory. If the same high-cardinality query runs repeatedly (e.g., a dashboard auto-refreshing every 30 seconds), the posting list intersection result is cached and subsequent queries skip the index entirely.

This doesn’t solve the cold-query problem (first query is still slow), but it makes dashboards fast for repeated access patterns. VictoriaMetrics exposes the cache hit rate via vm_cache_requests_total and vm_cache_misses_total — monitor these to know whether caching is helping.
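Given the increase of those two counters over some window, the hit ratio follows directly. The numbers below are hypothetical, and cache_hit_ratio is an illustrative helper rather than part of any VictoriaMetrics tooling.

```python
def cache_hit_ratio(requests: float, misses: float) -> float:
    """Fraction of cache lookups served from memory; 0.0 when there were no lookups."""
    if requests <= 0:
        return 0.0
    return 1.0 - misses / requests

# Hypothetical counter increases over one dashboard refresh window.
print(cache_hit_ratio(requests=1200, misses=60))   # 0.95
```

A ratio that stays low for a dashboard-heavy workload suggests the queries are not repeating closely enough for the cache to help.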

Summary

Strategy | When to use | Reduces cardinality? | Reduces query cost?
--- | --- | --- | ---
Remove dimension from labels | Dimension has no aggregate value | Yes | Yes
Drop unused labels at ingestion | Labels never queried | Yes | Yes
Short retention for high-card metrics | Need per-entity data temporarily | Limits growth | Limits growth
Recording rules | Dashboards repeatedly aggregate same data | No | Yes
Partition by tenant/team | Irreducible cardinality from org structure | Per-partition | Per-partition
Shard the index (cluster mode) | Single-node index exceeds memory | Per-node | Adds latency
Posting list caching | Repeated queries on hot data | No | Yes (after first hit)

The most impactful strategies are at the top of this list. Get the schema right — keep only labels you aggregate over — and most cardinality problems never arise.

See also

  • VictoriaMetrics — architecture, compression, and operational details
  • TSDB Internals notebook — interactive exploration of compression and the inverted index
  • LSM Compaction — write amplification and how partitioning limits it