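Our production Prometheus was using 64 GB of memory and still growing, queries routinely took 30 seconds, and alerting rule evaluations kept failing. The postmortem pointed to a classic high-cardinality problem: a handful of labels had blown the number of time series up into the tens of millions. This post records the whole process of diagnosing it, fixing it, and keeping it from coming back, and explains where cardinality comes from, how to measure it, and how to cut it.
What is cardinality
A Prometheus time series is identified by the metric name plus one unique combination of label values.
For example: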
http_requests_total{method="GET", status="200", path="/api/order/12345"}
http_requests_total{method="GET", status="200", path="/api/order/12346"}
http_requests_total{method="GET", status="200", path="/api/order/12347"}
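If path carries the order ID, every order becomes its own series:
1 million orders × 5 status codes × 4 methods = 20 million series
Each series takes roughly 200-500 bytes of memory (depending on the number of chunks)
20 million × 300 B = 6 GB of memory for the head series alone
It keeps growing as series accumulate over time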
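The arithmetic above can be reproduced with a few lines of Go. This is only a sketch: the function names are made up for illustration, the product is a worst case that assumes every label combination actually occurs, and the flat 300 bytes per series is the rough figure from the text, not a measured value.

```go
package main

import "fmt"

// estimateSeries multiplies the number of distinct values of each label.
// It is a worst-case bound: it assumes every combination really shows up.
func estimateSeries(labelValueCounts ...int64) int64 {
	total := int64(1)
	for _, n := range labelValueCounts {
		total *= n
	}
	return total
}

// estimateHeadBytes applies a flat per-series cost (an assumption, not a measurement).
func estimateHeadBytes(series, bytesPerSeries int64) int64 {
	return series * bytesPerSeries
}

func main() {
	// order IDs × status codes × methods, as in the example above
	series := estimateSeries(1_000_000, 5, 4)
	headBytes := estimateHeadBytes(series, 300)
	fmt.Printf("series=%d head=%.1f GB\n", series, float64(headBytes)/1e9)
}
```

Running the same function with a templated path (say 50 route templates instead of 1 million order IDs) makes the difference obvious before any metric ships.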
Locating the problem
# 1. Check Prometheus's own metrics
$ curl -s 'http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series'
# Result (simplified): prometheus_tsdb_head_series{instance="prometheus:9090"} 18500000
# 18.5 million series, far above the roughly 2 million we treat as healthy
# 2. Count series per job
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=count by (job)({__name__=~".+"})'
# Result:
# {job="app-backend"} 12300000 ← the culprit
# {job="kube-state-metrics"} 3200000
# {job="node-exporter"} 1500000
# 3. Find the metrics with the most series
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=topk(20, count by (__name__)({__name__=~".+"}))'
# Result:
# http_request_duration_seconds_bucket 4800000
# http_requests_total 3200000
# db_query_duration_seconds_bucket 2100000
# 4. Check the cardinality of a single label on a metric
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=count(count by (path) (http_requests_total))'
# Result: 985000 ← the path label has about 985,000 distinct values!
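The last query, count(count by (path)(...)), simply counts the distinct values of one label. The same idea can be sketched offline in Go for every label at once, for example over series dumped from an application's /metrics output. The Series type, the exampleSeries data, and labelCardinality are illustrative names, not part of any Prometheus API.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is one time series represented as label name -> label value.
type Series map[string]string

// exampleSeries mirrors the three http_requests_total series shown earlier.
var exampleSeries = []Series{
	{"method": "GET", "status": "200", "path": "/api/order/12345"},
	{"method": "GET", "status": "200", "path": "/api/order/12346"},
	{"method": "GET", "status": "200", "path": "/api/order/12347"},
}

// labelCardinality returns, for each label name, how many distinct values it takes.
func labelCardinality(series []Series) map[string]int {
	seen := map[string]map[string]struct{}{}
	for _, s := range series {
		for name, value := range s {
			if seen[name] == nil {
				seen[name] = map[string]struct{}{}
			}
			seen[name][value] = struct{}{}
		}
	}
	counts := make(map[string]int, len(seen))
	for name, values := range seen {
		counts[name] = len(values)
	}
	return counts
}

func main() {
	counts := labelCardinality(exampleSeries)
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	// Highest cardinality first; ties broken by name for stable output.
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	for _, name := range names {
		fmt.Printf("%-8s %d\n", name, counts[name])
	}
}
```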
Common high-cardinality traps
1. Using an ID as a label
user_login_total{user_id="12345"} ← user_id runs into the millions
Fix: keep it out of labels; put it in traces / logs
2. Using the raw URL as a label (no templating)
http_requests{path="/order/12345/item/678"}
Fix: template it, path="/order/:id/item/:itemId"
3. Using the error message as a label
db_error{message="connection refused on tcp 10.0.1.23:5432"}
Fix: use an error code / type, not the message
4. Using a timestamp / random string as a label
job_run{job="backup", run_id="abc-1234-..."} ← a new series on every run
Fix: run_id goes to the logs; the metric keeps only the job name
5. Exposing an internal trace_id as a label
Fix: do not expose trace_id
6. Combinatorial explosion across many dimensions
http_requests{method, status, path, user_agent, region, ...}
= 5 × 50 × 100 × 1000 × 10 = 250 million
Fix: drop low-value labels
7. Too many histogram buckets
buckets = [0.005, 0.01, 0.025, 0.05, ..., 100], 18 buckets
With 50 paths × 5 statuses: 50 × 5 × 18 = 4500 series
Fix: keep the bucket count at 10 or fewer
Diagnostic tool: tsdb analysis
# promtool analyzes a persisted block (the most recent one if no block ID is given)
$ promtool tsdb analyze /prometheus/data 01H...
# Output:
Highest cardinality labels:
4823000 path
985000 trace_id
122000 user_id
32000 instance
50 status_code
5 method
Highest cardinality metric names:
http_request_duration_seconds_bucket 4823000
http_requests_total 1500000
active_connections 8000
Label pairs with highest cardinality:
path=/api/v1/order/12345 1500
path=/api/v1/user/profile/abc 1200
...
# Now the problem is clear: the path label carries the order ID and produces about 4.8 million series
Fix 1: template the path in the application
// Spring Boot: Micrometer templates the path automatically
// Wrong: the raw request URI is used as a tag value
@RestController
public class OrderController {
    private final MeterRegistry registry;

    public OrderController(MeterRegistry registry) {
        this.registry = registry;
    }

    @GetMapping("/api/order/{id}")
    public Order getOrder(@PathVariable Long id, HttpServletRequest req) {
        Counter.builder("http.requests")
            .tag("path", req.getRequestURI()) // /api/order/12345 ← wrong: one series per order
            .register(registry).increment();
        ...
    }
}
// Right: rely on Spring's default metric (http.server.requests)
// It replaces the path with the route template /api/order/{id}
@RestController
public class OrderController {
    @GetMapping("/api/order/{id}")
    public Order getOrder(@PathVariable Long id) {
        // No manual instrumentation needed; Micrometer auto-configuration handles it
        // metric: http_server_requests_seconds{uri="/api/order/{id}",method="GET",status="200"}
    }
}
# application.yml
management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      maximum-expected-value:
        http.server.requests: 10s
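// Go: same idea, use the mux route template
import (
	"net/http"
	"strconv"

	"github.com/gorilla/mux"
)

// httpRequests (a CounterVec with labels method, status, path) and
// responseWriter (a wrapper that records the status code) are assumed to be defined elsewhere.
func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Fall back to a fixed value when no route matched, so unknown URLs cannot create new series
		path := "unmatched"
		if route := mux.CurrentRoute(r); route != nil {
			if tpl, err := route.GetPathTemplate(); err == nil {
				path = tpl // /api/order/{id}
			}
		}
		rw := &responseWriter{ResponseWriter: w, status: 200}
		next.ServeHTTP(rw, r)
		httpRequests.WithLabelValues(
			r.Method,
			strconv.Itoa(rw.status),
			path, // the template, not the actual path
		).Inc()
	})
}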
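When a router does not expose route templates, a fallback is to normalize the raw path before using it as a label value. The following is a minimal sketch under that assumption; normalizePath and the two patterns are illustrative, and real traffic may need more patterns (hashes, slugs, and so on).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// A segment made only of digits, e.g. "12345".
	numericSegment = regexp.MustCompile(`^[0-9]+$`)
	// A canonical 8-4-4-4-12 hex UUID.
	uuidSegment = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)
)

// normalizePath replaces every ID-looking path segment with ":id"
// so that the label value stays within a bounded set.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, seg := range segments {
		if numericSegment.MatchString(seg) || uuidSegment.MatchString(seg) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	for _, p := range []string{
		"/order/12345/item/678",
		"/api/v1/user/profile",
		"/files/123e4567-e89b-12d3-a456-426614174000",
	} {
		fmt.Println(p, "->", normalizePath(p))
	}
}
```

Route templates from the router remain the better source, because a heuristic like this cannot tell an ID from a legitimate numeric path segment.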
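Fix 2: drop dirty labels with relabeling
# prometheus.yml: add metric_relabel_configs to the scrape_config
# Treat this as a stopgap: relabeling does not merge series. If several series of one scrape
# end up with identical labels, the duplicates are rejected rather than summed,
# so the real fix still belongs in the application.
scrape_configs:
  - job_name: app-backend
    static_configs:
      - targets: ['app:8080']
    metric_relabel_configs:
      # Drop the trace_id label (high cardinality, should never have been exposed)
      - regex: 'trace_id'
        action: labeldrop
      # If path still contains a numeric ID, rewrite it to :id
      # (.* is greedy, so only the last numeric segment is rewritten)
      - source_labels: [path]
        regex: '(/api/.*/)([0-9]+)(/?.*)'
        target_label: path
        replacement: '${1}:id${3}'
      # Drop a whole metric (temporarily give up a high-cardinality metric)
      - source_labels: [__name__]
        regex: 'old_high_cardinality_metric'
        action: drop
      # Collapse status codes into classes: 200 -> 2xx, 404 -> 4xx, ...
      - source_labels: [status]
        regex: '(2|3|4|5)[0-9]{2}'
        target_label: status
        replacement: '${1}xx'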
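Before rolling out a relabel rule it helps to check what the regex really does to sample values. The sketch below replays the path rule in Go. It assumes that Go's regexp package (RE2 syntax) behaves like the Prometheus relabel engine once the pattern is anchored on both ends, which is how Prometheus applies relabel regexes; relabelPath is an illustrative name.

```go
package main

import (
	"fmt"
	"regexp"
)

// The relabel regex from the config, wrapped in ^(?:...)$ because
// Prometheus matches relabel regexes against the whole label value.
var pathRule = regexp.MustCompile(`^(?:(/api/.*/)([0-9]+)(/?.*))$`)

// relabelPath mimics the default "replace" action: if the regex does not
// match, the label value is left unchanged.
func relabelPath(path string) string {
	if !pathRule.MatchString(path) {
		return path
	}
	return pathRule.ReplaceAllString(path, "${1}:id${3}")
}

func main() {
	for _, p := range []string{
		"/api/order/12345",
		"/api/order/12345/item/678",
		"/healthz",
	} {
		fmt.Println(p, "->", relabelPath(p))
	}
}
```

The second sample shows a limitation of the rule: because `.*` is greedy, only the last numeric segment is rewritten, so paths with two IDs still leak the first one into the label.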
Fix 3: switch histograms to native histograms or summaries
# Prometheus 2.40+ supports native histograms (experimental, behind a feature flag)
# One series carries the whole distribution; no separate series per bucket
# Application side (Go)
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var httpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:                            "http_request_duration_seconds",
		Help:                            "HTTP request latency",
		NativeHistogramBucketFactor:     1.1, // buckets are derived automatically
		NativeHistogramMaxBucketNumber:  100,
		NativeHistogramMinResetDuration: time.Hour,
	},
	[]string{"method", "path", "status"},
)
# Prometheus side: in 2.x there is no per-job switch for this; the feature flag below enables it
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:8080']
# Enable the feature when starting Prometheus:
$ prometheus --enable-feature=native-histograms
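To see why this helps, compare the series count of a classic histogram with that of a native one. The sketch below assumes 18 finite buckets and counts everything a classic histogram exposes per label combination: the finite buckets, the +Inf bucket, _sum and _count. It therefore comes out somewhat higher than the bucket-only figure of 4500 in the trap list. The function names are illustrative.

```go
package main

import "fmt"

// classicHistogramSeries counts the series of a classic histogram:
// per label combination, one series per finite bucket, plus the
// le="+Inf" bucket, plus the _sum and _count series.
func classicHistogramSeries(labelCombos, finiteBuckets int) int {
	return labelCombos * (finiteBuckets + 3)
}

// nativeHistogramSeries counts the series of a native histogram:
// one series per label combination, buckets live inside the sample.
func nativeHistogramSeries(labelCombos int) int {
	return labelCombos
}

func main() {
	combos := 50 * 5 // paths × status codes, as in the trap list
	fmt.Println("classic:", classicHistogramSeries(combos, 18))
	fmt.Println("native: ", nativeHistogramSeries(combos))
}
```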
Fix 4: cardinality limits (a hard gate)
# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
# Per-job scrape limits
scrape_configs:
  - job_name: app
    sample_limit: 50000            # max samples accepted per scrape of one target
    target_limit: 100              # max targets in this job
    label_limit: 30                # max labels per series
    label_name_length_limit: 200
    label_value_length_limit: 200
# Retention limits (these bound disk usage, not memory), given as startup flags
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=200GB
--storage.tsdb.max-block-duration=6h
# Query limits (startup flags)
--query.max-samples=50000000
--query.timeout=2m
# After startup, Prometheus reports the violation
level=warn ts=2024-03-15T... msg="Some samples were dropped"
target=app:8080 reason="sample_limit exceeded"
# A target that exceeds sample_limit has its whole scrape rejected, which forces the owning team to fix it
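The same gate can be applied earlier, for example in CI, by counting the samples an application exposes before Prometheus ever scrapes it. The sketch below is one possible shape, assuming the plain Prometheus text exposition format; countSamples, exceedsLimit, and sampleExposition are illustrative names, and the parser is deliberately naive (it only needs the metric name at the start of each line).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleExposition is a tiny hand-written /metrics payload for the demo.
const sampleExposition = `# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/order/1"} 3
http_requests_total{method="GET",path="/api/order/2"} 5
up 1
`

// countSamples returns the number of samples per metric name and in total.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
func countSamples(exposition string) (map[string]int, int) {
	perMetric := map[string]int{}
	total := 0
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' or whitespace.
		end := strings.IndexAny(line, "{ \t")
		if end < 0 {
			end = len(line)
		}
		perMetric[line[:end]]++
		total++
	}
	return perMetric, total
}

// exceedsLimit reports whether a scrape of this payload would go over sampleLimit.
func exceedsLimit(exposition string, sampleLimit int) bool {
	_, total := countSamples(exposition)
	return total > sampleLimit
}

func main() {
	perMetric, total := countSamples(sampleExposition)
	fmt.Println("total samples:", total)
	fmt.Println("http_requests_total:", perMetric["http_requests_total"])
	fmt.Println("over a limit of 2:", exceedsLimit(sampleExposition, 2))
}
```

In a pipeline, the payload would come from the service's /metrics endpoint under a realistic test load, and the job would fail when the count exceeds the budget agreed in the cardinality review.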
Fix 5: replace Prometheus with VictoriaMetrics
# VictoriaMetrics is compatible with the Prometheus protocols but far more memory-efficient
# For the same 20 million series:
# Prometheus: 60 GB of memory
# VictoriaMetrics: 8 GB of memory
# docker-compose for a single-node VM
services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.97.0
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/storage
    command:
      - '--storageDataPath=/storage'
      - '--retentionPeriod=15d'
      - '--memory.allowedPercent=70'
      - '--maxLabelsPerTimeseries=30'
      - '--search.maxQueryLen=16384'
# The named volume used above must be declared at the top level
volumes:
  vm-data: {}
# In Grafana, choose Prometheus as the data source type and set the URL to http://victoria-metrics:8428
Query optimization
# Wrong: a wide time range combined with high-cardinality grouping labels
# This query has to scan 1 million series × 2 weeks of data, and memory explodes
sum(rate(http_requests_total[5m])) by (path, user_id, region)
# Better: aggregate the high-cardinality labels away and keep only the dimensions you need
sum(rate(http_requests_total[5m])) by (path)
# Precompute frequent queries with recording rules
# rules.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, path, status) (rate(http_requests_total[5m]))
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum by (job, path, le) (rate(http_request_duration_seconds_bucket[5m])))
# Query the precomputed result directly
sum(job:http_requests:rate5m{job="app"})
Continuous cardinality monitoring
# Treat cardinality as a KPI and alert on it
- alert: PrometheusHighCardinality
  expr: prometheus_tsdb_head_series > 5000000
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: 'Prometheus head series > 5M, check cardinality'
- alert: PrometheusMetricHighCardinality
  expr: |
    topk(1, count by (__name__) ({__name__=~".+"})) > 200000
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: 'Single metric has > 200k series: {{ $labels.__name__ }}'
# Weekly report: top 10 metrics by cardinality
- name: weekly_cardinality_report
  rules:
    - record: weekly:metric_cardinality:top10
      # A recording rule overwrites __name__, so copy the metric name into a "metric" label first;
      # otherwise the ten results would share one label set and the rule would fail
      expr: label_replace(topk(10, count by (__name__) ({__name__=~".+"})), "metric", "$1", "__name__", "(.+)")
Results after the cleanup
Metric                      Before    After     Change
======================================================
Head series                 18.5M     1.8M      -90%
Prometheus memory           64 GB     8 GB      -88%
Query p99 latency           28 s      850 ms    -97%
TSDB disk usage             420 GB    45 GB     -89%
Alert evaluation failures   15%       0%
Grafana dashboard load      8-15 s    1-2 s
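Pitfall checklist
- IDs / tokens / UUIDs / trace_id never go into labels
- URL paths must be templated; no numeric IDs
- Error messages stay out of labels; use error_code or error_type
- Keep the number of labels at 10 or fewer, and the number of buckets at 10 or fewer
- Add sample_limit / label_limit to every scrape_config as a hard gate
- Every new metric must pass a cardinality review
- Monitor head_series permanently, with alert thresholds sized to your workload
- Run top-N cardinality checks regularly (promtool tsdb analyze)
- Query optimization: precompute with recording rules and bound the query time range
- At very large scale, consider VictoriaMetrics / Mimir / Thanos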
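Reflections
Prometheus is an honest tool: whatever labels you give it, it stores as series, with no active defense by default. Nearly every team new to Prometheus falls into the cardinality trap and then gradually learns which labels to add and which to leave out. The biggest gain from this cleanup was not the restored performance but the team habit of thinking about cardinality before adding a label. Prometheus's design philosophy is that metrics are frequently sampled aggregate data, while traces and logs hold the high-cardinality detail. Put each kind of data where it belongs and the system will keep running for the long haul.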
—— 别看了 · 2026