Prometheus High-Cardinality Remediation in Practice: Cutting 18.5 Million Series to 1.8 Million - 小栋博客

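Prometheus was using 64 GB of memory with 18.5 million head series, queries took 30 seconds, and alert rule evaluation kept failing. This post records how the high cardinality was located, 7 common traps, application-side path templating, relabeling, native histograms, a sample_limit gate, and VictoriaMetrics as an alternative. Memory came down to 8 GB.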

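Our production Prometheus had reached 64 GB of memory and was still growing. Queries routinely took 30 seconds, and alert rule evaluation failed often. The post-mortem pointed to a classic high-cardinality problem: a handful of labels had blown the number of time series up into the tens of millions. This post records the whole process of locating the problem, fixing it, and keeping it from coming back, and explains where cardinality comes from, how to measure it, and how to cut it.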

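What is cardinality

A Prometheus time series = metric name + one unique combination of label values
For example:
  http_requests_total{method="GET", status="200", path="/api/order/12345"}
  http_requests_total{method="GET", status="200", path="/api/order/12346"}
  http_requests_total{method="GET", status="200", path="/api/order/12347"}

If path also carries the order ID, every order becomes its own series:
  1 million orders × 5 status codes × 4 methods = 20 million series

Each series takes roughly 200-500 bytes of memory (depending on the number of chunks)
20 million × 300 B = 6 GB of memory for head series alone
Real process memory runs higher once indexes and query overhead are included: the instance in this post used 64 GB for 18.5 million series, about 3.5 KB per series
It keeps growing as series accumulate over time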
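
The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, the product of label cardinalities is a worst case (not every combination necessarily occurs), and the 300-byte figure is the estimate used in the text, not a measured constant.

```python
def estimate_series(label_cardinalities):
    """Worst-case series count: the product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total


def estimate_head_memory_gb(series, bytes_per_series=300):
    """Rough head memory in GB, using the per-series estimate from the text."""
    return series * bytes_per_series / 1e9


# Label cardinalities from the example: order IDs leak into `path`.
labels = {"path": 1_000_000, "status": 5, "method": 4}
series = estimate_series(labels)
print(f"{series:,} series, ~{estimate_head_memory_gb(series):.0f} GB of head memory")
```

Swapping the `path` cardinality for the number of route templates (say, a few dozen) shows how much of the total comes from that one label.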

Locating the problem

# 1. Check Prometheus's own metrics
$ curl -s 'http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series'
# Result: prometheus_tsdb_head_series{instance="prometheus:9090"} 18500000
# 18.5 million series! Far over budget (healthy value: < 2 million)

# 2. Count series per job
#    Note: {__name__=~".+"} touches every series, so these queries are themselves expensive
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
    'query=count by (job)({__name__=~".+"})'
# Result:
#   {job="app-backend"} 12300000   ← the culprit
#   {job="kube-state-metrics"} 3200000
#   {job="node-exporter"} 1500000

# 3. Find the metrics with the most series
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
    'query=topk(20, count by (__name__)({__name__=~".+"}))'
# Result:
#   http_request_duration_seconds_bucket  4800000
#   http_requests_total                    3200000
#   db_query_duration_seconds_bucket       2100000

# 4. Check the cardinality of a single label on a metric
$ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
    'query=count(count by (path) (http_requests_total))'
# Result: 985000  ← the path label has 985,000 distinct values!
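
The curl calls above return JSON rather than the condensed lines shown in the comments. As a sketch of how the per-job counts could be ranked in a script, the snippet below parses a response body in the instant-vector shape of `/api/v1/query`. The `SAMPLE_BODY` payload is hypothetical (built from the per-job numbers quoted above, with a made-up timestamp), and `rank_series_counts` is an illustrative helper, not part of any Prometheus client.

```python
import json

# Hypothetical response body shaped like /api/v1/query output for an
# instant vector; the counts are the per-job numbers quoted above.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node-exporter"}, "value": [1710000000, "1500000"]},
            {"metric": {"job": "app-backend"}, "value": [1710000000, "12300000"]},
            {"metric": {"job": "kube-state-metrics"}, "value": [1710000000, "3200000"]},
        ],
    },
})


def rank_series_counts(body, label):
    """Return (label value, series count) pairs, largest first."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    rows = [
        (sample["metric"].get(label, ""), int(float(sample["value"][1])))
        for sample in payload["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_series_counts(SAMPLE_BODY, "job")
total = sum(count for _, count in ranked)
for name, count in ranked:
    # Print each job with its share of the series covered by this result.
    print(f"{name:<22}{count:>12,}{count / total:>8.1%}")
```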

Common high-cardinality traps

1. Using an ID as a label
   user_login_total{user_id="12345"}        ← user_id runs into the millions
   Fix: keep it out of labels; put it in traces / logs

2. Using the URL as a label (without templating)
   http_requests{path="/order/12345/item/678"}
   Fix: template it as path="/order/:id/item/:itemId"

3. Using the error message as a label
   db_error{message="connection refused on tcp 10.0.1.23:5432"}
   Fix: use an error code / type, not the message

4. Using a timestamp / random string as a label
   job_run{job="backup", run_id="abc-1234-..."}  ← a new series on every run
   Fix: run_id goes into logs; the metric keeps only the job name

5. Exposing the internal trace_id as a label
   Fix: do not expose trace_id; keep it in traces / logs

6. Combinatorial explosion across many dimensions
   http_requests{method, status, path, user_agent, region, ...}
   = 5 × 50 × 100 × 1000 × 10 = 250 million
   Fix: remove low-value labels

7. Too many histogram buckets
   buckets = [0.005, 0.01, 0.025, 0.05, ..., 100]  18 buckets
   With 50 paths × 5 statuses = 50 × 5 × 18 = 4500 series
   Fix: keep the bucket count within 10
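
Trap 3 is the one that usually needs a little code on the application side. A minimal sketch of mapping free-form error messages onto a small, fixed set of types before they reach a label is shown below. The pattern table and the type names are hypothetical and would need to reflect the errors a real service produces; the point is only that the label's value set stays bounded, with everything unrecognized falling into `other`.

```python
import re

# Hypothetical mapping from message patterns to a fixed set of error types.
ERROR_TYPES = [
    (re.compile(r"connection refused"), "connection_refused"),
    (re.compile(r"timeout|timed out"), "timeout"),
    (re.compile(r"deadlock"), "deadlock"),
]


def error_type(message):
    """Classify an error message into a bounded label value."""
    lowered = message.lower()
    for pattern, name in ERROR_TYPES:
        if pattern.search(lowered):
            return name
    # Anything unrecognized shares one bucket, so new messages never add series.
    return "other"


# Messages that differ only by address collapse into the same label value.
print(error_type("connection refused on tcp 10.0.1.23:5432"))
print(error_type("connection refused on tcp 10.0.9.77:5432"))
print(error_type("relation \"orders_2024\" does not exist"))
```

The full message still belongs in the log line for that error; the metric only needs to answer "how many errors of which kind".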

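Locating tool: TSDB analysis

# promtool analyzes a persisted block (the most recent one if no block ID is given)
# The second argument is a block ID, not a path
$ promtool tsdb analyze /prometheus/data 01H...

# Output: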
Highest cardinality labels:
  4823000 path
   985000 trace_id
   122000 user_id
    32000 instance
       50 status_code
        5 method

Highest cardinality metric names:
  http_request_duration_seconds_bucket  4823000
  http_requests_total                   1500000
  active_connections                       8000

Label pairs with highest cardinality:
  path=/api/v1/order/12345                 1500
  path=/api/v1/user/profile/abc            1200
  ...

# Now the problem is clear: the path label carries order IDs, which accounts for about 4.8 million series
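
To turn a report like this into a recurring check, the label section can be parsed and compared against a threshold. The sketch below assumes the layout of the excerpt shown above (a section header followed by `count name` lines and ended by a blank line); real promtool output may differ between versions, and both the `labels_over` helper and the 100,000 threshold are illustrative choices.

```python
# Excerpt of the report above, used as sample input.
REPORT = """\
Highest cardinality labels:
  4823000 path
   985000 trace_id
   122000 user_id
    32000 instance
       50 status_code
        5 method

Highest cardinality metric names:
  http_request_duration_seconds_bucket  4823000
"""


def labels_over(report, threshold):
    """Return (label, distinct values) pairs above the threshold."""
    offenders = []
    in_section = False
    for line in report.splitlines():
        if line.startswith("Highest cardinality labels:"):
            in_section = True
            continue
        if in_section:
            parts = line.split()
            if len(parts) != 2 or not parts[0].isdigit():
                break  # a blank line or the next header ends the section
            if int(parts[0]) > threshold:
                offenders.append((parts[1], int(parts[0])))
    return offenders


for label, count in labels_over(REPORT, 100_000):
    print(f"{label}: {count:,} distinct values")
```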

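Fix 1: template the path in the application

// Spring Boot with Micrometer, which templates the path automatically
// Wrong: a hand-rolled counter tagged with the raw request URI
@RestController
public class OrderController {
    private final MeterRegistry registry;

    public OrderController(MeterRegistry registry) {
        this.registry = registry;   // injected by Spring
    }

    @GetMapping("/api/order/{id}")
    public Order getOrder(@PathVariable Long id, HttpServletRequest req) {
        Counter.builder("http.requests")
            .tag("path", req.getRequestURI())   // /api/order/12345 ← wrong: one series per order
            .register(registry).increment();
        ...
    }
}

// Right: use Spring's default metric (http.server.requests)
// It replaces the path with the route template /api/order/{id}
@RestController
public class OrderController {
    @GetMapping("/api/order/{id}")
    public Order getOrder(@PathVariable Long id) {
        // No manual instrumentation needed; Micrometer auto-configuration handles it
        // metric: http_server_requests_seconds{uri="/api/order/{id}",method="GET",status="200"}
    }
}

# application.yml
# Note: percentiles-histogram publishes a bucket series per boundary,
# so cap the range with maximum-expected-value to keep the bucket count down
management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      maximum-expected-value:
        http.server.requests: 10s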

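// Go: same idea, use the route template from gorilla/mux
import (
    "net/http"
    "strconv"

    "github.com/gorilla/mux"
)

// responseWriter records the status code written by the wrapped handler.
type responseWriter struct {
    http.ResponseWriter
    status int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.status = code
    rw.ResponseWriter.WriteHeader(code)
}

// httpRequests is assumed to be a *prometheus.CounterVec with labels method, status, path.
// Register this with router.Use so the matched route is available.
func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        path := "unmatched" // fallback keeps unknown URLs in a single series
        if route := mux.CurrentRoute(r); route != nil { // nil when no route matched
            if tpl, err := route.GetPathTemplate(); err == nil {
                path = tpl // /api/order/{id}
            }
        }

        rw := &responseWriter{ResponseWriter: w, status: 200}
        next.ServeHTTP(rw, r)

        // Use the template, not the actual path
        httpRequests.WithLabelValues(r.Method, strconv.Itoa(rw.status), path).Inc()
    })
}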
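
Not every framework exposes the matched route template. As a fallback, the path can be normalized segment by segment before it is used as a label value. The sketch below is one hedged way to do that; `normalize_path` and the `:id` / `:uuid` placeholders are illustrative names, and it only recognizes purely numeric segments and canonical UUIDs, so slugs or other opaque tokens would need their own rules.

```python
import re

NUMERIC_RE = re.compile(r"[0-9]+")
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_path(path):
    """Replace ID-like path segments with placeholders; drop the query string."""
    path = path.split("?", 1)[0]
    segments = []
    for segment in path.split("/"):
        if NUMERIC_RE.fullmatch(segment):
            segments.append(":id")
        elif UUID_RE.fullmatch(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/order/12345/item/678"))
print(normalize_path("/api/v1/user/profile?tab=orders"))
```

Route templates from the framework remain the better source when they are available, because they cannot miss an ID format the regexes do not know about.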

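Fix 2: use relabeling to drop or rewrite dirty labels

# prometheus.yml: add metric_relabel_configs to the scrape config
# Caveat: relabeling does not aggregate. If dropping or rewriting a label leaves
# several samples in one scrape with identical label sets, Prometheus keeps one and
# rejects the rest as duplicates, so treat these rules as a stopgap until the app is fixed.
scrape_configs:
  - job_name: app-backend
    static_configs:
      - targets: ['app:8080']
    metric_relabel_configs:
      # Drop the trace_id label (high cardinality, should never have been exposed)
      - regex: 'trace_id'
        action: labeldrop

      # If path still contains a numeric ID, replace it with :id
      # (.* is greedy, so only the last numeric segment of a path is rewritten)
      - source_labels: [path]
        regex: '(/api/.*/)([0-9]+)(/?.*)'
        target_label: path
        replacement: '${1}:id${3}'

      # Drop a whole metric (for a high-cardinality metric you do not need for now)
      - source_labels: [__name__]
        regex: 'old_high_cardinality_metric'
        action: drop

      # Collapse status into classes (2xx/3xx/4xx/5xx);
      # values that do not match the regex are left unchanged
      - source_labels: [status]
        regex: '(2|3|4|5)[0-9]{2}'
        target_label: status
        replacement: '${1}xx'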
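
Relabeling regexes are easy to get subtly wrong, so it helps to try them on sample values before deploying. The sketch below imitates a `replace` rule in Python under two assumptions that match documented Prometheus behavior: the regex must match the whole value, and a non-matching value is left unchanged. `relabel_replace` is an illustrative helper, and Python's `re` is only an approximation of the RE2 engine Prometheus uses, although these particular patterns behave the same in both.

```python
import re


def relabel_replace(value, regex, replacement):
    """Imitate a Prometheus `replace` rule applied to a single label value."""
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match is None:
        return value  # no match: the label is left unchanged
    # Translate ${1}-style group references into Python's \g<1> form.
    template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    return match.expand(template)


PATH_REGEX = r"(/api/.*/)([0-9]+)(/?.*)"
STATUS_REGEX = r"(2|3|4|5)[0-9]{2}"

print(relabel_replace("/api/order/12345", PATH_REGEX, "${1}:id${3}"))
# With two numeric segments, the greedy .* leaves the first ID in place.
print(relabel_replace("/api/order/12345/item/678", PATH_REGEX, "${1}:id${3}"))
print(relabel_replace("404", STATUS_REGEX, "${1}xx"))
```

The second call shows why the path rule above is only a partial fix for nested resources: the first ID survives, so that path still produces one series per order.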

Fix 3: switch histograms to native histograms or summaries

# Prometheus 2.40+ supports native histograms (experimental, behind a feature flag)
# A single series carries the whole distribution; no separate series per bucket

# Application side (Go)
import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var httpDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request latency",
        NativeHistogramBucketFactor: 1.1,        // automatic bucketing
        NativeHistogramMaxBucketNumber: 100,
        NativeHistogramMinResetDuration: time.Hour,
    },
    []string{"method", "path", "status"},
)

# Scrape config: in Prometheus 2.x no per-job option is needed here,
# native histograms are switched on by the startup flag below
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:8080']

# When starting Prometheus:
$ prometheus --enable-feature=native-histograms
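
The saving from native histograms can be estimated from how classic histograms are exposed: per label combination there is one `_bucket` series per explicit boundary, one more for `le="+Inf"`, plus `_sum` and `_count`, whereas a native histogram is a single series. The sketch below applies that to the 50 paths × 5 statuses × 18 buckets example from the traps section, assuming the 18 listed boundaries do not already include `+Inf`; the function names are illustrative.

```python
def classic_histogram_series(label_combinations, explicit_buckets):
    """Series for a classic histogram: buckets, +Inf bucket, _sum and _count."""
    per_combination = explicit_buckets + 1 + 2
    return label_combinations * per_combination


def native_histogram_series(label_combinations):
    """A native histogram is one series per label combination."""
    return label_combinations


combinations = 50 * 5  # paths x statuses from the earlier example
classic = classic_histogram_series(combinations, explicit_buckets=18)
native = native_histogram_series(combinations)
print(f"classic: {classic} series, native: {native} series")
```

The 4500 figure quoted earlier counts only the explicit `_bucket` series; with `+Inf`, `_sum` and `_count` included the classic total is somewhat higher.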

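Fix 4: cardinality limits (a hard gate)

# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

# Per-job scrape limits
scrape_configs:
  - job_name: app
    sample_limit: 50000              # max samples per scrape of one target (after metric relabeling)
    target_limit: 100                 # max targets in this job
    label_limit: 30                   # max labels per sample
    label_name_length_limit: 200
    label_value_length_limit: 200

# Retention bounds disk usage rather than memory; in Prometheus 2.x it is set
# with startup flags, not in prometheus.yml
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=200GB
--storage.tsdb.max-block-duration=6h

# Query guards, also startup flags
--query.max-samples=50000000
--query.timeout=2m

# Once a limit is hit, Prometheus logs a warning
level=warn ts=2024-03-15T... msg="Some samples were dropped"
   target=app:8080 reason="sample_limit exceeded"

# When sample_limit is exceeded the whole scrape of that target is treated as failed,
# not just the high-cardinality metric, which forces the owning team to fix it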
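
Because exceeding `sample_limit` fails the whole scrape, it is worth catching an oversized `/metrics` page before it reaches production. A minimal sketch of such a pre-deploy check is below: in the Prometheus text exposition format every non-empty line that is not a `#` comment is one sample, so counting those lines approximates what the limit will see (before any metric relabeling). `count_samples`, `check_sample_limit` and the sample payload are illustrative.

```python
# A small exposition payload used as sample input.
SAMPLE_EXPOSITION = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/order/{id}",status="200"} 1027
http_requests_total{method="GET",path="/api/order/{id}",status="500"} 3
http_requests_total{method="POST",path="/api/order",status="200"} 88
"""


def count_samples(exposition_text):
    """Count sample lines, skipping blanks and # HELP / # TYPE comments."""
    count = 0
    for line in exposition_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def check_sample_limit(exposition_text, sample_limit):
    """Return (within_limit, sample_count) for one scrape payload."""
    samples = count_samples(exposition_text)
    return samples <= sample_limit, samples


ok, samples = check_sample_limit(SAMPLE_EXPOSITION, sample_limit=50000)
print(f"{samples} samples, within limit: {ok}")
```

In a CI job the payload would come from the service's own metrics endpoint after a load test, so that label values have had a chance to fan out.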

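Fix 5: replace Prometheus with VictoriaMetrics

# VictoriaMetrics is compatible with the Prometheus protocols but far more memory-efficient
# With the same 20 million series (figures as reported in this post):
#   Prometheus: 60GB of memory
#   VictoriaMetrics: 8GB of memory

# Start a single-node VM with docker-compose
services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.97.0
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/storage
    command:
      - '--storageDataPath=/storage'
      - '--retentionPeriod=15d'
      - '--memory.allowedPercent=70'
      - '--maxLabelsPerTimeseries=30'
      - '--search.maxQueryLen=16384'

# The named volume must be declared at the top level
volumes:
  vm-data:

# In Grafana, choose Prometheus as the data source type and set the URL to http://victoria-metrics:8428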

Query optimization

# Bad: grouping by several labels, including high-cardinality ones
# Over a two-week dashboard range this touches about 1 million series and blows up memory
sum(rate(http_requests_total[5m])) by (path, user_id, region)

# Better: aggregate down to the low-cardinality labels you actually need
sum(rate(http_requests_total[5m])) by (path)

# Use recording rules to precompute frequently used queries
# rules.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, path, status) (rate(http_requests_total[5m]))
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum by (job, path, le) (rate(http_request_duration_seconds_bucket[5m])))

# At query time, read the precomputed result directly
sum(job:http_requests:rate5m{job="app"})
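
Why the `by (...)` clause matters can be seen with a toy model of aggregation: the number of output series equals the number of distinct value combinations of the grouping labels. The sketch below is a plain-Python imitation of `sum by (...)` over a handful of hypothetical samples, not how Prometheus evaluates queries internally; the input still has to be read in full either way, which is what recording rules avoid at dashboard time.

```python
from collections import defaultdict

# Hypothetical per-series rates: (labels, value).
SAMPLES = [
    ({"path": "/api/order/{id}", "user_id": "1", "region": "eu"}, 2.0),
    ({"path": "/api/order/{id}", "user_id": "2", "region": "eu"}, 3.0),
    ({"path": "/api/order/{id}", "user_id": "3", "region": "us"}, 1.0),
    ({"path": "/api/cart", "user_id": "1", "region": "eu"}, 4.0),
]


def sum_by(samples, group_labels):
    """Sum values per distinct combination of the grouping labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in group_labels)
        totals[key] += value
    return dict(totals)


wide = sum_by(SAMPLES, ["path", "user_id", "region"])
narrow = sum_by(SAMPLES, ["path"])
print(len(wide), "output series when grouping by path, user_id, region")
print(len(narrow), "output series when grouping by path")
```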

Continuous cardinality monitoring

# Treat cardinality as a monitored KPI (rules file; the first group name is illustrative)
groups:
  - name: cardinality_alerts
    rules:
      - alert: PrometheusHighCardinality
        expr: prometheus_tsdb_head_series > 5000000
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: 'Prometheus head series > 5M, check cardinality'

      # Note: this expression scans every series on each evaluation
      - alert: PrometheusMetricHighCardinality
        expr: |
          topk(1, count by (__name__) ({__name__=~".+"})) > 200000
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: 'Single metric has > 200k series: {{ $labels.__name__ }}'

  # Weekly report: cardinality top 10
  - name: weekly_cardinality_report
    rules:
      - record: weekly:metric_cardinality:top10
        expr: topk(10, count by (__name__) ({__name__=~".+"}))
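
A top-10 list shows which metrics are large, but a weekly review is usually more interested in which ones grew. A minimal sketch of that comparison, given two snapshots of per-metric series counts, is below. The snapshot numbers and the `new_feature_events_total` metric are hypothetical, and `top_growth` is an illustrative helper; in practice the snapshots would come from the recorded series or from the per-metric count query shown earlier.

```python
def top_growth(previous, current, n=10):
    """Rank metrics by the increase in series count between two snapshots."""
    growth = {
        name: count - previous.get(name, 0)  # metrics new this week count in full
        for name, count in current.items()
    }
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical snapshots of per-metric series counts, one week apart.
last_week = {
    "http_requests_total": 150_000,
    "db_query_duration_seconds_bucket": 90_000,
}
this_week = {
    "http_requests_total": 152_000,
    "db_query_duration_seconds_bucket": 240_000,
    "new_feature_events_total": 30_000,
}

for name, delta in top_growth(last_week, this_week, n=3):
    print(f"{name}: {delta:+,} series")
```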

Results after remediation

Metric                       Before       After        Change
=============================================================
Head series                  18.5M        1.8M         -90%
Prometheus memory            64GB         8GB          -88%
Query p99 latency            28s          850ms        -97%
TSDB disk                    420GB        45GB         -89%
Alert evaluation failures    15%          0%
Grafana dashboard load       8-15s        1-2s
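
The Change column can be checked from the before and after values with a simple percentage-change formula. The snippet below does that for the four rows that report a percentage; the row names and the unit choices (series in millions, latency in seconds) are just how the table values were transcribed for this sketch.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# (before, after) pairs taken from the table; units noted per row.
rows = {
    "head series (millions)": (18.5, 1.8),
    "memory (GB)": (64, 8),
    "query p99 (s)": (28, 0.85),
    "TSDB disk (GB)": (420, 45),
}

for name, (before, after) in rows.items():
    print(f"{name:<24}{pct_change(before, after):6.1f}%")
```

The memory row comes out at exactly -87.5%, which the table rounds to -88%.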

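Checklist for avoiding the traps

IDs / tokens / UUIDs / trace_id never go into labels
URL paths must be templated and must not carry numeric IDs
Error messages do not go into labels; use error_code or error_type
Keep the number of labels within 10 and the number of buckets within 10
Add sample_limit / label_limit to scrape configs as a hard gate
Every new metric must go through a cardinality review
Monitor head_series permanently, with alert thresholds set according to business scale
Run top-N cardinality checks regularly (promtool tsdb analyze)
Query optimization: precompute with recording rules and limit the query time range
When the scale gets too large, consider VictoriaMetrics / Mimir / Thanos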

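Reflections

Prometheus is an honest tool: whatever labels you hand it, it stores the resulting series, and by default it puts up no defense of its own. Almost every team new to Prometheus falls into the cardinality trap and then gradually learns which labels to add and which to leave out. The biggest gain from this round of remediation was not the recovered performance but the team habit of thinking about cardinality before adding a label. The design philosophy behind Prometheus is that metrics are high-frequency aggregate data, while traces and logs hold the high-cardinality detail. Putting each kind of data where it belongs is what keeps the system running in the long term.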

—— 别看了 · 2026


