Taming 8,000 Primary Shards in Elasticsearch: p99 from 5s Down to 200ms - 小栋博客

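More than 2,000 indices, more than 8,000 shards, a 25-minute cluster startup, and small queries that take 5 seconds. This post walks through eight Elasticsearch index-design pitfalls: index explosion, arbitrary shard counts, dynamic fields, norms, refresh, _source, query syntax, and the write path. It also covers the full fix: ILM lifecycle policies, shrink, forcemerge, Bulk, and best_compression.

Our production log search system had been running for six months. It had grown to 2,000+ indices and 8,000+ primary shards, the cluster sat in yellow for a long time after every start, and small queries routinely took 5 seconds. The post-mortem traced it back to an index design that was too "flexible": one daily index per business line, no cap on shard counts, and mappings left fully dynamic. This post records the pitfalls we hit in production ES index design, with measured before-and-after numbers.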

Symptoms

$ curl es-master:9200/_cluster/health?pretty
{
  "cluster_name": "logging-cluster",
  "status": "yellow",
  "number_of_nodes": 9,
  "active_primary_shards": 8123,    ← 8,000+ primary shards
  "active_shards": 14000,            ← close to 15k with replicas
  "unassigned_shards": 200,
  "delayed_unassigned_shards": 0,
  "active_shards_percent_as_number": 98.59
}

$ curl es-master:9200/_cat/indices?v
health  index                          docs   size
green   logs-app-2024.01.01            5GB    5GB
green   logs-app-2024.01.02            5GB    5GB
...
green   logs-app-2024.07.30            5GB    5GB
yellow  logs-app-2024.07.31            5GB    5GB
# A single business line alone has more than 200 daily indices

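Problem 1: Too many indices → bloated cluster metadata

Every index and every shard carries metadata in the cluster state, which the master node maintains. As the index count grows:

The cluster state reaches tens of MB and puts GC pressure on the master node
A newly joined node takes several minutes to sync the cluster state
The shard allocation algorithm slows down; a single reroute takes about 1 minute
Startup recovery takes a long time (every shard has to initialize)

# Check the size of the cluster state
$ curl es-master:9200/_cluster/state?filter_path=metadata.indices | wc -c
84738592   # ~80MB, too large

# Metadata is roughly 10KB per shard, so 8,000 shards = 80MB+
# Rule of thumb: no more than 600 shards per node
# Total cluster shards ≈ node count × 600, so our 9 nodes should stay under 5,400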
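
The budget arithmetic above can be sketched in a few lines of Java. This is only an illustration of the rule of thumb: the class and method names are made up for this post, the 600-per-node and 10KB-per-shard figures are the estimates quoted above, and the sketch assumes the per-node budget counts replicas as well as primaries.

```java
/** Back-of-the-envelope shard budget using the rule-of-thumb numbers quoted above. */
final class ShardBudget {

    /** Upper bound on total shards for the cluster (assumed to include replicas). */
    static long maxClusterShards(int dataNodes, int maxShardsPerNode) {
        return (long) dataNodes * maxShardsPerNode;
    }

    /** Rough cluster-state size in bytes, assuming a fixed metadata cost per shard. */
    static long estimatedStateBytes(long shards, long bytesPerShard) {
        return shards * bytesPerShard;
    }

    public static void main(String[] args) {
        long budget = maxClusterShards(9, 600);   // 9 nodes, 600 shards each
        long activeShards = 14_000;               // from _cluster/health
        System.out.println("budget=" + budget + ", over budget by " + (activeShards - budget));

        long stateBytes = estimatedStateBytes(8_000, 10 * 1024);
        System.out.println("estimated state size: " + stateBytes / (1024 * 1024) + " MiB");
    }
}
```

Run against the numbers in this post, the cluster is well over budget, which matches the symptoms.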

Solution 1: ILM (Index Lifecycle Management)

// Define the lifecycle policy
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb",
            "max_primary_shard_size": "30gb"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },     // merge shards
          "forcemerge": { "max_num_segments": 1 },  // merge segments
          "allocate": { "include": { "data": "warm" } },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "allocate": { "include": { "data": "cold" } },
          "freeze": {}                              // freeze the index (deprecated since 7.14, a no-op on 8.x)
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

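// Attach the policy through an index template
PUT _index_template/logs_template
{
  "index_patterns": ["logs-app-*"],     // scoped to the indices behind the logs-app rollover alias
  "priority": 200,                      // above the built-in logs-*-* template (priority 100)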
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs-app"
    }
  }
}

// Create the initial index and the write alias
PUT logs-app-000001
{
  "aliases": {
    "logs-app": { "is_write_index": true }
  }
}

// Writers use the alias; ES rolls over to the next index automatically
POST logs-app/_doc
{"message": "...", "ts": "..."}

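Problem 2: Arbitrary shard counts

Common mistakes:
- Giving every daily index 5 shards ("in case we scale out")
- A single index is really only 5GB, so 5 shards means 1GB per shard, which is inefficient
- Shards per index × number of indices floods the cluster

Best practices:
- Keep each shard at 20-50GB (logging) / 30GB (search)
- Keep each shard under 2 billion documents (Lucene caps a single shard at about 2.1 billion)
- Fewer than 600 shards per node
- Estimate: total data volume / 30GB = shard count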
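
The estimate in the last bullet can be written as a small helper. This is a minimal sketch, not part of any Elasticsearch API: `ShardSizing` and `primaryShards` are names invented here, and the 30GB target is the figure from the list above. Rounding up and flooring at 1 keeps small indices from ending up with zero shards.

```java
/** Estimates the primary shard count from index size and a target shard size. */
final class ShardSizing {

    /** Primary shards needed so that each shard holds at most targetShardGb; never less than 1. */
    static int primaryShards(double indexSizeGb, double targetShardGb) {
        if (indexSizeGb < 0 || targetShardGb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return Math.max(1, (int) Math.ceil(indexSizeGb / targetShardGb));
    }

    public static void main(String[] args) {
        // A 5GB daily index needs one shard, not five.
        System.out.println(primaryShards(5, 30));
        // A 150GB index fits in five 30GB shards.
        System.out.println(primaryShards(150, 30));
    }
}
```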

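# A _shrink walkthrough
# The old index logs-app-2024.07 has 5 shards; shrink it to 1
# First make it read-only and co-locate a copy of every shard on one node
PUT logs-app-2024.07/_settings
{
  "index.routing.allocation.require._name": "node-1",  // move a copy of every shard to the same node
  "index.blocks.write": true
}

# Wait until relocation has finished before shrinking
POST logs-app-2024.07/_shrink/logs-app-2024.07-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,   // clear the allocation requirement copied from the source
    "index.blocks.write": null                        // clear the write block copied from the source
  }
}

# After the shrink:
# - shards: 5 → 1
# - less metadata overhead
# - queries get faster once the segments are merged (run a force merge afterwards)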
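
One constraint worth checking before calling _shrink: the target shard count has to be a factor of the source shard count. The sketch below lists the valid targets for a given source; the class name `ShrinkTargets` is invented for illustration and nothing here talks to a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Lists the shard counts an index can be shrunk to. */
final class ShrinkTargets {

    /** Factors of sourceShards that are smaller than sourceShards, largest first. */
    static List<Integer> validTargets(int sourceShards) {
        if (sourceShards < 1) {
            throw new IllegalArgumentException("sourceShards must be >= 1");
        }
        List<Integer> targets = new ArrayList<>();
        for (int t = sourceShards - 1; t >= 1; t--) {
            if (sourceShards % t == 0) {
                targets.add(t);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(validTargets(5)); // a prime count can only shrink to 1
        System.out.println(validTargets(8));
    }
}
```

A 5-shard index like the one above can only go to 1, while an 8-shard index could stop at 4 or 2 first.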

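Problem 3: Dynamic mapping field explosion

The problem: every document is free to add fields, and six months later one index had 8,000+ fields
The ES default is index.mapping.total_fields.limit = 1000
Going over it fails with "Limit of total fields [1000] in index [...] has been exceeded"
Raising the limit by force makes the mapping slow to load on every shard and squeezes heap memory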

PUT logs-app-*/_settings
{
  "index.mapping.total_fields.limit": 5000,
  "index.mapping.depth.limit": 20,
  "index.mapping.nested_fields.limit": 100
}

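// But that only treats the symptom. The real fix is to tighten the mapping.
// Note: PUT replaces the whole template, so send these mappings in the same request as the ILM settings rather than as a second PUT
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mapping.total_fields.limit": 1000
    },
    "mappings": {
      "dynamic": "strict",                  // dynamic mapping off, unknown fields are rejected
      "properties": {
        "@timestamp": { "type": "date" },
        "level":      { "type": "keyword" },
        "service":    { "type": "keyword" },
        "host":       { "type": "keyword" },
        "message":    { "type": "text",  "norms": false },
        "fields": {                          // all dynamic fields go under here; they still count toward total_fields.limit
          "type": "object",
          "dynamic": true
        }
      }
    }
  }
}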
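
With a strict mapping, a document that carries an undeclared top-level field is rejected, so writers need to move such fields under `fields` before indexing. A minimal sketch of that normalization step follows; `LogDocNormalizer` is a hypothetical helper, and the set of known fields simply mirrors the mapping above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Moves undeclared top-level keys under "fields" so a strict mapping accepts the document. */
final class LogDocNormalizer {

    // Top-level fields declared in the strict mapping above.
    private static final Set<String> KNOWN =
            Set.of("@timestamp", "level", "service", "host", "message");

    static Map<String, Object> normalize(Map<String, Object> raw) {
        Map<String, Object> doc = new LinkedHashMap<>();
        Map<String, Object> extras = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : raw.entrySet()) {
            String key = e.getKey();
            Object value = e.getValue();
            if (KNOWN.contains(key)) {
                doc.put(key, value);
            } else if ("fields".equals(key) && value instanceof Map) {
                // Already nested: merge instead of producing fields.fields
                for (Map.Entry<?, ?> f : ((Map<?, ?>) value).entrySet()) {
                    extras.put(String.valueOf(f.getKey()), f.getValue());
                }
            } else {
                extras.put(key, value);
            }
        }
        if (!extras.isEmpty()) {
            doc.put("fields", extras);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("level", "ERROR");
        raw.put("message", "disk full");
        raw.put("trace_id", "abc123"); // not in the mapping
        System.out.println(normalize(raw));
    }
}
```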

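Problem 4: Unnecessary fielddata / norms

// text fields have norms enabled by default, which costs memory
// Most logging use cases do not need BM25 relevance scoring, so norms can be turned off
// Mapping fragment: put it under template.mappings in the index template
// (sent as-is, PUT logs-template would create an index literally named logs-template)
PUT logs-template
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "norms": false,           // off, saves memory
        "index_options": "freqs"  // term frequencies only, no positions; phrase queries need positions, so keep the default if you use match_phrase
      },
      "service": {
        "type": "keyword",
        "eager_global_ordinals": true,    // preload ordinals for frequently aggregated fields
        "doc_values": true
      }
    }
  }
}

// Fields that are never searched: index: false saves storage
"user_agent": {
    "type": "keyword",
    "index": false,
    "doc_values": false       // also rules out aggregations
}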

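Problem 5: Refresh interval too short

// ES refreshes every 1 second by default, and each refresh that sees new writes produces a new segment
// Under heavy writes the segment count balloons and merging eats a lot of CPU/IO
PUT logs-*/_settings
{
  "index.refresh_interval": "30s",    // raise to 30 seconds
  "index.translog.durability": "async",   // trade-off: a crash can lose up to sync_interval of acknowledged writes
  "index.translog.sync_interval": "30s",
  "index.translog.flush_threshold_size": "1gb"
}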

// Historical indices (no longer written to)
PUT logs-app-2024.06/_settings
{
  "index.refresh_interval": "-1",    // disable refresh
  "index.number_of_replicas": 0      // drop replicas to save space, only after a snapshot backup
}
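
To put a number on the refresh change: each refresh that sees new writes creates one segment per shard, so the interval sets a ceiling on how many segments a busy shard can produce per hour. The helper below is only that arithmetic; the name `RefreshMath` is made up, and real segment counts will be lower because merges run continuously.

```java
/** Upper bound on refresh-created segments, per shard, for a given refresh interval. */
final class RefreshMath {

    static long maxSegmentsPerHour(int refreshIntervalSeconds) {
        if (refreshIntervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return 3600L / refreshIntervalSeconds;
    }

    public static void main(String[] args) {
        System.out.println("1s refresh:  up to " + maxSegmentsPerHour(1) + " segments/hour/shard");
        System.out.println("30s refresh: up to " + maxSegmentsPerHour(30) + " segments/hour/shard");
    }
}
```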

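Problem 6: Storing the full _source

// By default _source stores the original JSON document, which takes space
// If you never need the original back (pure search), it can be disabled
// _source.enabled is fixed at index creation, so set it in the index template's mappings
// rather than with PUT _mappings on a live index
"mappings": {
  "_source": {
    "enabled": false      // dangerous: no reindex / update / highlight
  }
}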

// Compromise: keep only the key fields (also set at index creation, in the template's mappings)
"mappings": {
  "_source": {
    "includes": ["@timestamp", "level", "service", "message"]
  }
}

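// Recommended: compress _source
// index.codec is a static setting: set it in the index template, or close the index before changing it
PUT logs-*/_settings
{
  "index.codec": "best_compression"    // DEFLATE on older releases, ZSTD on recent 8.x; slower but saves 30-50%
}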

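Problem 7: Inefficient query syntax

Query pattern          Latency (cluster with 4 billion docs)
=====================================
match_phrase           300ms
wildcard *foo*         15s        ← use with caution
regexp                 30s        ← use with extreme caution
prefix                 200ms      (on a keyword field)
exists                 50ms       (served from doc values or _field_names)
script                 slow and not cacheable
terms (≤1000)          100ms
terms (100k)           5s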

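// Wrong: using wildcard to find messages that contain "error"
{
  "query": { "wildcard": { "message": "*error*" } }
}

// Right: rely on index-time tokenization and match the token
// (this matches the whole token "error", not arbitrary substrings)
{
  "query": { "match_phrase": { "message": "error" } }
}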
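
If wildcard queries cannot be banned outright, a cheap guard in the query-building layer can at least reject the worst case: a pattern that starts with `*` or `?` cannot narrow the search by prefix, so many more terms have to be checked. A minimal sketch, where `WildcardGuard` is a hypothetical helper and not part of any client library:

```java
/** Flags wildcard patterns that start with a wildcard character. */
final class WildcardGuard {

    static boolean hasLeadingWildcard(String pattern) {
        if (pattern == null || pattern.isEmpty()) {
            return false;
        }
        char first = pattern.charAt(0);
        return first == '*' || first == '?';
    }

    public static void main(String[] args) {
        System.out.println(hasLeadingWildcard("*error*")); // true
        System.out.println(hasLeadingWildcard("error*"));  // false
    }
}
```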

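// Wrong: cardinality on a large field with default settings
{
  "aggs": { "unique_users": { "cardinality": { "field": "user_id" } } }
}
// cardinality uses HyperLogLog++; the default precision_threshold is 3000, so the error grows on high-cardinality fields

// Right: set precision_threshold
{
  "aggs": { "unique_users": {
    "cardinality": {
      "field": "user_id",
      "precision_threshold": 40000     // higher is more accurate and uses more memory; 40000 is the maximum
    }
  }}
}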
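
The memory side of that trade-off is easy to estimate: the Elasticsearch documentation puts the cost of a cardinality aggregation at roughly precision_threshold × 8 bytes, and that cost is paid per bucket when the aggregation sits under a terms or date_histogram aggregation. The sketch below is just this multiplication; `CardinalityMemory` is an invented name and the result is an approximation, not a measured figure.

```java
/** Approximate memory for cardinality aggregations: precision_threshold * 8 bytes per bucket. */
final class CardinalityMemory {

    static final int MAX_PRECISION_THRESHOLD = 40_000;

    static long approxBytes(int precisionThreshold, long buckets) {
        if (precisionThreshold < 0 || precisionThreshold > MAX_PRECISION_THRESHOLD) {
            throw new IllegalArgumentException("precision_threshold must be between 0 and 40000");
        }
        if (buckets < 0) {
            throw new IllegalArgumentException("buckets must not be negative");
        }
        return (long) precisionThreshold * 8 * buckets;
    }

    public static void main(String[] args) {
        System.out.println("default, 1 bucket:  " + approxBytes(3_000, 1) + " bytes");
        System.out.println("max, 1 bucket:      " + approxBytes(40_000, 1) + " bytes");
        System.out.println("max, 1,000 buckets: " + approxBytes(40_000, 1_000) + " bytes");
    }
}
```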

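Problem 8: Unoptimized write path

// Wrong: indexing one document at a time
for (LogEntry log : entries) {
    esClient.index(idx -> idx.index("logs-app").document(log));
}
// One HTTP request per document, slow

// Right: bulk writes
BulkRequest.Builder br = new BulkRequest.Builder();
for (LogEntry log : entries) {
    br.operations(op -> op.index(idx -> idx.index("logs-app").document(log)));
}
BulkResponse resp = esClient.bulk(br.build());
if (resp.errors()) {
    // The bulk call can succeed as a whole while individual items fail
    resp.items().stream()
        .filter(item -> item.error() != null)
        .forEach(item -> System.err.println(item.error().reason()));
}

// Going further: BulkProcessor batches and sends concurrently
// (BulkProcessor belongs to the older High Level REST Client; the newer Java API client used above ships BulkIngester for the same job)
BulkProcessor bulkProcessor = BulkProcessor.builder((req, listener) -> ...)
    .setBulkActions(5000)              // flush at 5000 actions
    .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
    .setFlushInterval(TimeValue.timeValueSeconds(5))
    .setConcurrentRequests(4)
    .setBackoffPolicy(BackoffPolicy.exponentialBackoff())
    .build();

// source() takes a Map or a JSON string, not a POJO; logJson stands for the entry serialized to JSON (illustrative variable)
bulkProcessor.add(new IndexRequest("logs-app").source(logJson, XContentType.JSON));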
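
For readers who want to see what "flush at N actions or M bytes" amounts to, here is a minimal, client-independent sketch of the batching rule. It is not how BulkProcessor is implemented; `BulkBatcher` is a name invented for this post, and documents are assumed to be already serialized JSON strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Splits documents into batches bounded by action count and payload size. */
final class BulkBatcher {

    static List<List<String>> batches(List<String> jsonDocs, int maxActions, long maxBytes) {
        if (maxActions < 1 || maxBytes < 1) {
            throw new IllegalArgumentException("limits must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : jsonDocs) {
            long size = doc.getBytes(StandardCharsets.UTF_8).length;
            boolean full = current.size() >= maxActions || currentBytes + size > maxBytes;
            if (full && !current.isEmpty()) {
                out.add(current);              // close the batch before it exceeds a limit
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc);                  // an oversized document still gets its own batch
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("{\"m\":1}", "{\"m\":2}", "{\"m\":3}", "{\"m\":4}", "{\"m\":5}");
        // At most 2 actions per batch, with a generous byte limit
        System.out.println(batches(docs, 2, 5L * 1024 * 1024));
    }
}
```

Each resulting batch would then be sent as one bulk request.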

Before and after

Metric                        Before         After
======================================
Indices                       2000+          300
Primary shards                8000+          1200
Cluster startup recovery      25 min         5 min
Master node GC time           800ms          50ms
Query latency p99             5s             200ms
Write throughput              50k docs/s     250k docs/s
Disk usage                    30TB           18TB (norms off + compression)
Nodes                         9              9 (same hardware)

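Checklist

Use ILM to manage the index lifecycle automatically
Shard count = data volume / 30GB, not a guess
At most 600 shards per node
Strict mappings plus a field limit
index: false for fields that are never searched
norms off for text fields that do not need relevance scoring
Historical indices: refresh off and replicas set to zero
No wildcard / regexp queries; use match_phrase instead
Write through BulkProcessor
best_compression as the codec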

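The core of running an ES cluster is subtraction: fewer indices, fewer shards, fewer fields, simpler queries. This round of optimization brought the metadata burden down from astronomical to reasonable, and write throughput rose 5x on the same hardware. If your ES cluster also carries 5,000+ shards, it is worth spending a week to rework the index design from the ground up.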

—— 别看了 · 2026

