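In 2024 our search cluster ran Elasticsearch 7.10 on 6 data nodes, ingesting 800GB of logs per day and holding 50TB across 3 months of retention. One week, query latency jumped from a 200ms median to an 8s P99, the JVMs went into frequent Full GC, nodes dropped out, the cluster went from yellow to red, and 30% of search API calls timed out. We spent three weeks on cluster remediation: median latency came down to 80ms, the nodes stabilized, and the cluster has stayed green since. This post is a full retrospective of that performance work, covering index design, shard planning, JVM tuning, query optimization, hot/warm/cold tiering, write-path tuning, and monitoring and alerting.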
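The incident
Cluster: Elasticsearch 7.10
Nodes: 6 data nodes, 3 master nodes
Hardware: 32 cores / 128GB RAM / 4TB NVMe SSD
Log volume: 800GB/day, retained for 90 days
Index naming: logs-{service}-{date}, rolled daily
Shard layout: 5 primaries + 1 replica per index
Symptoms:
- Search latency went from P50 200ms / P99 1s to P50 1.5s / P99 8s
- Nodes hit frequent Full GCs, each pausing for 5-10 seconds
- Cluster status went green → yellow → red, with master re-elections
- Some nodes passed 95% disk usage and their indices were switched to read-only
- Kibana dashboards took 2 minutes to load and often timed out
Root-cause investigation (using the _cat APIs and node stats):
1. _cat/indices?v: 8000+ indices, many of them tiny (< 100MB)
2. _cat/shards?v: 60,000+ shards (10,000+ per data node on average)
3. _cat/nodes?v: heap above 90%, old gen full
4. _nodes/stats: frequent GC, 30GB of fielddata
Key problems:
- Small indices + many shards = cluster metadata explosion
- No mapping conventions, so large amounts of fielddata were loaded
- No hot/warm/cold tiering, so SSDs were wasted on old data
- No per-query limits, so large size values and deep pagination dragged the cluster down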
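Fix 1: index design + shard planning
# 1. Merge small indices (roll weekly or monthly instead of daily)
# Bad:  logs-app-2024-01-01, logs-app-2024-01-02, ... (365 indices/year)
# Good: logs-app-2024-w01, logs-app-2024-w02, ... (52 indices/year)
# 2. Target 20-50GB per shard
# Math: 800GB/day in one daily index with 5 primaries = 160GB/shard (too large)
# Adjusted: 800GB/day across 20 primaries = 40GB/shard (reasonable)
# 3. ILM policy (rollover)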
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "200gb",
            "max_docs": 200000000
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "allocate": {
            "include": { "data_tier": "warm" }
          },
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "include": { "data_tier": "cold" }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
# Index template
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 6,                  # moderate; don't let shard counts multiply (5 x 5 x 5 ...)
      "number_of_replicas": 1,
      "refresh_interval": "30s",              # relax near-real-time visibility (default 1s)
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs",
      "index.codec": "best_compression",      # DEFLATE instead of the default LZ4 (7.10 has no zstd codec)
      "index.translog.durability": "async",   # faster writes; a crash can lose up to sync_interval of acknowledged writes
      "index.translog.sync_interval": "30s"
    }
  }
}
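The shard arithmetic above is easy to get wrong when volumes change, so here is a minimal Python sketch of it. The function names (`primaries_for`, `shard_size_gb`, `total_shards`) are ours for illustration, and the 40GB default is simply the middle of the 20-50GB target range mentioned earlier.

```python
import math


def primaries_for(index_size_gb: float, target_shard_gb: float = 40.0) -> int:
    """Number of primary shards so that each shard is at most target_shard_gb."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def shard_size_gb(index_size_gb: float, primaries: int) -> float:
    """Average primary shard size for an index split into `primaries` shards."""
    return index_size_gb / primaries


def total_shards(indices: int, primaries: int, replicas: int = 1) -> int:
    """Cluster-wide shard copies: every primary plus its replicas."""
    return indices * primaries * (1 + replicas)


if __name__ == "__main__":
    print(shard_size_gb(800, 5))   # the original daily index: 160.0 GB per shard
    print(primaries_for(800, 40))  # 20 primaries for 800GB at 40GB each
    print(shard_size_gb(200, 6))   # the template above: 200GB rollover, 6 primaries
```

With the 200GB rollover and 6 primaries from the template, each shard lands at roughly 33GB, inside the target range.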
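Fix 2: mapping conventions
// Bad: dynamic mapping. Every new string field becomes text + a keyword sub-field,
// and aggregating or sorting on a text field needs fielddata, which is loaded onto the heap
{
  "mappings": {
    "dynamic": true   // dangerous: field explosion
  }
}
// Good: strict mapping, pick each type deliberately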
// PUT replaces the whole template, so in practice this body also carries the settings block shown earlier
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],   // required by the API
  "template": {
    "mappings": {
      "dynamic": "strict",   // reject documents with undefined fields
      "_source": {
        "enabled": true,
        "excludes": ["large_blob_field"]
      },
      "properties": {
        "@timestamp": { "type": "date" },
        "level": {
          "type": "keyword",   // not analyzed, aggregatable
          "ignore_above": 32
        },
        "service": {
          "type": "keyword",
          "ignore_above": 64
        },
        "message": {
          "type": "text",   // analyzed for full-text search
          "fields": {
            "keyword": {   // keyword sub-field alongside it
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "host_ip": {
          "type": "ip"   // the ip type supports range (CIDR) queries
        },
        "user_id": {
          "type": "keyword",
          "doc_values": true   // the default; needed for aggregations and sorting
        },
        "request_size": {
          "type": "long"
        },
        "trace_id": {
          "type": "keyword",
          "doc_values": false   // never aggregated or sorted on, saves disk
        },
        "request_headers": {
          "type": "object",
          "enabled": false   // not indexed at all (still present in _source)
        },
        "geoip": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}
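// Rules of thumb:
// - Identifiers (id/level/service) → keyword
// - Full-text fields (message) → text + keyword sub-field
// - Numbers → long/double
// - Timestamps → date
// - Large fields that are never queried → enabled: false or _source excludes
// - Use ignore_above so overlong strings are not indexed into keyword fields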
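These rules are simple enough to check mechanically before a template ships. The following is a minimal sketch of such a check in Python; `lint_mapping` and its warning strings are hypothetical names of ours, and it only covers three of the rules (strict dynamic mapping, no fielddata on text, ignore_above on keyword).

```python
from typing import Any, Dict, List


def lint_mapping(mappings: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for an Elasticsearch `mappings` dict."""
    warnings: List[str] = []
    if mappings.get("dynamic") != "strict":
        warnings.append("dynamic is not 'strict'")
    _walk(mappings.get("properties", {}), "", warnings)
    return warnings


def _walk(props: Dict[str, Any], prefix: str, warnings: List[str]) -> None:
    for name, spec in props.items():
        path = f"{prefix}{name}"
        field_type = spec.get("type")
        if field_type == "text" and spec.get("fielddata") is True:
            warnings.append(f"{path}: fielddata enabled on text field")
        if field_type == "keyword" and "ignore_above" not in spec:
            warnings.append(f"{path}: keyword without ignore_above")
        # Recurse into object sub-properties and multi-fields.
        _walk(spec.get("properties", {}), f"{path}.", warnings)
        _walk(spec.get("fields", {}), f"{path}.", warnings)


if __name__ == "__main__":
    risky = {
        "dynamic": True,
        "properties": {
            "message": {"type": "text", "fielddata": True},
            "user_id": {"type": "keyword"},
        },
    }
    for warning in lint_mapping(risky):
        print(warning)
```

A check like this could run in CI against the template JSON, so that a risky mapping is caught in review rather than in production.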
Fix 3: JVM tuning
# jvm.options
# 1. Heap at 26GB (below 32GB so compressed oops stay enabled)
-Xms26g
-Xmx26g
# 2. GC: G1 (replaces CMS, better suited to large heaps)
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
# 3. GC logging
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# 4. Direct (off-heap) memory, about half the heap
# (kept on its own line: jvm.options only treats lines starting with # as comments)
-XX:MaxDirectMemorySize=13g
# 5. Heap dump on OOM
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
# elasticsearch.yml key settings
# 1. fielddata limits
indices.fielddata.cache.size: 30%
indices.breaker.fielddata.limit: 40%
# 2. Request circuit breaker
indices.breaker.request.limit: 60%
# 3. Parent (total) breaker
indices.breaker.total.limit: 70%
# 4. Indexing buffer (a static node setting, so it cannot be set through _cluster/settings)
indices.memory.index_buffer_size: 20%
# 5. The _all field (older versions) needs no action in 7.x
# 6. Cluster-level settings
PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 65000,   # cap on aggregation buckets
    "indices.recovery.max_bytes_per_sec": "200mb",
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}
# OS level
# 1. Disable swap and lock the heap in memory
swapoff -a
echo "bootstrap.memory_lock: true" >> elasticsearch.yml
# 2. File descriptors
ulimit -n 65536
# 3. Virtual memory (mmap count)
sysctl -w vm.max_map_count=262144
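To make the percentages concrete, here is a small Python sketch that turns the heap rule and the breaker settings into absolute numbers. It assumes the common guidance of giving the heap at most half of physical RAM, capped at the 26GB used in this post; `heap_gb` and `heap_budget_gb` are illustrative names of ours, not Elasticsearch APIs.

```python
def heap_gb(ram_gb: int, cap_gb: int = 26) -> int:
    """Half of physical RAM, capped below the ~32GB compressed-oops limit."""
    return min(ram_gb // 2, cap_gb)


def heap_budget_gb(heap: int, percent: int) -> float:
    """Convert a percentage-of-heap setting into GB."""
    return round(heap * percent / 100, 1)


if __name__ == "__main__":
    heap = heap_gb(128)  # the 128GB nodes from this post
    for name, pct in [
        ("indices.fielddata.cache.size", 30),
        ("indices.breaker.fielddata.limit", 40),
        ("indices.breaker.request.limit", 60),
        ("indices.breaker.total.limit", 70),
    ]:
        print(f"{name}: {pct}% of {heap}GB = {heap_budget_gb(heap, pct)}GB")
```

On a 26GB heap the fielddata cache is bounded at about 7.8GB per node, which is the point of setting the limit: fielddata can no longer grow until the old generation is full.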
Fix 4: query optimization
// 1. Constrain the query range (an @timestamp filter is mandatory)
{
  "query": {
    "bool": {
      "filter": [   // filter clauses skip scoring and can be cached
        { "range": { "@timestamp": { "gte": "now-15m" } } },
        { "term": { "service": "api" } }
      ],
      "must": [
        { "match": { "message": "error" } }
      ]
    }
  },
  "size": 50,   // keep size small
  "_source": ["@timestamp", "level", "message"],   // fetch only the fields you need
  "sort": [{ "@timestamp": "desc" }]
}
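// 2. Avoid deep pagination with from + size
// Bad: page 2000 at 50 hits per page
{ "from": 99950, "size": 50 }   // every shard must collect and sort its top 100,000 hits; also rejected by the default index.max_result_window of 10,000
// Good: scroll or search_after
// search_after (recommended, works on live data): pass the sort values of the previous page's last hit
// _id is the tiebreaker here; sorting on _id loads fielddata, so prefer a keyword field with doc_values if you have one
{
  "size": 50,
  "sort": [
    { "@timestamp": "desc" },
    { "_id": "desc" }
  ],
  "search_after": ["2024-05-19T10:00:00Z", "abc123"]
}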
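// 3. Bound aggregations
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": {
        "field": "service",
        "size": 100,   // set it explicitly: the default is 10, and size: 0 ("all buckets") is no longer supported
        "execution_hint": "map"   // only worth it when the query matches few documents; otherwise keep the default global_ordinals
      }
    }
  }
}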
// 4. Use routing so a query hits fewer shards
// At index time
POST /logs-app-w20/_doc?routing=user-123
{ "user_id": "user-123", "message": "..." }
// At query time
GET /logs-app-w20/_search?routing=user-123
{ "query": { "term": { "user_id": "user-123" } } }
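// 5. Skip what you do not need
GET /logs-app-w20/_search
{
  "track_total_hits": false,   // skip the exact total count (expensive)
  "_source": false,   // do not load _source
  "docvalue_fields": ["@timestamp"]   // read from doc values; stored_fields would return nothing because the mapping does not set store: true
}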
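The search_after pattern from point 2 is mostly bookkeeping: each request reuses the sort values of the last hit from the previous page. Here is a minimal Python sketch of that loop. To keep it self-contained, the actual HTTP call is injected as a `search` callable that takes a request body and returns the parsed response; `iter_hits` and `max_pages` are names we made up for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def iter_hits(
    search: SearchFn,
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    page_size: int = 50,
    max_pages: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page, resuming from the last hit's sort values."""
    after: Optional[List[Any]] = None
    for _ in range(max_pages):  # hard cap so a runaway export cannot loop forever
        body: Dict[str, Any] = {
            "size": page_size,
            "query": query,
            "sort": sort,
            "track_total_hits": False,
        }
        if after is not None:
            body["search_after"] = after
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Every hit carries the values it was sorted by; they are the cursor.
        after = hits[-1]["sort"]
```

With the official Python client, `search` could be something like `lambda body: es.search(index="logs", body=body)`, assuming a 7.x client where `body` is accepted. The sort must end in a tiebreaker field, otherwise hits that share a timestamp can be skipped or repeated across pages.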
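Fix 5: hot/warm/cold architecture
# Node roles + tier attribute (elasticsearch.yml, one block per node type)
# hot nodes (NVMe SSD, take all new writes)
node.roles: [data_hot, data_content, ingest]
node.attr.data_tier: hot
# warm nodes (SAS SSD, mid-term data)
node.roles: [data_warm]
node.attr.data_tier: warm
# cold nodes (HDD, long-term archive)
node.roles: [data_cold]
node.attr.data_tier: cold
# Cluster layout
# 3 hot nodes (32C/128G/4TB NVMe)
# 3 warm nodes (16C/64G/8TB SAS SSD)
# 3 cold nodes (8C/32G/16TB HDD)
# master-eligible: 3 dedicated masters (8C/16G, no data)
# ILM flow: hot until 2 days after rollover → warm until day 30 → cold until day 90 → delete
# Data moves between tiers automatically, no manual work
# Querying cold data (frozen indices: heap is released and shards are loaded on demand at search time)
POST /logs-app-2024-w01/_freeze   # freeze the index (frees heap)
POST /logs-app-2024-w01/_search?ignore_throttled=false
{ "query": { "term": { "service": "api" } } }
# Results:
# - hot nodes hold the last 2 days, so the NVMe SSDs serve only live traffic
# - warm nodes hold data up to 30 days old, medium performance
# - cold nodes hold data up to 90 days old, capacity first
# - SSD usage down 70%, total cost down 50%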
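The tier an index belongs to follows directly from the min_age values in the ILM policy (2d, 30d, 90d, counted from rollover). As a sanity check when reading `_cat/shards` output, that mapping can be sketched in a few lines of Python; `PHASES` and `phase_for_age` are illustrative names of ours, and ILM itself applies phases on its own polling schedule, so the real transition can lag slightly.

```python
from typing import List, Tuple

# (minimum age in days since rollover, phase), in ascending order,
# mirroring the min_age values of logs_policy
PHASES: List[Tuple[float, str]] = [
    (0, "hot"),
    (2, "warm"),
    (30, "cold"),
    (90, "delete"),
]


def phase_for_age(days_since_rollover: float) -> str:
    """Phase an index should be in, given its age since rollover."""
    if days_since_rollover < 0:
        raise ValueError("age cannot be negative")
    current = PHASES[0][1]
    for min_age, phase in PHASES:
        if days_since_rollover >= min_age:
            current = phase
    return current


if __name__ == "__main__":
    for age in (0.5, 2, 29, 30, 90):
        print(age, phase_for_age(age))
```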
Fix 6: write-path optimization
# 1. Bulk API with a sensible batch size
# Bad: one HTTP request per doc (about 100x slower)
# Good: bulk batches of 2000-5000 docs
POST /_bulk
{ "index": { "_index": "logs-app-w20" } }
{ "@timestamp": "...", "message": "..." }
{ "index": { "_index": "logs-app-w20" } }
{ "@timestamp": "...", "message": "..." }
...
# 2. Raise refresh_interval
# The default 1s gives near-real-time search at a high IO cost
# For logs, 30s doubles throughput
PUT /logs-*/_settings
{ "refresh_interval": "30s" }
# 3. For bulk imports, temporarily disable refresh and replicas
PUT /logs-import/_settings
{
  "refresh_interval": "-1",
  "number_of_replicas": 0
}
# Restore once the import is done
PUT /logs-import/_settings
{
  "refresh_interval": "30s",
  "number_of_replicas": 1
}
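# 4. Buffer through Logstash / Fluent Bit
# They retry failed writes, which avoids data loss
# 5. Throttle writes (avoid overloading nodes)
# Logstash output. Recent versions of the elasticsearch output no longer accept
# flush_size, idle_flush_time or workers; batching and concurrency are set in logstash.yml
output {
  elasticsearch {
    hosts => ["es-1:9200", "es-2:9200"]
  }
}
# logstash.yml (pipeline.batch.delay, in milliseconds, controls how long an undersized batch waits)
pipeline.workers: 4
pipeline.batch.size: 5000
# Write performance monitoring
GET /_nodes/stats/indices/indexing
# index_time_in_millis / index_total = average time per indexed document (ms)
# Watch that indexing.is_throttled stays false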
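For producers that write directly rather than through Logstash, the batching from point 1 can be sketched as follows. This is a minimal Python example that only builds the NDJSON bodies for `POST /_bulk`; sending them and handling per-item errors is left out, and `chunked` and `bulk_body` are names of ours.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group documents into lists of at most batch_size."""
    batch: List[Dict[str, Any]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, docs: List[Dict[str, Any]]) -> str:
    """NDJSON body for POST /_bulk: one action line plus one source line per doc."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


if __name__ == "__main__":
    events = ({"message": f"event {i}"} for i in range(5))
    for batch in chunked(events, batch_size=2):
        print(bulk_body("logs-app-w20", batch), end="")
```

In production the batch size would be in the 2000-5000 range given above, and the response's `errors` flag should be checked, since a bulk request can succeed as a whole while individual items fail.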
Fix 7: monitoring and alerting
# Prometheus + elasticsearch_exporter
# Key alerts
- alert: ESClusterStatusRed
  expr: elasticsearch_cluster_health_status{color="red"} == 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "ES cluster is red: unassigned primary shards"
- alert: ESClusterStatusYellow
  expr: elasticsearch_cluster_health_status{color="yellow"} == 1
  for: 10m
  annotations:
    summary: "ES cluster yellow for > 10min (unassigned replicas)"
- alert: ESNodeHeapHigh
  expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
  for: 5m
  annotations:
    summary: "Node {{ $labels.name }} heap > 85%"
- alert: ESFullGCFrequent
  # rate() is per second, so scale to per minute before comparing
  expr: rate(elasticsearch_jvm_gc_collection_seconds_count{gc="old"}[5m]) * 60 > 1
  for: 5m
  annotations:
    summary: "Node {{ $labels.name }}: frequent old-gen GC (> 1 per minute)"
- alert: ESDiskHigh
  # the exporter reports available and total size, so usage is derived
  expr: 1 - elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes > 0.85
  annotations:
    summary: "Node {{ $labels.name }} disk > 85%"
- alert: ESShardsUnassigned
  expr: elasticsearch_cluster_health_unassigned_shards > 0
  for: 5m
  annotations:
    summary: "{{ $value }} shards unassigned"
- alert: ESSlowQuery
  # node stats are cumulative counters, not a histogram, so this is the mean query time;
  # a true P99 has to come from client-side or gateway metrics
  expr: rate(elasticsearch_indices_search_query_time_seconds[5m]) / rate(elasticsearch_indices_search_query_total[5m]) > 5
  for: 5m
  annotations:
    summary: "Mean search latency > 5s"
# Slowlog (slow query log)
PUT /logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "5s"
}
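Node stats only expose cumulative totals, so any latency figure derived from them is a mean over the interval between two samples. A minimal Python sketch of that calculation follows, using the real field names `query_total` / `query_time_in_millis` (search) and `index_total` / `index_time_in_millis` (indexing); the helper `mean_ms` and the sample numbers are ours.

```python
from typing import Mapping, Optional


def mean_ms(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    time_key: str,
    count_key: str,
) -> Optional[float]:
    """Mean milliseconds per operation between two cumulative stats samples."""
    ops = curr[count_key] - prev[count_key]
    if ops <= 0:
        return None  # no traffic in the window, or counters reset by a restart
    return (curr[time_key] - prev[time_key]) / ops


if __name__ == "__main__":
    # Two samples of indices.search from GET /_nodes/stats, taken a minute apart
    # (made-up numbers for illustration).
    before = {"query_total": 1000, "query_time_in_millis": 80000}
    after = {"query_total": 1500, "query_time_in_millis": 120000}
    print(mean_ms(before, after, "query_time_in_millis", "query_total"))
```

The same function works for indexing by passing `index_time_in_millis` and `index_total`. Because it is a mean, it hides tail latency, which is why the slowlog thresholds above are still needed.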
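Results
Metric                 Before              After
=========================================================================
Cluster status         often yellow/red    steadily green
Index count            8000+               800 (merging + ILM)
Shard count            60,000+             6000 (about 1000 per node)
Node heap              90%+                60-70%
Full GC frequency      5-10/min            < 0.1/min
Search P50 latency     1.5s                80ms
Search P99 latency     8s                  400ms
Indexing throughput    50k/s               150k/s (bulk + 30s refresh)
Storage footprint      50TB                18TB (compression + ILM deletion)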
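Cost comparison:
- All NVMe SSD: 800k per year
- Tiered (NVMe + SSD + HDD): 350k per year (56% lower)
- Data nodes went from 6 to 9 (tiered; capacity +50%, cost -50%)
Business impact:
- Kibana dashboards open in seconds and log analysis is smooth
- Time to locate a fault dropped from 30min to 3min
- 90 days of history stays searchable, which passed the compliance audit
- SRE on-call is no longer paged by ES alerts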
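Pitfall checklist
- 20-50GB per shard is reasonable; merge small indices into weekly or monthly ones
- Mappings must be strict; never let dynamic mapping cause field explosion
- Keep keyword and text distinct; set enabled: false on large fields
- Heap < 32GB (compressed oops), G1GC, memory_lock
- Every query carries an @timestamp filter; track_total_hits=false
- Use search_after for deep pagination; ban large from+size values
- ILM hot/warm/cold/delete; tiering saves 50% of the cost
- refresh_interval 30s for logs; disable refresh during bulk imports
- Always cap fielddata and the circuit breakers to prevent OOM
- slowlog + Prometheus alerts are mandatory, starting with the big three: cluster status, GC, disk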
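Summary
Elasticsearch cluster tuning is systems work: index design, JVM, queries, writes, and tiering each have their own pitfalls. The biggest shift in our thinking was that small indices do more damage than large ones: 8000 indices of 100MB each blow up the cluster metadata the master nodes have to manage, while merging them into 800 indices of 1GB each actually performs better. The most underrated piece is ILM hot/warm/cold tiering. Many teams run one hardware profile for the whole cluster, with new and old data mixed on NVMe, which is both expensive and slow; after splitting into three tiers, SSD usage dropped 70% and cost was cut in half. The easiest trap is an unregulated mapping: dynamic mapping + high-cardinality fields = fielddata filling the heap → Full GC → nodes dropping out, and strict mapping + ignore_above + enabled: false is the real cure. Finally, query limits are mandatory (track_total_hits=false, a size cap, no deep pagination), otherwise one bad query from a newcomer can take the whole cluster down. In essence this is the same class of problem as high-cardinality control in Prometheus.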
—— 别看了 · 2026