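In 2024 our production Kubernetes cluster had 120 nodes running 800+ microservices. A cost review turned up a contradiction: average CPU utilization across the cluster was only 18%, yet Pods regularly failed to schedule because nodes were out of memory, peak traffic brought frequent OOM kills and evictions, and before every major sales promotion the only answer was to add machines blindly. We spent three weeks on resource and scheduling governance. Cluster CPU utilization rose to 45%, the node count dropped from 120 to 75, peak load now scales out automatically, and stability improved rather than suffered. This post is a full retrospective of that work: resource requests and limits, HPA, VPA, and scheduling policy.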
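Background
Cluster: Kubernetes 1.28, 120 nodes (32 cores / 128 GB each)
Workloads: 800+ Deployments, 5000+ Pods
Core contradictions:
- Average node CPU utilization was 18% (wasted capacity)
- Node memory was "full" while CPU sat idle (unbalanced scheduling)
- Pods were OOMKilled / Evicted at peak
- Scaling for promotions meant hand-editing replica counts by gut feeling
What the investigation found:
1. Many Pods had no requests/limits
   $ kubectl get pods -A -o json | jq '[.items[] |
       select(any(.spec.containers[]; .resources.requests == null))] | length'
   → 1800+ Pods running with no requests at all
2. Where values were set, they were inaccurate
   - requests written by guesswork as 4 cores / 8 GB, with real usage around 0.3 cores / 1 GB
   - limits == requests, leaving no room to burst
3. Scheduling piled Pods onto a handful of nodes
   - No affinity / anti-affinity, so replicas of one service crowded onto a single node
   - One node failure took out most instances of a service
4. No HPA; replica counts were hard-coded by hand
   - Wasteful off-peak, insufficient at peak
5. Inflated requests made nodes "look full"
   - Requests on a node summed to 95% of capacity while real usage was 18%
   - The scheduler concluded there was no room, so new Pods stayed Pending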
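Finding 5 is easier to see with numbers. The sketch below is a simplified model, not the real scheduler: it assumes a hypothetical node whose allocatable CPU is a full 32 cores (in practice some capacity is reserved for the system) and only looks at CPU. The point it illustrates is that the fit check uses requests, never live usage.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu: float  # cores the node can hand out to Pods
    requested_cpu: float    # sum of CPU requests of Pods already placed
    used_cpu: float         # cores actually consumed, taken from monitoring


def fits(node: Node, pod_request_cpu: float) -> bool:
    """Scheduler-style fit check: only requests count, real usage is ignored."""
    return node.requested_cpu + pod_request_cpu <= node.allocatable_cpu


# Hypothetical node matching the ratios above: 95% requested, 18% used.
node = Node(allocatable_cpu=32.0, requested_cpu=32.0 * 0.95, used_cpu=32.0 * 0.18)

idle_cores = node.allocatable_cpu - node.used_cpu
print("4-core Pod fits:", fits(node, 4.0))      # rejected on requests alone
print("idle cores:", round(idle_cores, 2))      # while most of the node is idle
```

A Pod asking for 4 cores is rejected even though more than 26 cores are idle, which is why lowering inflated requests frees capacity without adding hardware.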
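Fix 1: Set requests / limits accurately
# Core concepts:
# requests = what the scheduler places Pods by, and the amount reserved for the container on the node
# limits   = hard ceiling (above it, CPU is throttled and memory gets the container OOMKilled)
# Excerpt: selector, Pod labels, and image are omitted
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "500m"       # measured average + buffer
              memory: "1Gi"     # measured peak (memory is not compressible!)
            limits:
              cpu: "2000m"      # allows bursting to 4x (CPU is compressible)
              memory: "1.5Gi"   # somewhat above requests as a safety margin
# Key principles:
# 1. CPU: set requests to the average and leave burst room in limits (CPU is a compressible resource)
# 2. Memory: requests ≈ limits (memory is not compressible; exceeding the limit means OOMKill)
#    Do not set memory limits far above requests, or the node gets overcommitted and Pods are evicted
# 3. requests and limits together determine the QoS class:
#    Three QoS classes (roughly the eviction order when a node runs short of resources)
#    - Guaranteed: every container has CPU and memory requests == limits (evicted last) → core services
#    - Burstable:  some requests/limits set, but not Guaranteed (evicted next) → ordinary services
#    - BestEffort: nothing set (evicted first) → banned in production
# How to get the numbers right: look at monitoring history
#    CPU requests           = P50 usage
#    CPU limits             = P99 usage × 1.5
#    Memory requests/limits = P99 usage × 1.2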
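The sizing rules above can be sketched as a small calculation. This is an illustration under assumptions: the sample values are made up, the nearest-rank percentile is one of several common definitions, and `recommend` is a hypothetical helper name. In practice the samples would come from the monitoring system over a representative window.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend(cpu_millicores: list[float], memory_mib: list[float]) -> dict[str, int]:
    """Apply the rules: CPU request = P50, CPU limit = P99 x 1.5, memory = P99 x 1.2."""
    cpu_request = percentile(cpu_millicores, 50)
    cpu_limit = percentile(cpu_millicores, 99) * 1.5
    memory = percentile(memory_mib, 99) * 1.2
    return {
        "cpu_request_m": math.ceil(cpu_request),
        "cpu_limit_m": math.ceil(cpu_limit),
        "memory_request_mib": math.ceil(memory),
        "memory_limit_mib": math.ceil(memory),
    }


# Made-up usage samples, for illustration only.
cpu = [200, 250, 300, 320, 350, 400, 450, 500, 900, 1200]
mem = [700, 720, 750, 760, 780, 800, 820, 850, 900, 1000]
result = recommend(cpu, mem)
print(result)
```

With these samples the CPU request lands on the median (350m) while the limit follows the tail (1800m), and memory request and limit are identical, which matches the "memory requests ≈ limits" principle.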
# Let VPA compute the numbers in recommendation mode (it only suggests, it does not change Pods)
# VPA is an add-on from the kubernetes/autoscaler project and must be installed in the cluster first
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"   # Off = recommend only, never modify
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed: { cpu: "100m", memory: "128Mi" }
        maxAllowed: { cpu: "4", memory: "8Gi" }
# View the VPA recommendation
# kubectl describe vpa order-service-vpa
# Recommendation:
#   Target: cpu: 480m memory: 1100Mi   ← size requests from this
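Since the QoS class follows mechanically from requests and limits, it can help to see the rule as code. The following is a minimal sketch under assumptions: `qos_class` is a hypothetical helper, containers are plain dicts, only `cpu` and `memory` are considered, and quantities are compared as strings (so "500m" and "0.5" would wrongly count as different).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' resources (cpu and memory only)."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            if resource in requests or resource in limits:
                any_set = True
            # A missing request defaults to the limit, so only the limit is mandatory.
            if resource not in limits or requests.get(resource, limits[resource]) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The order-service container from this section: requests below limits.
order_service = [{
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "2000m", "memory": "1.5Gi"},
}]
print(qos_class(order_service))
```

Note that the order-service example as configured is Burstable; to make a core service Guaranteed, every container needs CPU and memory requests equal to its limits.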
Fix 2: Horizontal autoscaling with HPA
# HPA: adds or removes Pods automatically based on load
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 4
  maxReplicas: 50
  metrics:
    # 1. CPU utilization (as a percentage of requests)
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out once average CPU exceeds 60% of requests
    # 2. Memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    # 3. Custom metric (QPS / queue length; requires prometheus-adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"     # scale out once a Pod averages more than 1000 QPS
  # Scaling behavior (anti-flapping)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30    # 30s observation window
      policies:
        - type: Percent
          value: 100                    # at most double per period
          periodSeconds: 60
        - type: Pods
          value: 10                     # or at most 10 more Pods per period
          periodSeconds: 60
      selectPolicy: Max                 # use whichever policy allows the larger change
    scaleDown:
      stabilizationWindowSeconds: 300   # scale in more conservatively: observe for 5 min
      policies:
        - type: Percent
          value: 20                     # remove at most 20% per period
          periodSeconds: 60
# Custom metrics require prometheus-adapter
# kubectl apply -f prometheus-adapter.yaml
# HPA then reads http_requests_per_second from Prometheus through the adapter
# Testing the HPA
# kubectl get hpa order-service-hpa -w
# NAME                TARGETS            MINPODS   MAXPODS   REPLICAS
# order-service-hpa   45%/60%, 800/1k    4         50        4
# Under load test:
# order-service-hpa   85%/60%, 1.5k/1k   4         50        12
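How the HPA turns those TARGETS readings into a replica count can be sketched with the documented formula, desired = ceil(current × metric / target), evaluated per metric with the largest proposal winning. This is a simplified model: `desired_replicas` is a hypothetical helper, the 10% tolerance is the controller's default, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current: int, metrics: list[tuple[float, float]],
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """metrics is a list of (current_value, target_value) pairs."""
    proposals = []
    for value, target in metrics:
        ratio = value / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    wanted = max(proposals)            # the metric asking for the most replicas wins
    return max(min_replicas, min(max_replicas, wanted))


# Readings from the two sample lines above: (CPU %, target %) and (QPS, target QPS).
idle = desired_replicas(4, [(45, 60), (800, 1000)], min_replicas=4, max_replicas=50)
loaded = desired_replicas(4, [(85, 60), (1500, 1000)], min_replicas=4, max_replicas=50)
print(idle, loaded)
```

A single evaluation from 4 replicas at 85%/60% proposes 6. The controller re-evaluates every sync period, so under sustained load the count keeps climbing until the metrics fall back to target, bounded by `maxReplicas` and the `scaleUp` policies.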
Fix 3: Scheduling policy (affinity / anti-affinity / topology)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      # 1. Pod anti-affinity: spread replicas of the same service across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: order-service
                topologyKey: kubernetes.io/hostname   # spread per node
        # 2. Node affinity: run on SSD nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disktype
                    operator: In
                    values: ["ssd"]
      # 3. Topology spread constraints: distribute evenly across availability zones
      topologySpreadConstraints:
        - maxSkew: 1                      # Pod counts per zone differ by at most 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: order-service
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: order-service
      # 4. Toleration for the dedicated node pool's taint
      #    (a toleration only permits scheduling there; node affinity is what requires it)
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "order"
          effect: "NoSchedule"
      # 5. Priority (scheduled first when resources are tight)
      priorityClassName: high-priority
---
# PriorityClass definition
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Core transaction services"
---
# When resources are tight, lower-priority Pods are preempted
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
description: "Offline jobs, preemptible"
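The `maxSkew` rule in the topology spread constraints is worth a worked example. The sketch below models only the `DoNotSchedule` case and ignores everything else the scheduler weighs: a new Pod may go to a zone only if that zone's count after placement, minus the smallest count across zones, stays within `maxSkew`. The zone names and counts are hypothetical.

```python
def eligible_domains(pods_per_domain: dict[str, int], max_skew: int) -> list[str]:
    """Domains where one more Pod keeps (count - global minimum) within max_skew."""
    lowest = min(pods_per_domain.values())
    return sorted(
        domain for domain, count in pods_per_domain.items()
        if count + 1 - lowest <= max_skew
    )


# Hypothetical current placement of order-service Pods across three zones.
zones = {"zone-a": 3, "zone-b": 2, "zone-c": 2}
candidates = eligible_domains(zones, max_skew=1)
print(candidates)
```

Here zone-a is excluded because a fourth Pod there would put it 2 ahead of the emptiest zone. With `ScheduleAnyway`, as on the hostname constraint above, the same skew becomes a scoring preference instead of a hard filter.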
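Fix 4: Node elasticity (Cluster Autoscaler)
# Pods scaled out but there are not enough nodes? Cluster Autoscaler (CA) adds nodes automatically
# Excerpt: selector, labels, service account, and cloud credentials are omitted
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=alicloud
            - --nodes=3:30:nodepool-order              # node pool sized between 3 and 30 nodes
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m         # wait 10 min after adding a node before considering scale-in
            - --scale-down-unneeded-time=10m           # a node must be unneeded for 10 min before removal
            - --scale-down-utilization-threshold=0.5   # candidate for removal when requests / allocatable < 50%
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste                   # pick the node group that leaves the least idle CPU/memory
# Workflow:
# 1. A Pod is Pending (no node can take it)
# 2. CA notices → calls the cloud API to add a node
# 3. The new node becomes Ready → the Pod is scheduled
# 4. A node stays underutilized → CA scales in (evicts its Pods first, then deletes the node)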
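One detail in step 4 is easy to miss: the scale-down threshold is evaluated against requests, not live usage. The sketch below is an approximation of that check under assumptions: `scale_down_candidate` is a hypothetical helper, utilization is taken as the larger of the CPU and memory request ratios, and the other conditions CA applies (the unneeded-time delay, whether the Pods can be rescheduled elsewhere, PDBs) are not modeled.

```python
def scale_down_candidate(requested_cpu: float, allocatable_cpu: float,
                         requested_mem_gib: float, allocatable_mem_gib: float,
                         threshold: float = 0.5) -> bool:
    """True when the node's request-based utilization is below the threshold."""
    utilization = max(requested_cpu / allocatable_cpu,
                      requested_mem_gib / allocatable_mem_gib)
    return utilization < threshold


# Hypothetical 32-core / 128 GiB node, ignoring system reservations.
memory_heavy = scale_down_candidate(10, 32, 100, 128)  # CPU 31%, memory 78%
lightly_used = scale_down_candidate(10, 32, 40, 128)   # CPU 31%, memory 31%
print(memory_heavy, lightly_used)
```

This is another reason accurate requests matter: a node whose Pods request 95% of its capacity is never a removal candidate, however idle it really is.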
# === Karpenter (a newer alternative; built for AWS, with providers for other clouds) ===
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # nodeClassRef (required, provider-specific) is omitted in this excerpt
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow both; spot is preferred to save money
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized   # consolidate automatically to reduce fragmentation
    expireAfter: 720h                        # replace nodes after 30 days
---
# PodDisruptionBudget: keeps the service available during scale-in
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 75%   # at least 75% of Pods must stay available
  selector:
    matchLabels:
      app: order-service
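What `minAvailable: 75%` allows in practice depends on the replica count, because the percentage is rounded up to a whole number of Pods. A minimal sketch of that arithmetic, with `allowed_disruptions` as a hypothetical helper name:

```python
import math


def allowed_disruptions(expected_pods: int, healthy_pods: int,
                        min_available_percent: int) -> int:
    """Voluntary evictions the budget permits right now."""
    desired_healthy = math.ceil(expected_pods * min_available_percent / 100)
    return max(0, healthy_pods - desired_healthy)


at_min_replicas = allowed_disruptions(4, 4, 75)     # HPA floor of 4 replicas
at_max_replicas = allowed_disruptions(50, 50, 75)   # HPA ceiling of 50 replicas
one_already_down = allowed_disruptions(4, 3, 75)    # one Pod already unhealthy
print(at_min_replicas, at_max_replicas, one_already_down)
```

At the HPA floor of 4 replicas only one Pod may be evicted at a time, and none if one is already unhealthy, so a node drain can stall until the Pod recovers. That is the intended trade-off: slower scale-in in exchange for availability.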
Fix 5: Resource quotas + LimitRange
# 1. ResourceQuota: caps total resources per namespace (stops one team from filling the cluster)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: order-team-quota
  namespace: order
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 400Gi
    limits.cpu: "400"
    limits.memory: 600Gi
    pods: "500"
    persistentvolumeclaims: "50"
---
# 2. LimitRange: defaults plus upper and lower bounds for containers in the namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: order-limit-range
  namespace: order
spec:
  limits:
    - type: Container
      default:                # default limits when none are set
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:         # default requests when none are set
        cpu: "200m"
        memory: "256Mi"
      max:                    # per-container upper bound
        cpu: "8"
        memory: "16Gi"
      min:                    # per-container lower bound
        cpu: "50m"
        memory: "64Mi"
      maxLimitRequestRatio:   # upper bound on limits / requests
        cpu: "10"
        memory: "2"           # memory limits may not exceed 2x requests
# Effect:
# - Pods created without requests/limits get the defaults automatically (no more BestEffort);
#   this applies at Pod creation, so existing Pods pick it up when they are recreated
# - Blocks absurd requests such as 100 cores / 200 GB for a single Pod
# - Forces memory limits/requests ≤ 2, which prevents overcommit-driven evictions
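The three effects above can be traced through a simplified model of what happens at admission. This is a sketch under assumptions, not the real admission plugin: `admit` is a hypothetical helper, quantities are plain integers in millicores and MiB, and only a single container is considered.

```python
# Values mirror the LimitRange above, in millicores (cpu) and MiB (memory).
LIMIT_RANGE = {
    "cpu": {"default": 500, "default_request": 200, "min": 50, "max": 8000, "max_ratio": 10},
    "memory": {"default": 512, "default_request": 256, "min": 64, "max": 16384, "max_ratio": 2},
}


def admit(requests: dict[str, int], limits: dict[str, int]):
    """Return (requests, limits, errors) after defaulting and validation."""
    requests, limits = dict(requests), dict(limits)
    errors = []
    for name, rule in LIMIT_RANGE.items():
        # Defaulting: a missing request copies an explicit limit, else uses defaultRequest.
        if name not in requests:
            requests[name] = limits[name] if name in limits else rule["default_request"]
        if name not in limits:
            limits[name] = rule["default"]
        req, lim = requests[name], limits[name]
        if req < rule["min"]:
            errors.append(f"{name} request {req} is below min {rule['min']}")
        if lim > rule["max"]:
            errors.append(f"{name} limit {lim} is above max {rule['max']}")
        if req > lim:
            errors.append(f"{name} request {req} exceeds limit {lim}")
        elif req > 0 and lim / req > rule["max_ratio"]:
            errors.append(f"{name} limit/request ratio {lim / req:g} exceeds {rule['max_ratio']}")
    return requests, limits, errors


bare = admit({}, {})                                     # nothing set
oversized = admit({"cpu": 100000, "memory": 204800},     # the 100C / 200G case
                  {"cpu": 100000, "memory": 204800})
stretched = admit({"memory": 256}, {"memory": 1024})     # memory limit is 4x the request
print(bare, oversized[2], stretched[2], sep="\n")
```

One consequence worth knowing: a container that sets only a request larger than the default limit (for example 1000m CPU against the 500m default) is rejected, because the defaulted limit ends up below the request.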
Fix 6: Monitoring and alerting
# Prometheus + kube-state-metrics + metrics-server
# The entries below belong under groups[].rules in a Prometheus rule file
# 1. Cluster utilization
- record: cluster:cpu_utilization
  expr: |
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      / sum(kube_node_status_allocatable{resource="cpu"})
# 2. Gap between requests and actual usage (finds inflated requests)
#    container!="" avoids double counting the Pod-level cgroup series
- alert: PodRequestsTooHigh
  expr: |
    (sum by(namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
      - sum by(namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h])))
      / sum by(namespace, pod) (kube_pod_container_resource_requests{resource="cpu"}) > 0.7
  for: 24h
  annotations:
    summary: "{{ $labels.pod }} CPU requests are inflated by 70%+, consider lowering them"
# 3. OOMKilled
#    The last-terminated reason stays set after the kill, so it is paired with a recent restart
- alert: PodOOMKilled
  expr: |
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      and on(namespace, pod, container)
        increase(kube_pod_container_status_restarts_total[10m]) > 0
  annotations:
    summary: "{{ $labels.pod }} was OOMKilled, memory limit is probably too low"
# 4. Pod evicted
- alert: PodEvicted
  expr: kube_pod_status_reason{reason="Evicted"} == 1
  annotations:
    summary: "{{ $labels.pod }} was evicted, node is short of resources"
# 5. HPA pinned at maxReplicas
- alert: HPAMaxedOut
  expr: |
    kube_horizontalpodautoscaler_status_current_replicas
      == kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  annotations:
    summary: "{{ $labels.horizontalpodautoscaler }} has reached maxReplicas, raise the ceiling"
# 6. Node resource pressure
- alert: NodeMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
  annotations:
    summary: "Node {{ $labels.node }} is under memory pressure"
# 7. Pending Pods (scheduling failures)
- alert: PodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 10m
  annotations:
    summary: "{{ $labels.pod }} pending > 10min: insufficient resources or conflicting scheduling constraints"
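The PodRequestsTooHigh expression is just a ratio, and plugging in the numbers from the investigation shows where the 0.7 threshold sits. A small sketch, with `overprovision_ratio` as a hypothetical helper name and the usage figures taken from earlier sections:

```python
def overprovision_ratio(requested_cores: float, used_cores: float) -> float:
    """Share of the CPU request that goes unused: (requests - usage) / requests."""
    return (requested_cores - used_cores) / requested_cores


guessed = overprovision_ratio(4.0, 0.3)     # the 4-core request with 0.3 cores used
tuned = overprovision_ratio(0.5, 0.48)      # 500m request against the 480m VPA target

for name, ratio in (("guessed", guessed), ("tuned", tuned)):
    print(f"{name}: {ratio:.1%} unused, alert={ratio > 0.7}")
```

The guessed sizing leaves over 90% of its request unused and would fire the alert after 24 hours; the tuned one is nowhere near the threshold.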
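Results
Metric                        Before                  After
=================================================================================
Cluster CPU utilization       18%                     45%
Cluster memory utilization    62%                     68%
Node count                    120                     75
Pods with no requests         1800+                   0 (LimitRange as backstop)
OOMKilled per day             50+                     < 3
Pods evicted per day          30+                     0
Pending Pods                  Frequent                Resolved by CA adding nodes
Scaling for promotions        Hand-edited replicas    Fully automatic (HPA + CA)
Cost:
- Nodes 120 → 75, 45 machines saved (32 cores / 128 GB each)
- Cloud spend down by about 1.1 million (110w) per year
- Spot instances at a 40% share save a further 30%
Stability (counterintuitively, it got better):
- Anti-affinity means a single node failure hits only a few instances of any service
- HPA scales out at peak, so services no longer buckle under load
- PDB keeps services available during scale-in and upgrades
- With accurate requests, scheduling reflects real resource use and nodes no longer "look full"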
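Pitfall checklist
- Every Pod must set requests/limits, with a LimitRange as backstop against BestEffort
- CPU limits can leave burst room; memory requests ≈ limits (memory is not compressible)
- Derive requests from monitoring history (P50 for CPU, P99 for memory), not guesswork
- Run VPA in Off mode and read its recommendations; letting it apply changes automatically is risky
- Give core services Guaranteed QoS (requests == limits) so they are evicted last
- Configure HPA behavior against flapping: scale out fast, scale in slowly
- Use podAntiAffinity to spread replicas of a service and shrink the blast radius of a failure
- Pair HPA with Cluster Autoscaler so that scaled-out Pods have nodes to land on
- Use a PodDisruptionBudget to keep services available during scale-in and upgrades
- Monitor inflated requests, OOMKilled, and Pending Pods, and keep recalibrating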
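Summary
Kubernetes resource and scheduling governance both saves money and improves stability. Those two goals usually pull against each other, but on Kubernetes they line up. The biggest shift in thinking: low utilization does not mean spare capacity, it means waste plus a distorted view for the scheduler. Inflated requests make the scheduler believe nodes are full when only 18% is really in use, so new Pods have nowhere to go and the only way out is more machines. The first step of governance is always getting requests right. The most underrated point is the fundamental difference between memory and CPU. CPU is compressible: exceed the limit and the container is throttled and slows down. Memory is not: exceed the limit and the container is OOMKilled. That is why memory requests and limits have to stay close together, while CPU limits can be generous with burst room.
The easiest trap is configuring HPA without Cluster Autoscaler. HPA creates the Pods, no node can take them, they pile up as Pending, and elasticity exists in name only. The two have to be deployed together. Finally, the counterintuitive result: after governance we run 45 fewer nodes and stability is better, because anti-affinity stops failures from concentrating, HPA absorbs peaks automatically, and PDB keeps changes from interrupting service. Real stability comes from elasticity and dispersion, not from redundancy built by piling on machines.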
—— 别看了 · 2026