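Out of the box, the Kubernetes HPA scales on CPU and memory, but what actually drives our scaling decisions is queue backlog, order volume, and QPS. One of our message-processing services used to scale on a cron schedule: down to 5 replicas overnight, up to 50 during the day. Real load only loosely follows the clock, so we often scaled up too late or scaled down too early. After adopting KEDA we switched to scaling on Kafka lag in real time; resource usage dropped 35% and our SLA actually improved. This post records the whole KEDA rollout.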
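Limits of HPA
The native HPA (built into K8s) can only scale on:
- CPU (average utilization)
- Memory
- Custom metrics (requires a custom metrics adapter such as prometheus-adapter; metrics-server only covers CPU and memory)
But what the business actually needs is often:
- Scale consumers on Kafka topic lag
- Scale workers on RabbitMQ queue length
- Scale on Redis stream length
- Scale on custom Prometheus metrics (QPS / error rate / queue depth)
- Scale on database connection count
- Scale on a schedule (cron mode)
- Scale on cloud queues such as AWS SQS / Azure Service Bus / GCP PubSub
Wiring a custom metrics adapter into HPA is tedious, and HPA supports neither cron schedules nor, by default, scale-to-zero.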
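What KEDA is
KEDA = Kubernetes-based Event-Driven Autoscaling
- Scales Pods on event sources and external metrics
- Ships 60+ scalers (Kafka / RabbitMQ / Redis / Prometheus / Postgres / many cloud services)
- Supports scale-to-zero (scale down to 0 when there is nothing to do)
- Builds on top of HPA rather than replacing it
- A CNCF Graduated project, production-ready
How it works:
1. The KEDA Operator watches ScaledObject CRs
2. It calls the scaler to query the event source's metric
3. It creates or updates an HPA (External Metric)
4. The HPA decides the Pod count from 1 upward; KEDA itself handles activation between 0 and 1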
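To make the workflow above concrete, here is a minimal Python sketch of the two decisions involved: KEDA deciding whether a trigger is active at all, and the HPA turning an AverageValue target into a replica count. The function name and parameters are our own illustration, not KEDA's API, and the model deliberately ignores cooldownPeriod, the HPA tolerance, and stabilization windows.

```python
import math


def desired_replicas(metric_value, threshold, min_replicas, max_replicas,
                     activation_threshold=0.0, idle_replicas=None):
    """Rough model of the KEDA + HPA decision for one AverageValue trigger."""
    # Phase 1 (KEDA): the trigger counts as active only above the activation threshold.
    if metric_value <= activation_threshold:
        # Inactive: fall back to idle_replicas if it is set, otherwise to min_replicas.
        return idle_replicas if idle_replicas is not None else min_replicas
    # Phase 2 (HPA): an AverageValue target means total metric / per-Pod target.
    wanted = math.ceil(metric_value / threshold)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for lag in (0, 50, 12_500, 100_000):
        print(lag, desired_replicas(lag, 1000, 2, 50, idle_replicas=0))
```

With a per-Pod target of 1000, bounds of 2 and 50, and an idle count of 0, a lag of 0 maps to 0 replicas, a small non-zero lag maps to the minimum of 2, and a very large lag is capped at 50.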
Supported modes:
- Scaling regular Deployments / StatefulSets
- ScaledJob (scales Jobs instead of a Deployment)
- Combining multiple scalers (OR logic)
Deploying KEDA
# Install with Helm
$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update
$ helm install keda kedacore/keda \
    --namespace keda --create-namespace \
    --version 2.13.0 \
    --set prometheus.metricServer.enabled=true \
    --set prometheus.operator.enabled=true
# Verify
$ kubectl -n keda get pods
NAME READY STATUS
keda-operator-78b... 1/1 Running
keda-operator-metrics-apiserver-67c... 1/1 Running
keda-admission-webhooks-5b... 1/1 Running
# CRDs
$ kubectl get crd | grep keda
scaledobjects.keda.sh
scaledjobs.keda.sh
triggerauthentications.keda.sh
clustertriggerauthentications.keda.sh
Scenario 1: Scaling consumers on Kafka lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer
  namespace: business
spec:
  scaleTargetRef:
    name: order-consumer            # the Deployment to scale
  minReplicaCount: 2                # at least 2 while a trigger is active
  maxReplicaCount: 50               # at most 50
  pollingInterval: 15               # check lag every 15 seconds
  cooldownPeriod: 300               # wait 5 minutes after the last active trigger before dropping to idleReplicaCount (N → 2 is paced by the HPA)
  idleReplicaCount: 0               # scale to 0 when lag is 0 (optional)
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-cluster:9092
      consumerGroup: order-consumer-cg
      topic: orders.events
      lagThreshold: "1000"          # target of 1000 messages of lag per Pod
      offsetResetPolicy: latest
      allowIdleConsumers: "false"   # replicas are capped at the topic's partition count
      scaleToZeroOnInvalidOffset: "false"
    authenticationRef:
      name: kafka-auth              # reuses the Secret
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: business
spec:
  secretTargetRef:
  - parameter: sasl
    name: kafka-secret
    key: sasl
  - parameter: username
    name: kafka-secret
    key: username
  - parameter: password
    name: kafka-secret
    key: password
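Results:
- Overnight, lag ≈ 50: 2 Pods handle it (lag is non-zero, so the idle count of 0 does not kick in)
- Daytime, lag spikes to 100,000: 100,000 / 1,000 = 100 desired replicas, capped at maxReplicaCount, so it scales out to 50 Pods
- Once the backlog is drained, the HPA scales back to 2 after its 5-minute scale-down stabilization window
KEDA reacts more sharply than cron:
- cron guesses load from the time of day
- KEDA decides from the real metric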
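The numbers above can be checked with a small sketch of how the Kafka trigger's target translates into replicas. It assumes the lagThreshold of 1000 and the 2 to 50 replica bounds from the ScaledObject; the partition counts (64 and 24) are hypothetical, since the article does not state how many partitions orders.events has. The partition cap reflects the Kafka scaler's documented behavior when allowIdleConsumers is false; the function itself is our own illustration.

```python
import math


def kafka_replicas(partition_lags, lag_threshold, min_replicas, max_replicas,
                   allow_idle_consumers=False):
    """Estimate the replica count for a Kafka lag trigger.

    partition_lags: consumer-group lag per partition (hypothetical numbers).
    """
    total_lag = sum(partition_lags)
    wanted = math.ceil(total_lag / lag_threshold)
    if not allow_idle_consumers:
        # Extra consumers beyond the partition count would sit idle, so cap there.
        wanted = min(wanted, len(partition_lags))
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    night = [2] * 25             # total lag 50
    day_wide = [1563] * 64       # about 100,000 lag over 64 partitions
    day_narrow = [4167] * 24     # about 100,000 lag over 24 partitions
    print(kafka_replicas(night, 1000, 2, 50))
    print(kafka_replicas(day_wide, 1000, 2, 50))
    print(kafka_replicas(day_narrow, 1000, 2, 50))
```

The third case is the one to watch: with only 24 partitions the same backlog tops out at 24 consumers, so maxReplicaCount: 50 is only reachable if the topic has at least 50 partitions.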
Scenario 2: Scaling on Prometheus metrics (custom)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
  # Trigger 1: target of 5000 QPS per Pod (total QPS is divided across replicas)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      threshold: '5000'
      query: |
        sum(rate(http_requests_total{job="api-gateway"}[1m]))
  # Trigger 2: p99 latency above 500ms
  - type: prometheus
    metricType: Value   # compare the raw latency; the default AverageValue would treat it as a per-Pod share
    metadata:
      serverAddress: http://prometheus:9090
      threshold: '0.5'  # 500ms = 0.5s
      query: |
        histogram_quantile(0.99,
          sum by (le) (rate(http_request_duration_seconds_bucket{job="api-gateway"}[1m]))
        )
# Multiple triggers combine with OR logic:
# whichever trigger asks for the most replicas wins
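A short sketch of how the two triggers combine, assuming the standard HPA formulas: an AverageValue target divides the total by the per-Pod target, a Value target scales the current replica count by value / target, and the HPA then takes the largest proposal. A latency trigger only behaves this way when it is compared as a raw value (KEDA's metricType: Value). The sample readings (42,000 QPS, 0.8s p99, 10 current replicas) are made up for illustration, and the HPA's 10% tolerance is ignored.

```python
import math


def average_value_replicas(total, target_per_pod):
    """AverageValue target: total metric divided by the per-Pod target."""
    return math.ceil(total / target_per_pod)


def value_replicas(current_replicas, value, target):
    """Value target: scale the current replica count by value / target."""
    return math.ceil(current_replicas * value / target)


def combine(proposals, min_replicas, max_replicas):
    """The HPA takes the largest proposal, then clamps it to the bounds."""
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    by_qps = average_value_replicas(42_000, 5000)   # QPS trigger
    by_latency = value_replicas(10, 0.8, 0.5)       # p99 trigger with 10 Pods running
    print(by_qps, by_latency, combine([by_qps, by_latency], 3, 30))
```

Here the QPS trigger alone would settle for 9 Pods, but the latency trigger asks for 16, so 16 is what the Deployment gets.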
Scenario 3: Scaling on a RabbitMQ queue
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker
spec:
  scaleTargetRef:
    name: email-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  pollingInterval: 10
  cooldownPeriod: 120
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: email.send
      mode: QueueLength
      value: "100"        # target of 100 queued messages per Pod
    authenticationRef:
      name: rabbitmq-auth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-secret
    key: amqp_uri
Scenario 4: Cron scaling (up by day, down by night)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-job-cron
spec:
  scaleTargetRef:
    name: batch-job
  triggers:
  # Weekdays 09:00-19:00: scale to 20
  - type: cron
    metadata:
      timezone: Asia/Shanghai
      start: "0 9 * * 1-5"
      end: "0 19 * * 1-5"
      desiredReplicas: "20"
  # Weekends 10:00-22:00: scale to 10
  - type: cron
    metadata:
      timezone: Asia/Shanghai
      start: "0 10 * * 6,0"
      end: "0 22 * * 6,0"
      desiredReplicas: "10"
  # Kafka lag is always evaluated: inside a cron window the cron count acts as a floor,
  # outside the windows lag alone decides
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: batch-cg
      topic: batch.jobs
      lagThreshold: "500"
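The interaction between the cron windows and the Kafka trigger is easy to misread, so here is a simplified model of it. It hard-codes the two windows from the YAML instead of parsing cron expressions, assumes the timestamp is already in Asia/Shanghai, and uses KEDA's default maxReplicaCount of 100 because the ScaledObject does not set one. The function names are ours, not KEDA's.

```python
import math
from datetime import datetime

# (cron weekdays with 0 = Sunday, start hour, end hour, desired replicas)
CRON_WINDOWS = [
    ({1, 2, 3, 4, 5}, 9, 19, 20),   # weekdays 09:00-19:00
    ({6, 0}, 10, 22, 10),           # weekends 10:00-22:00
]


def cron_replicas(now):
    """Replica floor requested by the cron triggers at this moment (0 outside windows)."""
    cron_dow = (now.weekday() + 1) % 7   # Python: Monday = 0; cron: Sunday = 0
    for days, start_hour, end_hour, desired in CRON_WINDOWS:
        if cron_dow in days and start_hour <= now.hour < end_hour:
            return desired
    return 0


def batch_replicas(now, lag, lag_threshold=500, max_replicas=100):
    """The largest request among the cron and Kafka triggers wins."""
    by_lag = math.ceil(lag / lag_threshold)
    return min(max_replicas, max(cron_replicas(now), by_lag))


if __name__ == "__main__":
    monday_10 = datetime(2026, 1, 5, 10, 0)
    print(batch_replicas(monday_10, lag=1_000))    # cron floor applies
    print(batch_replicas(monday_10, lag=20_000))   # lag pushes above the floor
    print(batch_replicas(datetime(2026, 1, 5, 23, 0), lag=1_500))  # outside the window
```

The takeaway is that cron never caps the Kafka trigger: a backlog during working hours can still push the Deployment past 20.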
Scenario 5: ScaledJob (scaling Jobs)
# Good for one-off tasks: each message starts a Pod, which is destroyed once it finishes
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-process-job
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    backoffLimit: 3
    template:
      spec:
        containers:
        - name: worker
          image: image-processor:latest
          resources:
            requests: {cpu: 500m, memory: 512Mi}
        restartPolicy: Never
  pollingInterval: 10
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  maxReplicaCount: 100
  scalingStrategy:
    strategy: "default"   # or "accurate" / "custom"
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../image-queue
      queueLength: "1"    # one Job per message
      awsRegion: us-east-1
    authenticationRef:
      name: aws-auth
HPA behavior, for comparison
# Plain HPA (before KEDA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-consumer
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
# Problem: a message consumer often sits at 100% CPU (batch processing) even though the backlog drains quickly
# CPU does not reflect the real load
The HPA that KEDA generates
# After you create a ScaledObject, KEDA generates the matching HPA
$ kubectl get hpa -n business
NAME                      REFERENCE                   TARGETS
keda-hpa-order-consumer   Deployment/order-consumer   1450/1000 (avg)
# Internally it uses the External Metrics API
$ kubectl get apiservice v1beta1.external.metrics.k8s.io
v1beta1.external.metrics.k8s.io   keda/keda-operator-metrics-apiserver
# Check that the External Metrics API responds (a single metric lives under /namespaces/business/<metric-name>)
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
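Scale-to-zero caveats
Scale-to-zero is a good fit for:
✓ Message-driven workloads: Kafka / RabbitMQ / SQS / Service Bus
✓ Genuinely idle periods with no work (overnight)
✗ HTTP services (starting a Pod only after a request arrives is too slow; Knative is the better fit)
Notes:
1. Cold-start latency: Pod startup + readiness typically takes 30s-2min
2. Set minReplicaCount: 0 and leave idleReplicaCount unset (idleReplicaCount must be lower than minReplicaCount, so it only applies when minReplicaCount is 1 or more)
3. KEDA triggers the scale-out once it sees the first message at its next poll
4. Best for background work that is not latency-sensitive
Settings:
spec:
  minReplicaCount: 0    # scale to 0
  cooldownPeriod: 300   # scale to 0 only after 5 minutes with no messages
  triggers: [...]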
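To put a number on "not latency-sensitive", here is a back-of-the-envelope estimate of how long the first message waits when the workload sits at 0 replicas. It assumes KEDA only notices new work at its next poll and that the 30s-2min figure above covers scheduling, image pull, and readiness; the function is our own illustration and ignores the consumer's own connect and rebalance time.

```python
def first_message_delay(polling_interval_s, pod_startup_s):
    """Return (best, worst) seconds before the first message is picked up from 0 replicas."""
    best = pod_startup_s                        # message lands right before a poll
    worst = polling_interval_s + pod_startup_s  # message lands right after a poll
    return best, worst


if __name__ == "__main__":
    for startup in (30, 120):
        print(startup, first_message_delay(15, startup))
```

With pollingInterval: 15 that is roughly 30 to 45 seconds for a fast-starting Pod and up to about 135 seconds for a slow one, which is fine for batch work and clearly too slow for interactive traffic.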
Pitfalls in practice
Pitfall 1: Scaling flapping
# Don't set lagThreshold too low
# With a threshold of 100 and lag hovering between 95 and 105, Pods keep scaling up and down
# Fixes:
# 1. Leave headroom in lagThreshold (60-70% of real per-Pod capacity)
# 2. Lengthen cooldownPeriod (300-600s)
# 3. Add stabilization through the HPA behavior
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 50
          periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 0   # scale out immediately
        policies:
        - type: Percent
          value: 100
          periodSeconds: 30
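The scaleDown settings above are easier to reason about with a small simulation. This sketch models one 60-second step: the stabilization window keeps the highest recommendation seen in the last 300 seconds, and the Percent policy then limits how many Pods can be removed in that period. It is a simplification of the HPA's documented behavior, written by us for illustration, and it ignores the scale-up side.

```python
import math


def scale_down_step(current, recent_recommendations, max_percent=50):
    """One 60s scale-down step under a Percent policy.

    recent_recommendations: desired replica counts computed during the
    stabilization window (300s here), in any order.
    """
    # Stabilization: the highest recent recommendation wins, which damps flapping.
    stabilized = max(recent_recommendations)
    if stabilized >= current:
        return current   # nothing to scale down in this step
    # Policy: remove at most max_percent of the current replicas per period.
    floor = math.ceil(current * (1 - max_percent / 100))
    return max(stabilized, floor)


if __name__ == "__main__":
    replicas = 50
    history = [replicas]
    while replicas > 2:
        replicas = scale_down_step(replicas, [2, 2, 2, 2, 2])
        history.append(replicas)
    print(history)
```

Once every recommendation in the window has dropped to 2, the Deployment still takes five one-minute steps to get from 50 down to 2, and a single high recommendation inside the window postpones the scale-down entirely.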
Pitfall 2: Leaking credentials
# Don't put Kafka / DB passwords in the ScaledObject
# A ScaledObject is namespaced, but its trigger metadata is stored in plain text
# The right way: TriggerAuthentication + Secret
apiVersion: v1
kind: Secret
metadata:
  name: kafka-secret
  namespace: business
type: Opaque
stringData:
  sasl: "scram_sha512"
  username: "consumer-user"
  password: "${KAFKA_PASSWORD}"   # placeholder: substitute it in your pipeline, kubectl will not expand it
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: business             # must be in the same namespace as the ScaledObject
spec:
  secretTargetRef:
  - parameter: sasl
    name: kafka-secret
    key: sasl
  - parameter: username
    name: kafka-secret
    key: username
  - parameter: password
    name: kafka-secret
    key: password
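Pitfall 3: Prometheus query timeouts
# KEDA queries Prometheus on every poll; a heavy query that outruns KEDA's HTTP timeout (3s by default) fails
# Fixes:
# 1. Pre-aggregate with a recording rule
# 2. Raise the timeout (KEDA_HTTP_DEFAULT_TIMEOUT on the operator, in milliseconds)
- type: prometheus
  metadata:
    serverAddress: http://prometheus:9090
    query: |
      job:http_requests:rate1m{job="api-gateway"}  # pre-aggregated metric
    threshold: '5000'
    customHeaders: 'X-Custom-Auth=...'   # comma-separated key=value pairs, not JSON
    unsafeSsl: 'false'
    activationThreshold: '100'           # the trigger only counts as active above 100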
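Pitfall 4: Multiple triggers fighting each other
# Multiple triggers are OR (any one can drive a scale-out), not AND
# For example:
# trigger 1: lag > 1000 → wants 30 replicas
# trigger 2: QPS > 5000 → wants 20 replicas
# Both firing: the larger value, 30, wins; the two are not added
# For a compound condition, write it as a single Prometheus query:
sum(kafka_consumer_lag) > 1000 and sum(http_qps) > 5000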
Monitoring KEDA itself
# Have Prometheus scrape KEDA's metrics
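apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keda-operator
  namespace: keda
spec:
  selector:
    matchLabels:
      # Scaler metrics are served by the operator since KEDA 2.9;
      # confirm the label with: kubectl -n keda get svc --show-labels
      app.kubernetes.io/name: keda-operator
  endpoints:
  - port: metrics
    interval: 30s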
# Key metrics
keda_scaler_metrics_value     # current value of each scaler
keda_scaler_metrics_latency   # scaler query latency
keda_scaler_errors_total      # scaler failure count
keda_scaledobject_paused      # paused ScaledObjects
# Alerting
- alert: KEDAScalerError
  expr: rate(keda_scaler_errors_total[5m]) > 0.1
  labels: { severity: warning }
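Results after rollout
Service: order-consumer (Kafka consumer)
Cron era:
- 5 Pods overnight, 50 Pods during the day
- Traffic bursts needed manual intervention
- Average resource usage: 30 Pods
KEDA era:
- 1-3 Pods overnight
- 10-40 Pods during the day (driven by lag)
- Bursts scale out to 50 within 1 minute
- Average resource usage: 18 Pods
Cost savings:
- Resources: -40% (measured in Pod × hours)
- Can scale to 0 overnight (saves about 200 CNY per month)
- Burst response time: 5min → 1min
SLA:
- Message processing latency p99: 12min → 2min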
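For anyone who wants to redo the arithmetic behind the -40% figure, it follows directly from the average Pod counts above. The sketch below assumes both averages cover the same 24-hour day; the helper names are ours.

```python
def pod_hours(avg_pods, hours=24):
    """Pod-hours consumed over a period at a given average replica count."""
    return avg_pods * hours


def reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    cron_era = pod_hours(30)   # average Pod count under cron scaling
    keda_era = pod_hours(18)   # average Pod count under KEDA
    print(cron_era, keda_era, f"{reduction(cron_era, keda_era):.0%}")
```

That works out to 720 versus 432 Pod-hours per day for this one service.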
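Pitfall checklist
- Don't combine minReplicaCount: 0 with latency-sensitive services
- Set lagThreshold / queueLength to about 70% of what a single Pod can actually process
- Don't make cooldownPeriod too short, or scaling will flap
- Multiple triggers are OR logic; for AND, use a compound Prometheus query
- Authenticate with TriggerAuthentication + Secret, never plain text
- Smooth scaling with a stabilizationWindow in the HPA behavior
- Monitor KEDA's own metrics and alert on scaler errors
- Before upgrading KEDA, check whether the CRDs changed (there have been breaking changes within 2.x)
- ScaledJob suits one-off tasks; ScaledObject suits long-running Deployments
- Scale-to-zero pairs well with Knative (HTTP services can then go 0 → 1 too)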
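Summary
KEDA upgrades K8s autoscaling from "guessing load from CPU" to "deciding directly from the business event source". You write one ScaledObject YAML, KEDA takes over generating the HPA, and the application itself needs no changes. Six months in, 15 of our services run on KEDA, with 35% lower cost and a better SLA. The biggest shift in thinking: autoscaling is not an infrastructure concern, it is a business concern. Order consumers scale on lag, APIs scale on QPS plus latency, scheduled reports scale on cron. KEDA turns "scale on real load" from an idea into YAML. Its CNCF Graduated status means you can run it in production with confidence.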
—— 别看了 · 2026