Capacity Elasticity for AI Systems: Tuning Kubernetes VPA and HPA for AI Workloads
Opening Story: Sun Tao's Roller Coaster
In November 2025, Sun Tao, an SRE engineer at a news aggregation platform, rode a roller coaster every single day.
Their AI summarization service generates news summaries for the entire site, and its traffic pattern is highly distinctive:
- 2:00-7:00 AM: trough, around 200 QPS; 4 Pods are enough
- 7:00-9:00 AM: traffic surges from 200 to 3,500 QPS within 15 minutes
- 12:00-2:00 PM: lunchtime peak, 2,800 QPS
- 10:00 PM: the daily breaking-news push, which can spike to 8,000 QPS in an instant
To cover the peak, the team kept 40 Pods online around the clock. Monthly cloud bill: ¥230,000.
Sun Tao did the math: sizing for average load (15 Pods) would save ¥140,000/month; dynamic scaling that provisions "just enough" could theoretically save ¥170,000/month.
But elastic scaling for AI services faces a problem ordinary services don't have: GPU capacity scales far more slowly than traffic grows — a new GPU Pod takes 3-5 minutes from creation to ready, while a traffic spike can arrive within 1 minute.
Sun Tao's solution combined predictive scaling with reactive HPA:
- Predict from historical data and scale out 5 minutes ahead of time
- HPA watches the token consumption rate and tops up capacity reactively
- VPA tunes resource requests automatically to avoid waste
- Karpenter autoscales the node layer, creating GPU nodes on demand
The result: the monthly bill dropped to ¥98,000 (a 57% saving), while P99 latency fell from 1,800 ms to 950 ms — because resources were now sized precisely.
TL;DR
- AI workload traits: bursty traffic, scarce GPU capacity, slow cold starts
- HPA on custom metrics: scale on token consumption rate, not CPU
- VPA: analyze historical usage and auto-tune Pod resource requests/limits
- Predictive scaling: KEDA Cron triggers scale out ahead of known peaks
- Node autoscaling: Karpenter creates and destroys nodes to match workload demand
I. What Makes AI Workloads Different
1.1 Why CPU/memory metrics fail as HPA signals for AI services
Traditional HPA scales on CPU utilization, but for AI services the correlation between CPU and actual load is weak:
Scenario: an AI service calling a remote LLM API
User request → Java app → HTTP call to LLM API (waiting for response)
                                  ↑
                  CPU is near 0% while waiting!
Outcome: CPU = 10%, yet every thread is blocked on IO and new requests all time out.
HPA sees low CPU, refuses to scale out, and the service melts down (a worked example follows the list). The right scaling signals instead are:
- Active request count (number of concurrent in-flight AI requests)
- Request latency (trigger when P95 exceeds a threshold)
- Token consumption rate (directly reflects AI load)
- Queue depth (length of the pending-request queue)
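To see why a CPU target can never rescue this service, recall the formula the HPA controller applies per metric (the numbers below are illustrative):

desiredReplicas = ceil( currentReplicas × currentMetricValue / targetMetricValue )

With 10 replicas, CPU stuck at 10% and a 70% target:
  desiredReplicas = ceil(10 × 10 / 70) = 2   → HPA wants to scale DOWN while requests time out
With the same 10 replicas averaging 80 active requests against a target of 50:
  desiredReplicas = ceil(10 × 80 / 50) = 16  → the custom metric scales out correctly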
1.2 GPU-specific considerations
Resource profile of a GPU service:
├── High per-node concurrency: one GPU can serve N inference requests in parallel
├── Memory-bound: GPU VRAM is the binding constraint (not compute units)
├── Batching gains: inferring N requests together beats N separate runs
└── Slow cold start: GPU initialization + model loading = 3-10 minutes
Implications for the elasticity strategy:
├── Scale out: must be predictive; purely reactive scaling is too slow
└── Scale in: can lag, but avoid frequent restarts (model reloads are expensive)
II. HPA Basics
2.1 Standard CPU-based HPA (not recommended for AI services, but worth knowing as a baseline)
# basic-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-chat-service-hpa
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-chat-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  # Scaling behavior (prevents flapping)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # wait 60s before scaling out
      policies:
        - type: Percent
          value: 100                    # grow by at most 100% at a time (2x current count)
          periodSeconds: 60
        - type: Pods
          value: 5                      # add at most 5 Pods at a time
          periodSeconds: 60
      selectPolicy: Max                 # take whichever policy scales out more
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling in (don't rush for AI services)
      policies:
        - type: Percent
          value: 20                     # shrink by at most 20% at a time
          periodSeconds: 120
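To try this out (assuming the manifest above is saved as basic-hpa.yaml), apply it and watch the controller's decisions:

kubectl apply -f basic-hpa.yaml
kubectl get hpa ai-chat-service-hpa -n ai-services -w    # watch targets vs. current replicas
kubectl describe hpa ai-chat-service-hpa -n ai-services  # events explain each scaling decision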
2.2 Custom-metrics HPA (recommended for AI services)
# custom-metrics-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-chat-service-custom-hpa
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-chat-service
  minReplicas: 3
  maxReplicas: 100
  metrics:
    # Metric 1: active requests (each Pod handles at most 50 concurrent AI requests)
    - type: Pods
      pods:
        metric:
          name: ai_active_requests
        target:
          type: AverageValue
          averageValue: "50"
    # Metric 2: P95 latency (scale out when P95 exceeds 3 seconds)
    - type: Pods
      pods:
        metric:
          name: ai_request_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "3000"
    # Metric 3: token consumption rate (scale out when a Pod burns more than 5,000 tokens/s)
    - type: Pods
      pods:
        metric:
          name: ai_token_consumption_rate
        target:
          type: AverageValue
          averageValue: "5000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 200                    # AI services may scale out fast (up to 3x at once)
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes before scaling in
      policies:
        - type: Percent
          value: 10
          periodSeconds: 300
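With several metrics configured, the HPA computes a desired replica count for each metric independently and acts on the largest; with illustrative numbers:

At 10 replicas:
  active requests: avg 80 vs. target 50    → ceil(10 × 80/50)     = 16
  P95 latency:     avg 3300 vs. 3000       → ceil(10 × 3300/3000) = 11
  token rate:      avg 5500 vs. 5000       → ceil(10 × 5500/5000) = 11
HPA scales to max(16, 11, 11) = 16 replicas.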
III. Reporting Custom Metrics: Spring AI + Prometheus
3.1 Emitting custom metrics from a Spring AI service
// AiMetricsExporter.java
@Component
@Slf4j
public class AiMetricsExporter {

    private final MeterRegistry meterRegistry;
    // Active AI request count (atomic for thread safety)
    private final AtomicInteger activeRequests = new AtomicInteger(0);
    // Token consumption rate (custom sliding-window counter)
    private final SlidingWindowCounter tokenCounter;
    // Latency distribution
    private final Timer requestTimer;

    public AiMetricsExporter(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.tokenCounter = new SlidingWindowCounter(60, TimeUnit.SECONDS);

        // Gauge: current number of in-flight AI requests
        Gauge.builder("ai_active_requests", activeRequests, AtomicInteger::get)
                .description("Number of currently active AI requests")
                .register(meterRegistry);

        // Gauge: tokens consumed per second
        Gauge.builder("ai_token_consumption_rate", tokenCounter,
                        c -> c.getRate(TimeUnit.SECONDS))
                .description("Token consumption rate per second")
                .register(meterRegistry);

        // Timer: request latency with percentile publication
        this.requestTimer = Timer.builder("ai_request_duration_ms")
                .description("AI request latency (milliseconds)")
                .publishPercentiles(0.5, 0.75, 0.95, 0.99)
                .register(meterRegistry);
    }

    // Wrap every ChatClient call with this method
    public <T> T trackRequest(Supplier<T> aiCall, int estimatedTokens) {
        activeRequests.incrementAndGet();
        long startTime = System.currentTimeMillis();
        try {
            return aiCall.get();
        } finally {
            activeRequests.decrementAndGet();
            long duration = System.currentTimeMillis() - startTime;
            requestTimer.record(duration, TimeUnit.MILLISECONDS);
            tokenCounter.add(estimatedTokens);
        }
    }

    // Expose P95 latency for external queries
    public double getP95LatencyMs() {
        for (ValueAtPercentile v : requestTimer.takeSnapshot().percentileValues()) {
            if (v.percentile() == 0.95) {
                return v.value(TimeUnit.MILLISECONDS);
            }
        }
        return -1;
    }
}

// Integration with ChatClient
@Service
public class TrackedChatService {

    private final ChatClient chatClient;
    private final AiMetricsExporter metricsExporter;

    public TrackedChatService(ChatClient chatClient, AiMetricsExporter metricsExporter) {
        this.chatClient = chatClient;
        this.metricsExporter = metricsExporter;
    }

    public String chat(String userMessage) {
        return metricsExporter.trackRequest(() ->
                chatClient.prompt()
                        .user(userMessage)
                        .call()
                        .content(),
                estimateTokens(userMessage));
    }

    private int estimateTokens(String text) {
        // Rough heuristic: assume ~1.5 tokens per character for Chinese text
        // (English averages ~4 characters per token, so this overestimates for English)
        return (int) (text.length() * 1.5);
    }
}
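For Prometheus to scrape these meters, the application needs the Prometheus registry on the classpath (micrometer-registry-prometheus) and the scrape endpoint exposed — a minimal application.yml sketch, assuming Spring Boot Actuator is present:

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # exposes /actuator/prometheus for scraping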
3.2 Prometheus Adapter configuration
The Prometheus Adapter must be configured to translate Prometheus series into Kubernetes custom metrics:
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # Active AI requests
      - seriesQuery: 'ai_active_requests{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "ai_active_requests"
        metricsQuery: 'avg_over_time(ai_active_requests{<<.LabelMatchers>>}[2m])'
      # Token consumption rate
      - seriesQuery: 'ai_token_consumption_rate{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "ai_token_consumption_rate"
        metricsQuery: 'avg_over_time(ai_token_consumption_rate{<<.LabelMatchers>>}[1m])'
      # P95 latency
      - seriesQuery: 'ai_request_duration_ms{namespace!="",pod!="",quantile="0.95"}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)$"
          as: "ai_request_latency_p95_ms"
        metricsQuery: 'ai_request_duration_ms{quantile="0.95",<<.LabelMatchers>>}'
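Once the adapter picks up this config, the metrics should be visible through the custom metrics API; a quick way to verify (jq is optional):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ai-services/pods/*/ai_active_requests" | jq .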
IV. VPA: Automatic Resource Request Tuning
4.1 VPA configuration
# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ai-chat-service-vpa
  namespace: ai-services
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-chat-service
  updatePolicy:
    # Note: "Off" is recommended for AI services (recommendations only, no automatic updates).
    # "Auto" restarts Pods, which is disruptive for online serving.
    updateMode: "Off"    # Off / Initial / Recreate / Auto
  resourcePolicy:
    containerPolicies:
      - containerName: ai-chat-service
        # Floor and ceiling for recommendations
        minAllowed:
          cpu: "500m"
          memory: "512Mi"
        maxAllowed:
          cpu: "4000m"
          memory: "8Gi"
        # Which resources VPA manages
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
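With updateMode: Off, recommendations only appear in the VPA object's status; read them with:

kubectl describe vpa ai-chat-service-vpa -n ai-services
# Look under status.recommendation.containerRecommendations:
#   lowerBound / target / upperBound for cpu and memory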
4.2 Reading and applying VPA recommendations
// VpaRecommendationApplier.java
// Uses the fabric8 Kubernetes client together with its VPA model classes.
@Service
@Slf4j
public class VpaRecommendationApplier {

    private final KubernetesClient kubernetesClient;

    public VpaRecommendationApplier(KubernetesClient kubernetesClient) {
        this.kubernetesClient = kubernetesClient;
    }

    @Scheduled(fixedRate = 3600000)   // check VPA recommendations hourly
    public void applyVpaRecommendations() {
        // Fetch VPA recommendations
        VerticalPodAutoscalerList vpaList = kubernetesClient
                .resources(VerticalPodAutoscaler.class)
                .inNamespace("ai-services")
                .list();

        for (VerticalPodAutoscaler vpa : vpaList.getItems()) {
            RecommendedContainerResources recommendations =
                    vpa.getStatus().getRecommendation().getContainerRecommendations()
                            .stream()
                            .filter(r -> "ai-chat-service".equals(r.getContainerName()))
                            .findFirst()
                            .orElse(null);
            if (recommendations == null) continue;

            Quantity recommendedCpu = recommendations.getTarget().get("cpu");
            Quantity recommendedMemory = recommendations.getTarget().get("memory");
            log.info("VPA recommendation - CPU: {}, Memory: {}", recommendedCpu, recommendedMemory);

            // Apply automatically only during off-peak hours (3-5 AM)
            LocalTime now = LocalTime.now();
            if (now.isAfter(LocalTime.of(3, 0)) &&
                now.isBefore(LocalTime.of(5, 0))) {
                applyRecommendation("ai-chat-service", recommendedCpu, recommendedMemory);
            } else {
                // Otherwise alert and wait for human review
                sendRecommendationAlert(vpa.getMetadata().getName(),
                        recommendedCpu, recommendedMemory);
            }
        }
    }

    private void applyRecommendation(String deploymentName, Quantity cpu, Quantity memory) {
        kubernetesClient.apps().deployments()
                .inNamespace("ai-services")
                .withName(deploymentName)
                .edit(d -> {
                    d.getSpec().getTemplate().getSpec().getContainers()
                            .stream()
                            .filter(c -> deploymentName.equals(c.getName()))
                            .findFirst()
                            .ifPresent(c -> c.getResources().setRequests(Map.of(
                                    "cpu", cpu,
                                    "memory", memory)));
                    return d;
                });
        log.info("Applied VPA recommendation to [{}]: CPU={}, Memory={}",
                deploymentName, cpu, memory);
    }

    private void sendRecommendationAlert(String vpaName, Quantity cpu, Quantity memory) {
        log.warn("VPA [{}] recommends CPU={}, Memory={} - awaiting manual review",
                vpaName, cpu, memory);
    }
}
V. Predictive Scaling: KEDA + Cron
5.1 KEDA (Kubernetes Event-driven Autoscaling)
KEDA extends HPA's capabilities and, among other things, supports Cron-based predictive scaling:
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
# keda-scaled-object.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-chat-service-keda
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-chat-service
  minReplicaCount: 3
  maxReplicaCount: 100
  triggers:
    # Trigger 1: Cron-based predictive scaling (for known traffic patterns)
    - type: cron
      metadata:
        timezone: "Asia/Shanghai"
        # Pre-scale 5 minutes before the morning peak (20 Pods at 6:55 AM)
        start: "55 6 * * 1-5"
        end: "0 10 * * 1-5"
        desiredReplicas: "20"
    - type: cron
      metadata:
        timezone: "Asia/Shanghai"
        # Pre-scale before the lunchtime peak
        start: "55 11 * * 1-5"
        end: "0 14 * * 1-5"
        desiredReplicas: "15"
    - type: cron
      metadata:
        timezone: "Asia/Shanghai"
        # Evening peak
        start: "50 21 * * *"
        end: "30 23 * * *"
        desiredReplicas: "25"
    # Trigger 2: Prometheus custom metric (reactive top-up)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: ai_active_requests
        threshold: "200"     # fire when total active requests exceed 200
        query: |
          sum(ai_active_requests{namespace="ai-services",
              deployment="ai-chat-service"})
    # Trigger 3: Kafka queue depth (if a message queue is in use)
    - type: kafka
      metadata:
        bootstrapServers: kafka.ai-services.svc.cluster.local:9092
        consumerGroup: ai-task-processor
        topic: ai-tasks
        lagThreshold: "100"  # scale out when the backlog exceeds 100 messages
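When several triggers are active at once, KEDA feeds them all into a single HPA, which again acts on the largest result — so the Cron trigger behaves as a floor, not a cap. Illustratively:

7:10 AM, morning-peak window active:
  cron trigger        → 20 replicas (the pre-provisioned floor)
  prometheus trigger  → sum(ai_active_requests) = 5400, threshold 200 → ceil(5400/200) = 27
Effective desired replicas = max(20, 27) = 27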
VI. Node Autoscaling: Karpenter
6.1 Karpenter configuration (AWS)
# karpenter-node-pool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ai-gpu-pool
spec:
  template:
    metadata:
      labels:
        workload-type: ai-inference
    spec:
      nodeClassRef:
        name: gpu-node-class
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer Spot instances to cut cost
        - key: node.kubernetes.io/instance-type
          operator: In
          # smaller GPUs first
          values: ["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge", "g5.2xlarge"]
  # Node disposal policy (idle nodes are consolidated after 30 minutes)
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30m
    expireAfter: 24h    # rotate nodes every 24h (replace Spot nodes before reclamation)
  limits:
    cpu: 1000
    memory: "4000Gi"
    "nvidia.com/gpu": 20    # at most 20 GPUs
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-node-class
spec:
  amiFamily: AL2    # the AL2 GPU-optimized EKS AMI ships with NVIDIA drivers preinstalled
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ai-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "ai-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
  # Karpenter injects the EKS bootstrap itself; only extra customization goes here
  userData: |
    #!/bin/bash
    yum install -y kernel-devel
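Karpenter provisions nodes in response to pending Pods whose requirements match a NodePool; a deployment sketch that would land on ai-gpu-pool (image and names are illustrative):

# gpu-inference-deployment.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ai-services
spec:
  replicas: 2
  selector:
    matchLabels: {app: llm-inference}
  template:
    metadata:
      labels: {app: llm-inference}
    spec:
      nodeSelector:
        workload-type: ai-inference    # matches the NodePool's template labels
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            requests:
              "nvidia.com/gpu": 1      # the pending GPU request is what triggers node provisioning
            limits:
              "nvidia.com/gpu": 1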
VII. Tuning Case Studies
7.1 Finding the optimal HPA target value
// HpaOptimizer.java (analyzes history and recommends an HPA configuration)
@Service
@Slf4j
public class HpaOptimizer {

    private final PrometheusQueryClient prometheusClient;

    public HpaOptimizer(PrometheusQueryClient prometheusClient) {
        this.prometheusClient = prometheusClient;
    }

    public HpaRecommendation analyzeAndRecommend(String deployment, int analyzeDays) {
        LocalDateTime end = LocalDateTime.now();
        LocalDateTime start = end.minusDays(analyzeDays);

        // Query historical active requests and replica counts
        List<MetricPoint> activeRequests = prometheusClient.queryRange(
                "avg(ai_active_requests{deployment=\"" + deployment + "\"})",
                start, end, Duration.ofMinutes(5));
        List<MetricPoint> replicaCount = prometheusClient.queryRange(
                "kube_deployment_spec_replicas{deployment=\"" + deployment + "\"}",
                start, end, Duration.ofMinutes(5));

        // Record the per-Pod request load over time; the goal is to find the "just right"
        // operating points (a production version should also drop samples where P95 latency was unhealthy)
        List<Double> optimalTargets = new ArrayList<>();
        int samples = Math.min(activeRequests.size(), replicaCount.size());
        for (int i = 0; i < samples; i++) {
            MetricPoint requests = activeRequests.get(i);
            MetricPoint replicas = replicaCount.get(i);
            double requestsPerPod = requests.getValue() / replicas.getValue();
            optimalTargets.add(requestsPerPod);
        }

        // Use P85 as the HPA target (avoid being too aggressive)
        DoubleSummaryStatistics stats = optimalTargets.stream()
                .mapToDouble(Double::doubleValue)
                .summaryStatistics();
        double p85 = calculatePercentile(optimalTargets, 85);

        return HpaRecommendation.builder()
                .deployment(deployment)
                .recommendedTargetRequestsPerPod((int) p85)
                .minReplicas(Math.max(3, (int) (stats.getMin() * 0.5)))
                .maxReplicas((int) (stats.getMax() * 1.5))
                .analysis(String.format(
                        "Historical avg requests/Pod: %.1f, P85: %.1f, P99: %.1f",
                        stats.getAverage(), p85, calculatePercentile(optimalTargets, 99)))
                .build();
    }
}
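calculatePercentile is referenced above but never shown; a minimal nearest-rank sketch:

// Nearest-rank percentile over a copy of the samples (percentile given in 0-100)
private double calculatePercentile(List<Double> values, double percentile) {
    if (values.isEmpty()) return 0.0;
    List<Double> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    int rank = (int) Math.ceil(percentile / 100.0 * sorted.size());
    return sorted.get(Math.max(0, rank - 1));
}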
7.2 Handling Spot instance interruptions
// SpotInterruptionHandler.java
@Component
@Slf4j
public class SpotInterruptionHandler {

    private final KubernetesClient k8sClient;
    private final ApplicationEventPublisher eventPublisher;
    private final ConfigurableApplicationContext applicationContext;

    public SpotInterruptionHandler(KubernetesClient k8sClient,
                                   ApplicationEventPublisher eventPublisher,
                                   ConfigurableApplicationContext applicationContext) {
        this.k8sClient = k8sClient;
        this.eventPublisher = eventPublisher;
        this.applicationContext = applicationContext;
    }

    // Poll for the AWS Spot interruption notice via the IMDS metadata service
    @Scheduled(fixedRate = 5000)   // every 5 seconds
    public void checkSpotInterruptionNotice() {
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder()
                                    .uri(URI.create(
                                            "http://169.254.169.254/latest/meta-data/spot/interruption-notice"))
                                    .timeout(Duration.ofSeconds(1))
                                    .build(),
                            HttpResponse.BodyHandlers.ofString());

            // The endpoint returns 404 until an interruption is scheduled
            if (response.statusCode() == 200) {
                log.warn("Spot interruption notice received! Starting graceful shutdown...");
                handleInterruption();
            }
        } catch (Exception e) {
            // The URL being unreachable is the normal case; ignore
        }
    }

    private void handleInterruption() {
        // 1. Stop accepting new requests: relabel the Pod so the Service drops it
        //    (assumes the Service selector matches on draining: "false")
        String podName = System.getenv("HOSTNAME");
        k8sClient.pods()
                .inNamespace("ai-services")
                .withName(podName)
                .edit(pod -> {
                    pod.getMetadata().getLabels().put("draining", "true");
                    return pod;
                });

        // 2. Let in-flight requests finish (AWS gives a 2-minute warning)
        eventPublisher.publishEvent(new SpotInterruptionEvent(this));

        // 3. Trigger graceful shutdown
        SpringApplication.exit(applicationContext, () -> 0);
    }
}
VIII. Monitoring and Alerting
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-scaling-alerts
  namespace: monitoring
spec:
  groups:
    - name: ai-autoscaling
      interval: 30s
      rules:
        # Alert: HPA pinned at max replicas
        - alert: HpaAtMaxReplicas
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
            ==
            kube_horizontalpodautoscaler_spec_max_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "HPA [{{ $labels.horizontalpodautoscaler }}] is at max replicas"
            description: "HPA has reached its maximum of {{ $value }} replicas; consider raising maxReplicas"
        # Alert: scale-up blocked (HPA wants to scale but cannot, e.g. node capacity exhausted)
        - alert: HpaScaleupThrottled
          expr: |
            kube_horizontalpodautoscaler_status_condition{
              condition="AbleToScale",status="false"} == 1
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "HPA [{{ $labels.horizontalpodautoscaler }}] scale-up is blocked"
        # Alert: active AI request backlog
        - alert: AiRequestsBacklog
          expr: |
            sum(ai_active_requests{namespace="ai-services"}) > 500
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "AI service has {{ $value }} active requests backed up"
        # Alert: P95 latency too high
        - alert: AiHighLatency
          expr: |
            ai_request_duration_ms{quantile="0.95"} > 5000
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "AI service P95 latency too high: {{ $value }}ms"
IX. FAQ
Q1: Can HPA and VPA be used together?
A: Do not combine HPA-on-CPU with VPA-managing-CPU — the two will fight each other. Safe combinations:
- HPA: control replica count via custom metrics (active requests / token rate)
- VPA: manage memory only and leave CPU alone; or run with updateMode: Off so it only issues recommendations
Q2: What is a sensible minReplicas for an AI service?
A: As a formula: minReplicas = ceil(normal load × 1.5 / max requests per Pod)
- Also account for rolling deployments (at least 2 replicas keeps releases non-disruptive)
- Recommended floor is 3 (one can restart at any time while two keep serving); worked numbers follow below
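Plugging in illustrative numbers:

Normal load ≈ 500 concurrent requests, each Pod handling at most 50:
  minReplicas = ceil(500 × 1.5 / 50) = 15
A quieter service (≈ 60 concurrent, 50 per Pod):
  minReplicas = ceil(60 × 1.5 / 50) = ceil(1.8) = 2 → round up to the floor of 3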
Q3: How do we cope with slow scale-out caused by model warm-up time?
A:
- Pre-start: use an init container to download/warm the model ahead of time
- Readiness probe: mark the Pod Ready only once the model is fully loaded (see the sketch below)
- Predictive scaling: KEDA Cron scales out 5-10 minutes early
- Pod pool: keep a batch of warm standby Pods and switch them in at peak
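A minimal Pod-template sketch for the first two points; the image names, bucket path, and health endpoint are illustrative:

# excerpt from a Deployment's pod template (illustrative)
spec:
  initContainers:
    - name: model-downloader
      image: amazon/aws-cli:latest
      command: ["aws", "s3", "sync", "s3://models/llm-7b", "/models/llm-7b"]
      volumeMounts:
        - {name: model-cache, mountPath: /models}
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      volumeMounts:
        - {name: model-cache, mountPath: /models}
      readinessProbe:
        httpGet: {path: /health, port: 8000}   # only Ready once the model is loaded
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 30
  volumes:
    - {name: model-cache, emptyDir: {}}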
Q4: Will Spot reclamation hurt the AI service?
A: Handled properly, the impact stays small:
- Set a Pod Disruption Budget (PDB) to guarantee a minimum number of Pods stays available (example below)
- Use the 2-minute Spot interruption warning as a hook to drain requests gracefully
- Mix Spot and On-Demand roughly 3:1 (On-Demand as the stability floor)
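A minimal PDB sketch for the first point:

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-chat-service-pdb
  namespace: ai-services
spec:
  minAvailable: 3            # never let voluntary disruptions drop below 3 Pods
  selector:
    matchLabels:
      app: ai-chat-service   # assumes the Deployment's Pods carry this label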
Q5: How do we diagnose low GPU utilization on a node?
A:
# Check GPU utilization
kubectl exec -it <pod-name> -- nvidia-smi
# If utilization < 50%, likely causes:
# 1. batch size too small → increase vLLM's max_num_seqs
# 2. too little concurrency → check whether HPA has scaled out enough
# 3. the model itself is small → one GPU can host multiple model instances
X. Summary
The maturity path for AI autoscaling:
| Stage | What to implement | Effect |
|---|---|---|
| Baseline | CPU/Memory HPA | Prevents OOM, but unfit for AI |
| Intermediate | Custom-metrics HPA | Responds accurately to AI load |
| Advanced | KEDA Cron predictive scaling | Scales out early, eliminating peak latency |
| Optimal | VPA + Karpenter | Maximizes resource utilization |
Sun Tao's case shows that elasticity is not just about saving money; it is about spending the right resources on the right work. A 57% cost cut paired with a better user experience — that is real engineering optimization.
