Deploying AI Applications on Kubernetes: A Complete Practical Guide to Cloud-Native AI Services
The story: one machine can't cope, a hundred would be a waste
In early 2025, Wang Lei, a backend engineer at an internet company, was maintaining its AI content-recommendation service. Day to day the traffic was predictable, but it swung hard across the day:
Daytime peak (10:00-22:00): roughly 1,200 QPS, requiring a 16-core / 64 GB server to stay stable.
Overnight trough (00:00-06:00): roughly 40 QPS, leaving about 97% of that 16-core / 64 GB machine idle.
Initially the company ran 2 fixed servers (for high availability) at about ¥3,500 per machine per month, or ¥84,000 a year.
The company's architect, Lao Zhou, suggested moving to K8s HPA (horizontal pod autoscaling). Wang Lei spent two weeks on the migration. The results:
- Peak hours: automatically scaled out to 12 Pods; the service stayed stable
- Off-peak: automatically scaled in to 2 Pods, releasing resources
- Annualized machine-cost savings: about ¥56,000 (a 67% reduction)
- Side benefits: zero-downtime deployments, self-healing on failure, resource isolation
That is the core value of K8s for AI workloads: use what you need, scale elastically, and stop paying for idle resources.
In this post I walk through the complete K8s deployment stack, step by step, from the Dockerfile all the way to HPA.
1. Overall Deployment Architecture
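In one picture, the setup the rest of this post assembles (a text sketch; every component in it is configured in the sections below):

          HTTPS (ai-api.company.com)
client --> Nginx Ingress --> Service (ClusterIP) --> Deployment "ai-service"
                                                     (2-20 Pods, resized by the HPA)
                                                       |  env from ConfigMap / Secret
                                                       |--> MySQL / Redis (in-cluster Services)
                                                       |--> Milvus StatefulSet + SSD PVC
                                                       '--> external LLM API (HTTPS egress only)
Prometheus <-- /actuator/prometheus   (feeds the HPA custom metrics and the alert rules)

Stateless application Pods scale horizontally; the vector database keeps its state in a PVC and is deliberately kept out of the autoscaling loop.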
2. Dockerfile Best Practices for a Spring AI Application
Image size directly affects deployment speed and storage cost. A poorly optimized Spring Boot image can reach 1.5 GB; optimized, it fits under 200 MB.
# Dockerfile - multi-stage build, optimized for image size
# ============ Build stage ============
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /build
# Copy pom.xml first to exploit Docker layer caching (dependencies are not re-downloaded unless they change)
COPY pom.xml .
COPY .mvn/ .mvn/
COPY mvnw .
# Download dependencies (a separate, rarely changing layer)
RUN ./mvnw dependency:go-offline -B
# Copy sources and build
COPY src/ src/
RUN ./mvnw package -DskipTests -B
# Extract the JAR into layers (Spring Boot layered JAR)
RUN java -Djarmode=layertools -jar target/*.jar extract
# ============ Runtime stage ============
FROM eclipse-temurin:21-jre-alpine
# Security: do not run as root
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
# Copy layers ordered from least to most frequently changing (maximizes cache hits)
COPY --from=builder /build/dependencies/ ./
COPY --from=builder /build/spring-boot-loader/ ./
COPY --from=builder /build/snapshot-dependencies/ ./
COPY --from=builder /build/application/ ./
# Security: switch to the non-root user
USER appuser
# JVM tuning (container-aware). The active Spring profile comes from the
# SPRING_PROFILES_ACTIVE env var set in the Deployment; Spring Boot reads it
# natively, so no -Dspring.profiles.active flag is needed here.
ENV JAVA_OPTS="\
-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
-XX:InitialRAMPercentage=50.0 \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-Djava.security.egd=file:/dev/./urandom"
EXPOSE 8080
# Health check (note: Kubernetes ignores Docker's HEALTHCHECK and uses probes
# instead; this is kept for local `docker run` debugging)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD wget -q --spider http://localhost:8080/actuator/health || exit 1
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS org.springframework.boot.loader.launch.JarLauncher"]

Image size comparison:
| Build approach | Image size | Notes |
|---|---|---|
| Single-stage (JDK + fat JAR) | ~680 MB | Ships the full JDK and build tooling |
| Multi-stage (JRE + fat JAR) | ~320 MB | JRE only, no build tooling |
| Multi-stage (JRE + layered JAR) | ~240 MB | Layering improves caching; slightly smaller |
| Custom JRE via jlink | ~180 MB | Bundles only the JDK modules you need |
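The jlink row deserves a sketch. A minimal version of that variant under stated assumptions: the module list below is illustrative, not derived from the actual application; generate the real one with `jdeps --ignore-missing-deps --print-module-deps target/*.jar`.

# Extra stage: build a trimmed runtime with jlink
FROM eclipse-temurin:21-jdk-alpine AS jre-builder
RUN jlink \
    --add-modules java.base,java.logging,java.naming,java.sql,java.management,java.instrument,jdk.unsupported \
    --strip-debug --no-man-pages --no-header-files --compress=zip-6 \
    --output /custom-jre

# Minimal runtime stage using the custom JRE instead of the temurin JRE image
FROM alpine:3.19
COPY --from=jre-builder /custom-jre /opt/jre
ENV PATH="/opt/jre/bin:${PATH}"
# ...then copy the layered application exactly as in the Dockerfile above

The trade-off: a smaller attack surface and image, at the cost of maintaining the module list whenever dependencies change.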
# Build and push the image
docker build -t registry.company.com/ai-service:v1.2.3 .
docker push registry.company.com/ai-service:v1.2.3
# Scan the image for known vulnerabilities
trivy image registry.company.com/ai-service:v1.2.3

3. Kubernetes Deployment Configuration
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-service
  namespace: ai-platform
  labels:
    app: ai-service
    version: v1.2.3
    team: platform
spec:
  replicas: 3  # initial replica count; the HPA adjusts it dynamically
  selector:
    matchLabels:
      app: ai-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod during a rolling update
      maxUnavailable: 0  # no Pod may become unavailable during the update (zero downtime)
  template:
    metadata:
      labels:
        app: ai-service
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      # Graceful termination: give the app 60 seconds to finish in-flight requests
      terminationGracePeriodSeconds: 60
      # Pod anti-affinity: spread this service's Pods across different nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["ai-service"]
              topologyKey: kubernetes.io/hostname
      containers:
      - name: ai-service
        image: registry.company.com/ai-service:v1.2.3
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
        # Environment variables (non-sensitive)
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        # Bulk-load configuration from the ConfigMap...
        envFrom:
        - configMapRef:
            name: ai-service-config
        # ...and sensitive values from the Secret
        - secretRef:
            name: ai-service-secrets
        # Resource limits (critical: a Pod without limits can starve other services)
        resources:
          requests:
            cpu: "500m"    # request 0.5 CPU core (used for scheduling)
            memory: "1Gi"  # request 1 GiB of memory
          limits:
            cpu: "2000m"   # cap at 2 CPU cores
            memory: "3Gi"  # cap at 3 GiB (JVM + AI model memory)
        # Readiness probe: the Pod only receives traffic once ready
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30  # start checking 30s after startup
          periodSeconds: 10        # check every 10s
          failureThreshold: 3      # 3 consecutive failures mark the Pod not-ready
          successThreshold: 1
        # Liveness probe: restart the Pod if it stops responding
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 60  # start checking after 60s (AI services start slowly)
          periodSeconds: 30
          failureThreshold: 3
          successThreshold: 1
        # Startup probe: runs only during startup (prevents a slow start from being killed)
        startupProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          failureThreshold: 20  # allows 20 x 15 = 300s of startup time
          periodSeconds: 15
        # Graceful shutdown (works together with terminationGracePeriodSeconds)
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # wait for load balancers to deregister the Pod
      # Registry pull credentials
      imagePullSecrets:
      - name: registry-secret

4. Managing ConfigMaps and Secrets
# configmap.yaml - non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-service-config
  namespace: ai-platform
data:
  # Database
  SPRING_DATASOURCE_URL: "jdbc:mysql://mysql-service:3306/aiplatform?useSSL=true"
  SPRING_DATASOURCE_USERNAME: "aiapp"
  # Redis
  SPRING_DATA_REDIS_HOST: "redis-service"
  SPRING_DATA_REDIS_PORT: "6379"
  # AI service settings (non-sensitive)
  SPRING_AI_OPENAI_BASE_URL: "https://api.openai.com"
  AI_SERVICE_MAX_TOKENS: "2048"
  AI_SERVICE_TEMPERATURE: "0.7"
  # Vector database
  MILVUS_HOST: "milvus-service"
  MILVUS_PORT: "19530"
  # Application settings
  SERVER_TOMCAT_MAX_THREADS: "200"
  MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE: "health,info,prometheus,metrics"

# secret.yaml - sensitive values (in real production, prefer Vault or your cloud provider's KMS)
# Note: Secret values must be base64-encoded
apiVersion: v1
kind: Secret
metadata:
  name: ai-service-secrets
  namespace: ai-platform
type: Opaque
data:
  # echo -n "your-api-key" | base64
  OPENAI_API_KEY: <base64-encoded API key>
  SPRING_DATASOURCE_PASSWORD: <base64-encoded database password>
  SPRING_DATA_REDIS_PASSWORD: <base64-encoded Redis password>
  JWT_SECRET: <base64-encoded JWT secret>

// The matching Spring Boot configuration (values are read from environment variables automatically)
@Configuration
@ConfigurationProperties(prefix = "ai.service")
@Data
public class AIServiceConfig {
    // Injected automatically from the AI_SERVICE_MAX_TOKENS env var
    private int maxTokens = 2048;
    private double temperature = 0.7;
    private String model = "gpt-4o";
    // The OpenAI API key is injected from the OPENAI_API_KEY env var;
    // in application.yml:
    //   spring.ai.openai.api-key: ${OPENAI_API_KEY}
}

// Health-check contribution (consumed by the K8s probes)
@Component
public class AIServiceHealthIndicator implements HealthIndicator {
    @Autowired
    private ChatClient chatClient;

    @Override
    public Health health() {
        try {
            // Lightweight wiring check (verifies configuration only; does not call the LLM)
            if (chatClient != null) {
                return Health.up()
                        .withDetail("llm", "connected")
                        .withDetail("timestamp", System.currentTimeMillis())
                        .build();
            }
            return Health.down().withDetail("reason", "LLM client not initialized").build();
        } catch (Exception e) {
            return Health.down().withException(e).build();
        }
    }
}

5. Elastic Scaling with HPA
# hpa.yaml - autoscaling on CPU plus custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2   # at least 2 Pods (high availability)
  maxReplicas: 20  # at most 20 Pods
  metrics:
  # Metric 1: CPU utilization (the most common trigger)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # scale out above 70% CPU
  # Metric 2: memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Metric 3: custom metric - HTTP request queue depth (requires prometheus-adapter)
  - type: Pods
    pods:
      metric:
        name: http_requests_waiting  # custom metric from Prometheus
      target:
        type: AverageValue
        averageValue: "50"  # scale out when the per-Pod queue exceeds 50 requests
  # Scaling behavior (prevents thrashing)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # scale out immediately (fast reaction to traffic spikes)
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60  # at most +100% (doubling) per 60s
      - type: Pods
        value: 4
        periodSeconds: 60  # at most +4 Pods per 60s
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute stabilization window (avoids flapping)
      policies:
      - type: Percent
        value: 20
        periodSeconds: 60  # at most -20% per 60s (scale in slowly)

# A second HPA keyed on LLM call latency (often a better fit for AI services)
# Requires prometheus-adapter to be installed
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa-latency
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Object
    object:
      metric:
        name: ai_llm_response_time_p99_seconds
      describedObject:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-service
      target:
        type: Value
        value: "5"  # scale out when P99 latency exceeds 5 seconds

6. Persistent Storage for the Vector Database (PVC)
# pvc.yaml - storage for the Milvus vector database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-data-pvc
  namespace: ai-platform
spec:
  accessModes:
  - ReadWriteOnce           # single-node read-write (standalone Milvus)
  storageClassName: ssd-sc  # SSD storage class (vector search is I/O-intensive)
  resources:
    requests:
      storage: 100Gi
---
# StorageClass definition (high-performance SSD)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-sc
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver; adjust for your cloud provider.
                              # (The legacy in-tree kubernetes.io/aws-ebs provisioner
                              # does not support the gp3 iops/throughput parameters.)
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain  # important: keep the data when the PVC is deleted
allowVolumeExpansion: true
---
# Milvus StatefulSet (stateful services belong in a StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: milvus
  namespace: ai-platform
spec:
  serviceName: milvus-headless
  replicas: 1
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
      - name: milvus
        image: milvusdb/milvus:v2.4.0
        command: ["milvus", "run", "standalone"]
        ports:
        - containerPort: 19530
          name: grpc
        - containerPort: 9091
          name: metrics
        resources:
          requests:
            cpu: "1000m"
            memory: "4Gi"
          limits:
            cpu: "4000m"
            memory: "8Gi"
        volumeMounts:
        - name: milvus-data
          mountPath: /var/lib/milvus
        readinessProbe:
          httpGet:
            path: /healthz
            port: 9091
          initialDelaySeconds: 30
          periodSeconds: 15
  volumeClaimTemplates:  # the StatefulSet creates PVCs automatically
  - metadata:
      name: milvus-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ssd-sc
      resources:
        requests:
          storage: 100Gi

7. Service and Ingress
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-service
  namespace: ai-platform
  labels:
    app: ai-service
spec:
  selector:
    app: ai-service
  ports:
  - port: 80
    targetPort: 8080
    name: http
  type: ClusterIP  # cluster-internal; exposed externally through the Ingress
---
# ingress.yaml - external entry point
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-service-ingress
  namespace: ai-platform
  annotations:
    # Nginx Ingress settings
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"  # 300s timeout for AI endpoints
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"     # max request body 10 MB
    # Rate limiting (keeps the AI service from being hammered)
    nginx.ingress.kubernetes.io/limit-rps: "100"  # 100 requests per second
    nginx.ingress.kubernetes.io/limit-connections: "50"
    # CORS (if the frontend lives on a different domain)
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.company.com"
    # HTTPS redirect
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # TLS certificate (Let's Encrypt)
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - ai-api.company.com
    secretName: ai-service-tls
  rules:
  - host: ai-api.company.com
    http:
      paths:
      - path: /api/ai
        pathType: Prefix
        backend:
          service:
            name: ai-service
            port:
              number: 80
      # Streaming endpoints listed separately. Note that nginx annotations apply
      # to the whole Ingress object, so giving /api/ai/stream a genuinely longer
      # timeout means moving it into a second Ingress with its own annotations.
      - path: /api/ai/stream
        pathType: Prefix
        backend:
          service:
            name: ai-service
            port:
              number: 80

8. K8s-Oriented Spring Boot Configuration
// Graceful shutdown, pairing with the K8s preStop hook. On Spring Boot 2.3+
// the `server.shutdown: graceful` setting below does this out of the box; a
// manual Tomcat customizer like this one is only needed on older versions.
@Configuration
public class GracefulShutdownConfig {
    @Bean
    public GracefulShutdown gracefulShutdown() {
        return new GracefulShutdown();
    }

    @Bean
    public ConfigurableServletWebServerFactory webServerFactory(GracefulShutdown gracefulShutdown) {
        TomcatServletWebServerFactory factory = new TomcatServletWebServerFactory();
        factory.addConnectorCustomizers(gracefulShutdown);
        return factory;
    }

    // Pause the connector, then drain in-flight requests before the JVM exits
    static class GracefulShutdown implements TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {
        private volatile Connector connector;
        @Override public void customize(Connector connector) { this.connector = connector; }
        @Override public void onApplicationEvent(ContextClosedEvent event) {
            if (connector == null) return;
            connector.pause();  // stop accepting new requests
            if (connector.getProtocolHandler().getExecutor() instanceof ExecutorService executor) {
                executor.shutdown();
                try { executor.awaitTermination(50, TimeUnit.SECONDS); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }
    }
}

// application.yml for the K8s environment:
# application-prod.yml
server:
  port: 8080
  shutdown: graceful  # enable graceful shutdown
spring:
  lifecycle:
    timeout-per-shutdown-phase: 50s  # wait up to 50s for in-flight requests
  # Connection pool settings (tuned for the K8s environment)
  datasource:
    hikari:
      maximum-pool-size: 20  # pool size (per Pod)
      minimum-idle: 5
      connection-timeout: 30000
      idle-timeout: 600000
      max-lifetime: 1800000
      # Re-establish connections after a Pod restart
      keepalive-time: 30000
  # Redis
  data:
    redis:
      lettuce:
        pool:
          max-active: 20
          max-idle: 10
          min-idle: 5
          max-wait: 3000ms
  # AI service settings (kept under the same spring: mapping; a top-level key
  # may appear only once per YAML document)
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.7
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: when-authorized
      # Separate K8s probes (independent readiness and liveness checks)
      probes:
        enabled: true
  health:
    livenessState:
      enabled: true
    readinessState:
      enabled: true

9. Monitoring and Alerting
# Prometheus monitoring rules for the AI service
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-service-rules
  namespace: ai-platform
spec:
  groups:
  - name: ai-service-alerts
    interval: 30s
    rules:
    # Alert 1: error rate too high
    - alert: AIServiceHighErrorRate
      expr: |
        sum(rate(http_server_requests_seconds_count{app="ai-service",status=~"5.."}[5m]))
        /
        sum(rate(http_server_requests_seconds_count{app="ai-service"}[5m])) > 0.05
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "AI service error rate above 5%"
        description: "Current error rate: {{ $value | humanizePercentage }}"
    # Alert 2: response latency too high
    - alert: AIServiceHighLatency
      expr: |
        histogram_quantile(0.99,
          sum(rate(http_server_requests_seconds_bucket{app="ai-service"}[5m])) by (le)
        ) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AI service P99 latency above 10 seconds"
    # Alert 3: Pod count close to the HPA ceiling (compare against the
    # configured maxReplicas, not the current replica count)
    - alert: AIServiceNearHPALimit
      expr: |
        kube_deployment_status_replicas{deployment="ai-service"} >=
        kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="ai-service-hpa"} * 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AI service Pod count is near the HPA ceiling; consider raising maxReplicas"
    # Alert 4: memory usage too high (possible leak)
    - alert: AIServiceHighMemory
      expr: |
        container_memory_usage_bytes{pod=~"ai-service.*", container="ai-service"}
        /
        container_spec_memory_limit_bytes{pod=~"ai-service.*", container="ai-service"} > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AI service memory usage above 85%"

10. Resource Cost Comparison
Real cost figures (an internet company's AI recommendation service, measured over 6 months):
| Option | Avg monthly machine cost | Resource utilization | Peak QPS | Failure recovery time |
|---|---|---|---|---|
| Fixed cloud hosts (2 x 16C/64G) | ¥14,000 | 28% | 1,200 | 5 min (manual) |
| K8s + HPA (dynamic scaling) | ¥5,200 | 73% | 1,200 | 30 s (automatic) |
| Savings | ¥8,800/month | +45 pp | same capacity | 90% faster |
HPA scaling record (a typical weekday):
| Time window | QPS | Pod count | CPU utilization |
|---|---|---|---|
| 00:00-06:00 | 40-80 | 2 | 15% |
| 06:00-09:00 | 200-400 | 3-4 | 45% |
| 09:00-18:00 | 800-1,200 | 8-12 | 65-72% |
| 18:00-22:00 | 1,000-1,400 | 10-14 | 70-78% |
| 22:00-24:00 | 300-500 | 4-5 | 40% |
11. Everyday Production Operations Commands
# ============ Routine inspection ============
# Pod status of the AI service
kubectl get pods -n ai-platform -l app=ai-service
# HPA status (watch scaling live)
kubectl get hpa -n ai-platform ai-service-hpa -w
# Pod logs (follow)
kubectl logs -n ai-platform -l app=ai-service -f --tail=100
# Pod resource usage
kubectl top pods -n ai-platform -l app=ai-service
# ============ Deployment operations ============
# Update the image version (triggers a rolling update)
kubectl set image deployment/ai-service \
  ai-service=registry.company.com/ai-service:v1.2.4 \
  -n ai-platform
# Watch rollout progress
kubectl rollout status deployment/ai-service -n ai-platform
# Quick rollback to the previous revision
kubectl rollout undo deployment/ai-service -n ai-platform
# Rollback to a specific revision
kubectl rollout undo deployment/ai-service --to-revision=2 -n ai-platform
# ============ Troubleshooting ============
# Pod details (OOMKilled, CrashLoopBackOff, and similar issues)
kubectl describe pod <pod-name> -n ai-platform
# Shell into a Pod for debugging
kubectl exec -it <pod-name> -n ai-platform -- sh
# Emergency manual scaling (temporarily overrides the HPA)
kubectl scale deployment ai-service --replicas=15 -n ai-platform
# Inspect the ConfigMap
kubectl get configmap ai-service-config -n ai-platform -o yaml
# Hot-edit the ConfigMap (some settings apply without a Pod restart)
kubectl edit configmap ai-service-config -n ai-platform

12. NetworkPolicy: Network Isolation for the AI Service
An AI service typically must call external LLM APIs while not being freely reachable itself. A NetworkPolicy gives you fine-grained control over both directions.
# networkpolicy.yaml - network isolation for the AI service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-service-netpol
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: ai-service
  # Ingress rules
  ingress:
  # Only the Ingress Controller may reach the service.
  # (namespaceSelector and podSelector in the SAME from-entry are ANDed;
  # as two separate entries they would be ORed.)
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  # Allow Prometheus to scrape metrics
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8080
  # Egress rules
  egress:
  # In-cluster dependencies
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ai-platform
    ports:
    - protocol: TCP
      port: 3306   # MySQL
    - protocol: TCP
      port: 6379   # Redis
    - protocol: TCP
      port: 19530  # Milvus
  # External LLM APIs (HTTPS)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8  # exclude private ranges
        - 172.16.0.0/12
        - 192.168.0.0/16
    ports:
    - protocol: TCP
      port: 443  # HTTPS only
  # DNS
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53

13. LimitRange and ResourceQuota: Resource Governance
When multiple teams share a K8s cluster, resource quotas are mandatory; otherwise a single service can eat every resource in it.
# resource-quota.yaml - namespace-level resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-platform-quota
  namespace: ai-platform
spec:
  hard:
    # Compute ceilings
    requests.cpu: "20"  # the whole namespace may request at most 20 CPU cores
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    # Object-count ceilings
    count/pods: "50"
    count/deployments.apps: "20"
    count/services: "20"
    # Storage ceilings
    requests.storage: "500Gi"
    persistentvolumeclaims: "20"
---
# limitrange.yaml - default resource limits per container/Pod
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-platform-limits
  namespace: ai-platform
spec:
  limits:
  # Container defaults (applied automatically to containers with no resources set)
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  # Whole-Pod bounds
  - type: Pod
    max:
      cpu: "8"
      memory: "16Gi"

14. Multi-Environment Configuration with Kustomize
k8s/
├── base/                         # shared base configuration
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── patch-replicas.yaml   # overrides the replica count
    ├── staging/
    │   ├── kustomization.yaml
    │   └── patch-resources.yaml
    └── prod/
        ├── kustomization.yaml
        ├── patch-resources.yaml
        └── patch-hpa.yaml

# k8s/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
- ingress.yaml
commonLabels:
  app: ai-service
  managed-by: kustomize
---
# k8s/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:  # the deprecated `bases:` field is replaced by resources
- ../../base
# Dev-specific overrides (`patches` replaces the deprecated patchesStrategicMerge)
patches:
- path: patch-dev.yaml
# Dev-specific ConfigMap overrides
configMapGenerator:
- name: ai-service-config
  behavior: merge
  literals:
  - SPRING_AI_OPENAI_CHAT_OPTIONS_MODEL=gpt-4o-mini  # a smaller model for dev
  - AI_SERVICE_MAX_TOKENS=1024
  - SPRING_PROFILES_ACTIVE=dev
---
# k8s/overlays/prod/patch-resources.yaml
# Production resource overrides
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-service
spec:
  replicas: 6
  template:
    spec:
      containers:
      - name: ai-service
        resources:
          requests:
            cpu: "1000m"
            memory: "2Gi"
          limits:
            cpu: "3000m"
            memory: "4Gi"

# Deploy commands
# Dev environment
kubectl apply -k k8s/overlays/dev
# Production environment
kubectl apply -k k8s/overlays/prod
# Render the final manifests without applying them
kubectl kustomize k8s/overlays/prod

FAQ
Q1: How should I size the memory limit for an AI service?
A: Account for the JVM, model inference, and conversation-context memory together. A rough formula: memory limit = heap (Xmx) x 1.3 + off-heap + caches. For example, a 2 GiB heap works out to roughly 2.6 GiB plus metaspace and thread stacks, which is why the Deployment above pairs a 3 GiB limit with -XX:MaxRAMPercentage=75.0. A pragmatic approach: run without limits for a while, observe actual usage, then set the limit to about 1.2x the observed peak.
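A quick way to sanity-check what the container-aware JVM will actually claim inside a Pod (a sketch; fill in your own pod name):

# Print the effective max heap the JVM derives from the cgroup memory limit
# (with the flags used in this post, a 3Gi limit x 75% yields roughly 2.25Gi)
kubectl exec -it <pod-name> -n ai-platform -- \
  java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep -i maxheapsize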
Q2: HPA scales on CPU, but my AI service is I/O-bound and CPU stays low. What then?
A: Use custom metrics. Expose http_requests_waiting (queued requests) or llm_response_time_p99 (LLM call latency) to Prometheus, then drive the HPA from those business metrics. They reflect an AI service's real load far better than CPU does.
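A hedged sketch of publishing those two metrics with Micrometer (the class and wiring below are illustrative, not from the original; the metric names match the HPA examples above, and ai_llm_response_time_p99_seconds is assumed to be derived from this timer via a Prometheus recording rule):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

@Component
public class AiLoadMetrics {
    private final AtomicInteger waiting = new AtomicInteger();
    private final Timer llmTimer;

    public AiLoadMetrics(MeterRegistry registry) {
        // Queue-depth gauge: backs the http_requests_waiting HPA metric
        Gauge.builder("http_requests_waiting", waiting, AtomicInteger::get)
             .register(registry);
        // LLM latency timer: exposes ai_llm_response_time_seconds percentiles
        llmTimer = Timer.builder("ai_llm_response_time")
             .publishPercentiles(0.5, 0.99)
             .register(registry);
    }

    /** Wrap every LLM call so both metrics stay accurate. */
    public <T> T recordLlmCall(Supplier<T> call) {
        waiting.incrementAndGet();
        try {
            return llmTimer.record(call);
        } finally {
            waiting.decrementAndGet();
        }
    }
}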
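On the cluster side, prometheus-adapter then needs a rule mapping the Prometheus series onto the custom-metrics API. A sketch only; the series and label names are assumptions consistent with the examples above, and the exact values layout depends on how you install the adapter:

# prometheus-adapter configuration excerpt (e.g. the helm chart's rules.custom)
rules:
- seriesQuery: 'http_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "http_requests_waiting"
    as: "http_requests_waiting"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'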
Q3: Milvus runs as a StatefulSet; how do I back up its data?
A: Two options: (1) periodically export the data to object storage (S3/OSS) via Milvus bulk export; (2) snapshot the PVC with a K8s VolumeSnapshot. A nightly full backup is a sensible default.
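A sketch of option 2, assuming a CSI driver with snapshot support and a VolumeSnapshotClass (hypothetically named csi-snapclass here) is installed; the PVC name follows the <template>-<statefulset>-<ordinal> convention from the StatefulSet above:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: milvus-data-snap-nightly
  namespace: ai-platform
spec:
  volumeSnapshotClassName: csi-snapclass  # assumed; depends on your CSI driver
  source:
    persistentVolumeClaimName: milvus-data-milvus-0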
Q4: How do I manage configuration across dev/test/prod?
A: Use Kustomize (K8s-native) or a Helm chart. With Kustomize, overlays/dev and overlays/prod each override the shared base. Keep Secrets out of Git; use Vault or your cloud provider's secret manager.
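One common way to keep Secrets out of Git is the External Secrets Operator. A sketch, assuming the operator is installed and a Vault-backed ClusterSecretStore named vault-backend exists (both are assumptions, as is the Vault path):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-service-secrets
  namespace: ai-platform
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend            # assumed store name
  target:
    name: ai-service-secrets       # the K8s Secret the Deployment already references
  data:
  - secretKey: OPENAI_API_KEY
    remoteRef:
      key: ai-platform/ai-service  # assumed Vault path
      property: openai_api_key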
Q5: What do I do about OOMKilled Pods?
A: Diagnose first: kubectl describe pod <name> and look at the termination reason. The usual causes: (1) the JVM heap is sized above the container limit; (2) a memory leak; (3) model loading uses more memory than expected. Fixes: raise the memory limit, or let the JVM adapt to the container with -XX:MaxRAMPercentage=75.0.
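A quick check that pulls the last termination state straight from the Pod status (reason OOMKilled together with exit code 137 confirms the kill):

kubectl get pod <pod-name> -n ai-platform -o \
  jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}'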
15. A Production-Grade K8s Operations Checklist
A go-live checklist to confirm item by item, distilled from real incidents:
Container level:
# Check 1: the image runs as a non-root user
docker inspect registry.company.com/ai-service:v1.2.3 \
  --format='{{.Config.User}}'
# Expected: appuser (not root)
# Check 2: image size
docker images registry.company.com/ai-service:v1.2.3 \
  --format='{{.Size}}'
# Expected: < 400MB
# Check 3: no secrets baked into image layers
docker history registry.company.com/ai-service:v1.2.3 --no-trunc \
  | grep -i "api.key\|password\|secret"
# Expected: no output

Kubernetes level:
# Check 4: every Pod has resource limits
kubectl get pods -n ai-platform -o json | \
  jq '.items[].spec.containers[].resources.limits // "MISSING"'
# Check 5: probes are configured
kubectl get deployment ai-service -n ai-platform -o jsonpath=\
'{.spec.template.spec.containers[0].readinessProbe}'
# Check 6: the HPA is healthy
kubectl describe hpa ai-service-hpa -n ai-platform
# Check 7: a PodDisruptionBudget exists
kubectl get pdb -n ai-platform
# Check 8: all Secrets are created
kubectl get secrets -n ai-platform | grep ai-service
# Check 9: resource requests are sane (below ~50% of node capacity)
kubectl describe nodes | grep -A 5 "Allocated resources"

Application level:
# Check 10: the health endpoint responds
kubectl port-forward -n ai-platform svc/ai-service 8080:80 &
curl http://localhost:8080/actuator/health | jq '.status'
# Expected: "UP"
# Check 11: Prometheus metrics are exposed
curl http://localhost:8080/actuator/prometheus | grep "jvm_memory"
# Check 12: logs are valid JSON (easier for ELK to ingest)
kubectl logs -n ai-platform -l app=ai-service --tail=5 | \
  python3 -c "import json,sys; [json.loads(l) for l in sys.stdin]"

Security level:
# Check 13: a NetworkPolicy is in place
kubectl get networkpolicy -n ai-platform
# Check 14: no critical vulnerabilities in the image
trivy image registry.company.com/ai-service:v1.2.3 \
  --severity CRITICAL --exit-code 1
# Check 15: RBAC follows least privilege
kubectl auth can-i --list --as=system:serviceaccount:ai-platform:ai-service

16. Performance Benchmarking
Before going live on K8s, run a performance baseline to confirm that the HPA scaling parameters are set sensibly.
// Benchmark harness (a plain-Java stand-in for JMeter or Gatling)
@Component
@Slf4j
public class PerformanceBenchmark {
    @Autowired
    private ChatClient chatClient;

    /**
     * Concurrency stress test: measures how much load a single Pod can carry.
     */
    public BenchmarkResult runConcurrencyTest(int concurrency, int totalRequests) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(concurrency);
        CountDownLatch latch = new CountDownLatch(totalRequests);
        AtomicInteger successCount = new AtomicInteger(0);
        AtomicInteger errorCount = new AtomicInteger(0);
        List<Long> responseTimes = Collections.synchronizedList(new ArrayList<>());
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < totalRequests; i++) {
            executor.submit(() -> {
                long reqStart = System.currentTimeMillis();
                try {
                    chatClient.prompt()
                            .user("Hello, please introduce yourself")
                            .call()
                            .content();
                    successCount.incrementAndGet();
                    responseTimes.add(System.currentTimeMillis() - reqStart);
                } catch (Exception e) {
                    errorCount.incrementAndGet();
                    log.debug("Request failed", e);
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await(5, TimeUnit.MINUTES);  // cap the whole run at 5 minutes
        executor.shutdown();
        long totalTime = System.currentTimeMillis() - startTime;
        Collections.sort(responseTimes);   // sorted for percentile lookup (assumes >= 1 success)
        return BenchmarkResult.builder()
                .concurrency(concurrency)
                .totalRequests(totalRequests)
                .successCount(successCount.get())
                .errorCount(errorCount.get())
                .totalTimeMs(totalTime)
                .throughputQps((double) successCount.get() / totalTime * 1000)
                .avgResponseMs(responseTimes.stream().mapToLong(Long::longValue).average().orElse(0))
                .p50ResponseMs(responseTimes.get((int) (responseTimes.size() * 0.5)))
                .p99ResponseMs(responseTimes.get((int) (responseTimes.size() * 0.99)))
                .build();
    }
}

Measured baseline (single Pod, 2C/4G):
| Concurrency | Success rate | Avg response (ms) | P99 response (ms) | QPS |
|---|---|---|---|---|
| 5 | 100% | 1,820 | 3,200 | 2.7 |
| 10 | 100% | 2,340 | 5,100 | 4.3 |
| 20 | 99.3% | 4,180 | 9,800 | 4.8 |
| 30 | 97.1% | 8,920 | 22,000 | 3.4 |
Conclusion: a single Pod performs best between 10 and 15 concurrent requests; past 20, P99 climbs steeply. A CPU-utilization target in the 65-70% range (the HPA manifest above uses 70%) is therefore reasonable: at that level each Pod carries roughly 12 concurrent requests, inside the optimal band.
Closing Thoughts
The value of K8s for AI applications is not just saved machine cost; it changes the operating model itself.
In the fixed-server era, scaling was a human task: notice the traffic spike, request machines, configure, deploy, and wait hours or even days. In the K8s HPA era, scaling is the system's task: CPU spikes and extra Pods come up within 30 seconds; traffic drops and they slowly drain away, and nobody gets paged at night.
From an operations standpoint, K8s does have a real learning curve, but once mastered the payoff is equally real: higher resource utilization, faster self-healing, and safer deployment workflows.
This is not "use K8s and you're instantly elite" hype; it is a core skill for Java engineers in the cloud-native era.
