Article 1653: Packaging AI Services with a Helm Chart for Standardized Deployment and Multi-Environment Configuration Management
I recently reviewed a team's deployment setup and ran into a classic problem: four environments (dev, test, staging, prod), each with its own standalone set of YAML files. The files were mostly identical; only the model path, replica count, and resource limits differed.
Every release meant editing four copies of everything, which regularly produced low-grade mistakes like "staging was updated but prod was forgotten" or "meant to edit the test YAML, accidentally changed prod."
This is not an isolated case; most teams pass through this stage early in their K8s adoption. The fix is not complicated either: use Helm. What does need spelling out is how to design a Helm Chart specifically for AI services, so that Helm is genuinely useful rather than adopted for form's sake.
Why an AI service's Helm Chart is more complex than a web service's
Helm is the package manager for Kubernetes. Its core idea is to extract the parts of your Kubernetes YAML that vary into template variables and manage per-environment differences through values.yaml.
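A minimal sketch of the idea (the field values here are placeholders, not part of the chart built later in this article):
# templates/deployment.yaml (fragment): the varying parts become template expressions
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

# values.yaml: each environment supplies its own values
replicaCount: 2
image:
  repository: your-registry.io/ai/llm-inference
  tag: "2.1.0"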
For a typical web service, the variable parts are usually just the image tag, the replica count, and a few environment variables, which Helm handles with ease. An AI service has many more moving parts:
- Model file path or model version
- Inference engine configuration (TensorRT, ONNX Runtime, vLLM, and others each take different parameters)
- GPU resource allocation (GPU models and resource limits differ per environment)
- Model loading strategy (a dev environment does not need to load the full model)
- A/B test traffic-split ratios
- External dependencies (vector database address, Embedding service address)
On top of that, an AI service is rarely a single deployment. A complete AI application may include an inference service, an Embedding service, a vector database, a document-parsing service, an API gateway, and more. These services depend on one another, so deployment order and configuration have to be coordinated; the umbrella-chart sketch below shows one way to express that.
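One common way to coordinate the pieces is an umbrella chart that declares the other services as dependencies in its Chart.yaml. A sketch; the subchart names, versions, and repository URLs below are illustrative, not a vetted setup:
# Chart.yaml of an umbrella chart (illustrative)
dependencies:
  - name: milvus
    version: "4.1.x"            # pin a concrete version in practice
    repository: "https://zilliztech.github.io/milvus-helm/"
    condition: milvus.enabled   # dev can point at a shared instance instead
  - name: embedding-service
    version: "1.0.0"
    repository: "file://../embedding-service"  # local sibling chart
    condition: embedding.enabled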
Helm Chart directory layout
Here is the structure we designed for the AI service Helm Chart:
ai-inference-chart/
├── Chart.yaml              # Chart metadata
├── values.yaml             # Default values (usually the dev-environment config)
├── values-test.yaml        # Test-environment overrides
├── values-staging.yaml     # Staging-environment overrides
├── values-prod.yaml        # Production-environment overrides
├── templates/
│   ├── _helpers.tpl        # Shared template helpers
│   ├── deployment.yaml     # Inference service Deployment
│   ├── service.yaml        # Service
│   ├── hpa.yaml            # HPA config
│   ├── configmap.yaml      # Application config
│   ├── secret.yaml         # Sensitive config (usually a placeholder; real values injected externally)
│   ├── pvc.yaml            # Model file storage
│   ├── serviceaccount.yaml # ServiceAccount
│   ├── rbac.yaml           # RBAC config
│   ├── pdb.yaml            # PodDisruptionBudget
│   └── ingress.yaml        # Ingress (optional)
├── charts/                 # Subcharts (dependent services)
└── README.md
There are a few design decisions behind this structure:
First, values are layered by environment: values.yaml carries the base configuration, and each environment's file contains only the diffs. At deploy time you overlay it with -f values-prod.yaml, as sketched below.
Second, for sensitive data (Secrets), only the structure lives in the Chart; the actual contents are injected by CI/CD. Real keys must never be committed into the Chart.
Third, model-related configuration is pulled out on its own, because models and code are released on different cadences.
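A minimal sketch of how the layering and the secret injection combine at deploy time (the release name, namespace, and the secrets.openaiApiKey values key are illustrative):
# Overlay base values with the environment file; the later -f wins on conflicts
helm upgrade --install ai-inference ./ai-inference-chart \
  --namespace ai-staging --create-namespace \
  -f ./ai-inference-chart/values.yaml \
  -f ./ai-inference-chart/values-staging.yaml \
  --set-string secrets.openaiApiKey="${OPENAI_API_KEY}"  # injected by CI, never committed

# Render locally to inspect the fully merged result before trusting a deploy
helm template ai-inference ./ai-inference-chart \
  -f ./ai-inference-chart/values.yaml \
  -f ./ai-inference-chart/values-staging.yaml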
Chart.yaml
apiVersion: v2
name: ai-inference
description: Enterprise-grade Helm Chart for the AI inference service
type: application
version: 1.3.0        # Chart version; bump whenever the Chart structure changes
appVersion: "2.1.0"   # App version; tracks the service image version
keywords:
  - ai
  - inference
  - llm
  - machine-learning
maintainers:
  - name: ai-platform-team
    email: ai-platform@company.com
dependencies:
  - name: prometheus-servicemonitor
    version: "0.1.0"
    repository: "https://prometheus-community.github.io/helm-charts"
    condition: monitoring.enabled
The layered design of values.yaml
This is the heart of the Helm Chart; how well it is designed directly determines maintainability.
# values.yaml: default configuration (suitable for the dev environment)

# Global settings
global:
  environment: dev
  imageRegistry: "your-registry.io"
  imagePullSecrets:
    - name: registry-credentials

# Basic service info
service:
  name: llm-inference
  port: 8080
  type: ClusterIP

# Image settings
image:
  repository: "your-registry.io/ai/llm-inference"
  tag: "latest"
  pullPolicy: Always

# Replicas and autoscaling
replicaCount: 1
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 5
  targetQueueDepth: 10        # queue-depth threshold that triggers scale-out
  targetCpuUtilization: 70

# Model settings
model:
  name: "llm-7b"
  version: "v2.0"
  source: "local"             # local | s3 | oss | huggingface
  localPath: "/models"
  s3Bucket: ""
  s3Key: ""
  # Model loading strategy
  lazyLoad: true              # lazy-load in dev to cut startup time
  warmupOnStart: false

# Inference engine settings
inference:
  engine: "vllm"              # vllm | triton | tgi
  maxBatchSize: 4
  maxSequenceLength: 2048
  maxConcurrentRequests: 10
  timeout: 120                # per-request inference timeout, in seconds
  # vLLM-specific settings
  vllm:
    tensorParallelSize: 1
    gpuMemoryUtilization: 0.85
    maxNumSeqs: 256

# Resource settings
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"

# GPU settings (dev requests no GPU by default)
gpu:
  enabled: false
  count: 0
  type: ""

# Node scheduling
nodeSelector: {}
tolerations: []
affinity: {}

# Health checks
healthCheck:
  startupProbe:
    enabled: true
    initialDelaySeconds: 60   # AI services start slowly; allow plenty of time
    periodSeconds: 10
    failureThreshold: 30      # allow up to 5 minutes to finish starting
  livenessProbe:
    path: /actuator/health/liveness
    initialDelaySeconds: 0
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 3
  readinessProbe:
    path: /actuator/health/readiness
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6

# Storage settings
persistence:
  enabled: false              # no PVC in dev
  storageClass: ""
  size: "50Gi"
  accessMode: ReadOnlyMany    # model files shared read-only across Pods

# External dependencies
vectorDB:
  type: "milvus"
  host: "milvus-standalone"
  port: 19530
  collectionName: "knowledge_base"
embeddingService:
  url: "http://embedding-service:8080"
  model: "bge-large-zh"

# Application config rendered into the ConfigMap
config:
  logging:
    level: DEBUG
    format: json
  metrics:
    enabled: true
    port: 9090
  tracing:
    enabled: false
    endpoint: ""

# Monitoring
monitoring:
  enabled: false
  serviceMonitor:
    interval: 30s
    scrapeTimeout: 10s

# PDB settings
podDisruptionBudget:
  enabled: false
  minAvailable: 1
The production override file (values-prod.yaml) then only needs to state the diffs, which removes most of the duplication:
# values-prod.yaml: production overrides
global:
  environment: prod

replicaCount: 4
autoscaling:
  enabled: true
  minReplicas: 4
  maxReplicas: 20
  targetQueueDepth: 5

model:
  source: "oss"
  s3Bucket: "ai-models-prod"
  s3Key: "llm-7b/v2.0/model.bin"
  lazyLoad: false
  warmupOnStart: true

inference:
  maxBatchSize: 8
  maxConcurrentRequests: 50

resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    cpu: "8"
    memory: "32Gi"

gpu:
  enabled: true
  count: 1
  type: "nvidia-tesla-t4"

nodeSelector:
  node-role: inference
  gpu-type: t4
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - llm-inference
          topologyKey: kubernetes.io/hostname

healthCheck:
  startupProbe:
    initialDelaySeconds: 120
    failureThreshold: 60

persistence:
  enabled: true
  storageClass: "fast-ssd"
  size: "200Gi"

vectorDB:
  host: "milvus-prod-cluster.internal"
embeddingService:
  url: "http://embedding-service-prod:8080"

config:
  logging:
    level: INFO
  tracing:
    enabled: true
    endpoint: "http://jaeger-collector:14268/api/traces"

monitoring:
  enabled: true

podDisruptionBudget:
  enabled: true
  minAvailable: 2
Core template files
_helpers.tpl: template helper functions
This file defines reusable template fragments so the same logic is not repeated across templates:
{{/* Standard labels */}}
{{- define "ai-inference.labels" -}}
helm.sh/chart: {{ include "ai-inference.chart" . }}
app.kubernetes.io/name: {{ include "ai-inference.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
environment: {{ .Values.global.environment }}
{{- end }}

{{/* Selector labels */}}
{{- define "ai-inference.selectorLabels" -}}
app.kubernetes.io/name: {{ include "ai-inference.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/* Image reference */}}
{{- define "ai-inference.image" -}}
{{- if .Values.image.digest -}}
{{ .Values.image.repository }}@{{ .Values.image.digest }}
{{- else -}}
{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}
{{- end -}}
{{- end }}

{{/* GPU resource entries (note: gpu lives at the top level of values, not under resources) */}}
{{- define "ai-inference.gpuResources" -}}
{{- if .Values.gpu.enabled }}
nvidia.com/gpu: {{ .Values.gpu.count | quote }}
{{- end }}
{{- end }}

{{/* Environment variables derived from the model source */}}
{{- define "ai-inference.modelEnvVars" -}}
- name: MODEL_NAME
  value: {{ .Values.model.name | quote }}
- name: MODEL_VERSION
  value: {{ .Values.model.version | quote }}
- name: MODEL_SOURCE
  value: {{ .Values.model.source | quote }}
{{- if eq .Values.model.source "local" }}
- name: MODEL_LOCAL_PATH
  value: {{ .Values.model.localPath | quote }}
{{- else if or (eq .Values.model.source "s3") (eq .Values.model.source "oss") }}
- name: MODEL_BUCKET
  value: {{ .Values.model.s3Bucket | quote }}
- name: MODEL_KEY
  value: {{ .Values.model.s3Key | quote }}
{{- else if eq .Values.model.source "huggingface" }}
- name: MODEL_HF_REPO
  value: {{ .Values.model.hfRepo | quote }}
{{- end }}
{{- end }}
deployment.yaml: the Deployment template
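The Deployment template below consumes these helpers. As a quick sanity check, given the prod values above, the ai-inference.modelEnvVars helper should expand to roughly:
- name: MODEL_NAME
  value: "llm-7b"
- name: MODEL_VERSION
  value: "v2.0"
- name: MODEL_SOURCE
  value: "oss"
- name: MODEL_BUCKET
  value: "ai-models-prod"
- name: MODEL_KEY
  value: "llm-7b/v2.0/model.bin"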
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "ai-inference.fullname" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "ai-inference.labels" . | nindent 4 }}
  annotations:
    # Record the deployed app version under a custom key.
    # (deployment.kubernetes.io/revision is managed by the Deployment
    # controller and must not be set by hand.)
    app-version: "{{ .Values.image.tag | default .Chart.AppVersion }}"
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "ai-inference.selectorLabels" . | nindent 6 }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # zero-downtime updates for the AI service
      maxSurge: 1
  template:
    metadata:
      labels:
        {{- include "ai-inference.labels" . | nindent 8 }}
      annotations:
        # roll Pods whenever the ConfigMap changes
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      {{- with .Values.global.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "ai-inference.serviceAccountName" . }}
      terminationGracePeriodSeconds: 300
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- if or (eq .Values.model.source "s3") (eq .Values.model.source "oss") }}
      # Init container: download the model files from object storage.
      # (For OSS this assumes an S3-compatible endpoint is configured for the CLI.)
      initContainers:
        - name: model-downloader
          image: amazon/aws-cli:latest
          command:
            - /bin/sh
            - -c
            - |
              echo "Downloading model..."
              aws s3 cp s3://{{ .Values.model.s3Bucket }}/{{ .Values.model.s3Key }} /models/ --recursive
              echo "Model download complete"
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: {{ include "ai-inference.fullname" . }}-secrets
                  key: aws-access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: {{ include "ai-inference.fullname" . }}-secrets
                  key: aws-secret-access-key
          volumeMounts:
            - name: model-storage
              mountPath: /models
      {{- end }}
      containers:
        - name: inference-server
          image: {{ include "ai-inference.image" . }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
            - name: metrics
              containerPort: {{ .Values.config.metrics.port }}
          env:
            {{- include "ai-inference.modelEnvVars" . | nindent 12 }}
            - name: SPRING_PROFILES_ACTIVE
              value: {{ .Values.global.environment | quote }}
            - name: INFERENCE_ENGINE
              value: {{ .Values.inference.engine | quote }}
            - name: MAX_BATCH_SIZE
              value: {{ .Values.inference.maxBatchSize | quote }}
            - name: MAX_CONCURRENT_REQUESTS
              value: {{ .Values.inference.maxConcurrentRequests | quote }}
            - name: VECTOR_DB_HOST
              value: {{ .Values.vectorDB.host | quote }}
            - name: VECTOR_DB_PORT
              value: {{ .Values.vectorDB.port | quote }}
            - name: EMBEDDING_SERVICE_URL
              value: {{ .Values.embeddingService.url | quote }}
            # Sensitive values come from the Secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: {{ include "ai-inference.fullname" . }}-secrets
                  key: openai-api-key
                  optional: true   # not every environment needs it
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu | quote }}
              memory: {{ .Values.resources.requests.memory | quote }}
              {{- include "ai-inference.gpuResources" . | nindent 14 }}
            limits:
              cpu: {{ .Values.resources.limits.cpu | quote }}
              memory: {{ .Values.resources.limits.memory | quote }}
              {{- include "ai-inference.gpuResources" . | nindent 14 }}
          volumeMounts:
            - name: app-config
              mountPath: /app/config
              readOnly: true
            {{- if .Values.persistence.enabled }}
            - name: model-storage
              mountPath: {{ .Values.model.localPath }}
              readOnly: true
            {{- end }}
          # Health checks
          {{- if .Values.healthCheck.startupProbe.enabled }}
          startupProbe:
            httpGet:
              path: /actuator/health
              port: http
            initialDelaySeconds: {{ .Values.healthCheck.startupProbe.initialDelaySeconds }}
            periodSeconds: {{ .Values.healthCheck.startupProbe.periodSeconds }}
            failureThreshold: {{ .Values.healthCheck.startupProbe.failureThreshold }}
          {{- end }}
          livenessProbe:
            httpGet:
              path: {{ .Values.healthCheck.livenessProbe.path }}
              port: http
            initialDelaySeconds: {{ .Values.healthCheck.livenessProbe.initialDelaySeconds }}
            periodSeconds: {{ .Values.healthCheck.livenessProbe.periodSeconds }}
            timeoutSeconds: {{ .Values.healthCheck.livenessProbe.timeoutSeconds }}
            failureThreshold: {{ .Values.healthCheck.livenessProbe.failureThreshold }}
          readinessProbe:
            httpGet:
              path: {{ .Values.healthCheck.readinessProbe.path }}
              port: http
            initialDelaySeconds: {{ .Values.healthCheck.readinessProbe.initialDelaySeconds }}
            periodSeconds: {{ .Values.healthCheck.readinessProbe.periodSeconds }}
            timeoutSeconds: {{ .Values.healthCheck.readinessProbe.timeoutSeconds }}
            failureThreshold: {{ .Values.healthCheck.readinessProbe.failureThreshold }}
      volumes:
        - name: app-config
          configMap:
            name: {{ include "ai-inference.fullname" . }}-config
        {{- if .Values.persistence.enabled }}
        - name: model-storage
          persistentVolumeClaim:
            claimName: {{ include "ai-inference.fullname" . }}-models
        {{- end }}
CI/CD for multi-environment deployment
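Before handing deploys to the pipeline, a quick local pre-flight catches template mistakes early. A sketch, with illustrative paths:
# Static checks: schema and template errors surface here, not at deploy time
helm lint ./helm/ai-inference \
  -f ./helm/ai-inference/values.yaml \
  -f ./helm/ai-inference/values-prod.yaml

# Render the manifests without touching the cluster and spot-check the image refs
helm template ai-inference ./helm/ai-inference \
  -f ./helm/ai-inference/values.yaml \
  -f ./helm/ai-inference/values-prod.yaml \
  --set image.tag=2.1.0 | grep -A2 "image:"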
With the Chart in place, the CI/CD flow itself can be standardized. Here is an example using GitLab CI:
# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-test
  - deploy-staging
  - deploy-prod

variables:
  CHART_NAME: ai-inference
  HELM_TIMEOUT: "600s"

.helm_deploy: &helm_deploy
  image: alpine/helm:3.12.0
  before_script:
    - helm repo add stable https://charts.helm.sh/stable
    # note: the image must also provide kubectl for this context switch
    - kubectl config use-context $KUBE_CONTEXT

deploy-to-test:
  <<: *helm_deploy
  stage: deploy-test
  environment:
    name: test
  variables:
    KUBE_CONTEXT: $KUBE_CONTEXT_TEST
  script:
    - |
      helm upgrade --install ${CHART_NAME} ./helm/${CHART_NAME} \
        --namespace ai-test \
        --create-namespace \
        -f ./helm/${CHART_NAME}/values.yaml \
        -f ./helm/${CHART_NAME}/values-test.yaml \
        --set image.tag=${CI_COMMIT_SHORT_SHA} \
        --set global.environment=test \
        --timeout ${HELM_TIMEOUT} \
        --wait \
        --atomic   # roll back automatically if the deploy fails
  only:
    - develop

deploy-to-prod:
  <<: *helm_deploy
  stage: deploy-prod
  environment:
    name: production
  variables:
    KUBE_CONTEXT: $KUBE_CONTEXT_PROD
  script:
    - |
      # Dry-run before the real production deploy
      helm upgrade --install ${CHART_NAME} ./helm/${CHART_NAME} \
        --namespace ai-prod \
        -f ./helm/${CHART_NAME}/values.yaml \
        -f ./helm/${CHART_NAME}/values-prod.yaml \
        --set image.tag=${CI_COMMIT_SHORT_SHA} \
        --set global.environment=prod \
        --dry-run 2>&1 | tee /tmp/helm-dry-run.log
      echo "Dry-run result:"
      cat /tmp/helm-dry-run.log
      # Real deploy
      helm upgrade --install ${CHART_NAME} ./helm/${CHART_NAME} \
        --namespace ai-prod \
        -f ./helm/${CHART_NAME}/values.yaml \
        -f ./helm/${CHART_NAME}/values-prod.yaml \
        --set image.tag=${CI_COMMIT_SHORT_SHA} \
        --set global.environment=prod \
        --timeout ${HELM_TIMEOUT} \
        --wait \
        --atomic \
        --history-max 10   # keep the last 10 revisions for easy rollback
  when: manual   # production deploys are triggered manually
  only:
    - main
Rollback commands
Rolling back with Helm is very convenient:
# Inspect the release history
helm history ai-inference -n ai-prod

# Roll back to the previous revision
helm rollback ai-inference -n ai-prod

# Roll back to a specific revision
helm rollback ai-inference 3 -n ai-prod

# Verify after rolling back
kubectl rollout status deployment/ai-inference -n ai-prod
An often-overlooked best practice: separate the Chart version from the app version
Many teams tie the Chart version to the application version: the code changes, so the Chart version gets bumped too. The problem is that a template-only change (say, adding a label) then also bumps the application version, which makes version tracking muddy.
The right approach is to manage them separately:
- Chart.version: bump when the Chart structure changes (template edits, resources added or removed, and so on)
- Chart.appVersion: bump when the application code changes (tracks the service image version)
- At deploy time, override the concrete image tag with --set image.tag=xxx
That way, "which Chart template did this deploy use?" and "which application version did it ship?" stay independently traceable.
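For example, helm history shows both versions side by side (output abridged; revision numbers are illustrative):
helm history ai-inference -n ai-prod
# REVISION  CHART               APP VERSION  DESCRIPTION
# 7         ai-inference-1.3.0  2.1.0        Upgrade complete
# 8         ai-inference-1.3.1  2.1.0        Upgrade complete   <- template-only change
# 9         ai-inference-1.3.1  2.2.0        Upgrade complete   <- app release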
Summary
The core value of packaging AI services with a Helm Chart is not technical showmanship; it is concentrating the complexity of multi-environment configuration in one place, making the differences between environments obvious at a glance, and standardizing the CI/CD flow.
Designing a good AI-service Helm Chart comes down to:
- Identifying which settings vary across environments and lifting them into values variables
- Moving shared logic into template functions in _helpers.tpl to avoid repetition
- Keeping the production values file to overrides only, so it stays lean
- Keeping actual Secret contents out of the Chart and having CI/CD inject them at deploy time
- Always passing --atomic in production so failed deploys roll back automatically
A well-designed Helm Chart turns a release from "edit multiple YAML files and pray nothing was missed" into "one command, with automatic rollback if something goes wrong." That is a real, tangible gain in team efficiency.
