No. 1651: Deploying AI Applications on Kubernetes in Practice, with AI-Specific ConfigMap, Secret, and HPA Configuration

Lao Zhang, 2026/4/30, about 11 minutes


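Last year our team migrated a recommendation system from VMs to K8s and hit plenty of pitfalls. The one I remember best: on a Friday afternoon the model service suddenly crashed with an OOM, and it turned out the HPA configuration had been copied straight from a web service template, with no thought given to how GPU memory behaves. We fixed it before leaving that day, and the postmortem surfaced a whole list of things that are specific to AI applications and that ordinary K8s tutorials basically never mention.

This article goes through, systematically, where ConfigMap, Secret, and HPA differ for AI services compared with ordinary services on K8s, and what our team actually does about it.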

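First, why AI applications differ from ordinary web services

Deploying an ordinary Spring Boot service to K8s mostly comes down to replica count, CPU/memory limits, and health checks; a template does the job. AI applications have a few peculiarities:

Model files are large. For a mid-sized LLM, the weight files easily run from a few GB to tens of GB. They cannot go into a ConfigMap, are a poor fit for baking into an image, and have to be handled separately.

Inference is stateful. "Stateful" here does not mean data persistence; it means the model state loaded into GPU memory. If the HPA scales away a Pod that is in the middle of a long inference request, killing it outright is a disaster.

Resource consumption follows a different pattern. A web service's CPU usage is roughly linear in QPS. An AI inference service is not necessarily like that: during batch inference both CPU and GPU run high, when idle they sit near zero, and in between there is a warm-up phase whose resource profile is different again.

Configuration is sensitive. Model paths, API keys, vector database connection strings: a mistake in any of these causes service errors at best and a data leak at worst.

With these differences in mind, let's go through the configuration piece by piece.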

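ConfigMap: more than a place to store config

Everyone knows the basics, but AI workloads have extra needs

The simplest ConfigMap just mounts a configuration file: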

apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-service-config
  namespace: ai-prod
data:
  application.yml: |
    server:
      port: 8080
    model:
      path: /models/llm-7b
      max-tokens: 4096
      batch-size: 8
      timeout-seconds: 120
    inference:
      thread-pool-size: 4
      queue-max-size: 100
      warm-up-on-start: true

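Nothing wrong with that, but AI services have one kind of configuration that ordinary services don't: inference parameters that need to be adjusted dynamically.

In production, inference parameters often have to change with load. For example, you lower batch-size at peak to reduce per-request latency and raise it off-peak to improve throughput. If every tuning change required a redeploy, the ops team would not survive.

Our approach is to split the inference parameters into their own ConfigMap and watch that ConfigMap for changes from the Java code: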

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-params
  namespace: ai-prod
  labels:
    config-type: dynamic
    reload: "true"
data:
  params.json: |
    {
      "max_batch_size": 8,
      "max_sequence_length": 2048,
      "temperature": 0.7,
      "top_p": 0.9,
      "num_beams": 1,
      "do_sample": true
    }

On the Java side, a Spring bean uses the Kubernetes Java client (the API below is the Fabric8 client's) to watch for changes:

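// Imports assume the Fabric8 Kubernetes client, Lombok, and Spring Boot 3
// (use javax.annotation instead of jakarta.annotation on Spring Boot 2.x).
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;
import jakarta.annotation.PostConstruct;
import jakarta.annotation.PreDestroy;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Component
@Slf4j
public class InferenceParamWatcher {

    private final KubernetesClient k8sClient;
    // volatile so request threads always see the latest parameters
    private volatile InferenceParams currentParams;
    private Watch watch;

    public InferenceParamWatcher(KubernetesClient k8sClient) {
        this.k8sClient = k8sClient;
        // loadDefaultParams() and parseParams() are helper methods not shown here
        this.currentParams = loadDefaultParams();
    }

    @PostConstruct
    public void startWatching() {
        watch = k8sClient.configMaps()
            .inNamespace("ai-prod")
            .withName("inference-params")
            .watch(new Watcher<ConfigMap>() {
                @Override
                public void eventReceived(Action action, ConfigMap resource) {
                    if (action == Action.MODIFIED || action == Action.ADDED) {
                        try {
                            String paramsJson = resource.getData().get("params.json");
                            InferenceParams newParams = parseParams(paramsJson);
                            currentParams = newParams;
                            log.info("Inference params updated: maxBatchSize={}, temperature={}",
                                newParams.getMaxBatchSize(), newParams.getTemperature());
                        } catch (Exception e) {
                            // A bad edit must not take the service down
                            log.error("Failed to parse inference params, keeping the old ones", e);
                        }
                    }
                }

                @Override
                public void onClose(WatcherException cause) {
                    log.warn("ConfigMap watch connection closed, trying to reconnect", cause);
                    // Reconnect after a delay to avoid a tight retry loop
                    scheduleReconnect();
                }
            });
    }

    public InferenceParams getCurrentParams() {
        return currentParams;
    }

    private void scheduleReconnect() {
        // Reconnects after a fixed 5-second delay (this is not exponential backoff)
        CompletableFuture.delayedExecutor(5, TimeUnit.SECONDS)
            .execute(this::startWatching);
    }

    @PreDestroy
    public void stopWatching() {
        if (watch != null) {
            watch.close();
        }
    }
}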

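With this in place, once ops changes the parameters with kubectl edit configmap inference-params, the Java service picks them up without a restart. Our team has used this pattern for almost two years and it has saved a lot of trouble.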
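The watcher above reconnects after a fixed 5-second delay. If the API server stays unreachable for a while, a capped exponential backoff is gentler. The sketch below is one possible way to compute the delays; the class name ReconnectBackoff and the 5 s / 60 s values are assumptions for illustration, not part of the original service.

```java
/** Computes capped exponential backoff delays for watch reconnects (hypothetical helper). */
final class ReconnectBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private int attempt = 0;

    ReconnectBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Returns base * 2^attempt capped at maxMillis, then advances the attempt counter. */
    synchronized long nextDelayMillis() {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Jump straight to the cap instead of doubling past it (also avoids overflow)
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        attempt++;
        return Math.min(delay, maxMillis);
    }

    /** Call after a successful reconnect so the next failure starts from the base delay. */
    synchronized void reset() {
        attempt = 0;
    }
}
```

In the watcher, scheduleReconnect() could pass nextDelayMillis() to CompletableFuture.delayedExecutor(..., TimeUnit.MILLISECONDS), and eventReceived could call reset() once events flow again. Adding random jitter on top is a common refinement when many Pods reconnect at the same time.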

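The ConfigMap size limit

One pitfall deserves a special mention: the data in a ConfigMap cannot exceed 1 MiB in total, all keys and values combined, so a single key is bounded by the same cap (it exists to keep objects in etcd small). Long prompt templates run into this easily.

We hit it once: the system prompt template of a RAG system plus its few-shot examples added up to more than 1 MiB, and kubectl apply simply failed.

There are two ways out:

1. Store the large prompt templates in object storage (OSS/S3, for example), fetch them at startup, and keep only the paths in the ConfigMap
2. Split the prompts into several smaller ConfigMaps, one per functional module

We ended up choosing option 1. It is more flexible, and the prompts can be versioned: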

apiVersion: v1
kind: ConfigMap
metadata:
  name: prompt-template-refs
  namespace: ai-prod
data:
  system-prompt-url: "oss://ai-prompts/prod/system-prompt-v2.3.txt"
  few-shot-examples-url: "oss://ai-prompts/prod/few-shot-v1.1.json"
  prompt-version: "2.3"
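To catch the 1 MiB limit mentioned above before kubectl apply rejects the object, a pre-flight size check in CI can help. The sketch below sums the UTF-8 byte length of keys and values; ConfigMapSizeCheck is a hypothetical helper, and the figure it computes leaves out metadata, so treat it as a lower bound on the stored size. Note that Chinese text takes 3 bytes per character in UTF-8, so character counts understate the size.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Pre-flight check of ConfigMap data size (hypothetical helper, not a Kubernetes API). */
final class ConfigMapSizeCheck {
    /** 1 MiB, the upper bound on the data a ConfigMap may hold. */
    static final long LIMIT_BYTES = 1024L * 1024L;

    /** Sums the UTF-8 byte length of every key and value; object metadata is not counted. */
    static long dataBytes(Map<String, String> data) {
        long total = 0;
        for (Map.Entry<String, String> e : data.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += e.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** True when the keys and values together stay within the 1 MiB cap. */
    static boolean fits(Map<String, String> data) {
        return dataBytes(data) <= LIMIT_BYTES;
    }
}
```

With the reference-only ConfigMap shown above, the data is a few hundred bytes, so the check passes with a wide margin regardless of how long the prompts in object storage become.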

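Secret: key management for AI services is more complicated than you think

It's not just API keys

AI services deal with far more sensitive information than web services:

LLM API keys (OpenAI, Claude, and so on)
Vector database connection strings (Milvus, Weaviate)
Model repository credentials (HuggingFace token, private model registries)
GPU cluster access credentials
Data labeling platform API keys

The most basic approach is a K8s Secret: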

apiVersion: v1
kind: Secret
metadata:
  name: ai-service-secrets
  namespace: ai-prod
type: Opaque
stringData:
  openai-api-key: "sk-xxxxxxxxxxxxxxxx"
  milvus-password: "your-milvus-password"
  huggingface-token: "hf_xxxxxxxxxxxxxxxx"
  model-registry-url: "https://your-model-registry.internal"
  model-registry-token: "token-xxxxxxxx"

Then reference it from the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-service
  namespace: ai-prod
spec:
  template:
    spec:
      containers:
      - name: inference-server
        image: your-registry/ai-inference:v2.1.0
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-service-secrets
              key: openai-api-key
        - name: MILVUS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ai-service-secrets
              key: milvus-password
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: ai-service-secrets
              key: huggingface-token

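But native Secrets have a big problem

By default, a native K8s Secret is stored in etcd Base64-encoded, not encrypted (unless encryption at rest has been configured on the API server). Anyone with permission to run kubectl get secret can read the plaintext directly. In companies with strict security requirements, this will not pass a security audit.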
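To make the point concrete, here is a minimal sketch showing that Base64 is a reversible encoding with no key involved. The class name SecretEncodingDemo and the sample value are made up for illustration; the only API used is java.util.Base64 from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Shows that the "data" field of a Secret is merely encoded, not encrypted. */
final class SecretEncodingDemo {
    /** Encodes a value the way it appears under "data" in a Secret manifest. */
    static String encode(String plain) {
        return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    /** Recovers the plaintext; no key or password is needed. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

For example, decode("c2stZGVtbw==") returns "sk-demo", which is all anyone needs to do with the output of kubectl get secret -o yaml.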

What we do now is use External Secrets Operator to integrate with Vault or Alibaba Cloud KMS:

# Install External Secrets Operator first
# helm install external-secrets external-secrets/external-secrets

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: ai-prod
spec:
  provider:
    vault:
      server: "https://vault.internal:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "ai-service-role"
          serviceAccountRef:
            name: ai-service-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-service-external-secret
  namespace: ai-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: ai-service-secrets
    creationPolicy: Owner
  data:
  - secretKey: openai-api-key
    remoteRef:
      key: ai-prod/openai
      property: api_key
  - secretKey: milvus-password
    remoteRef:
      key: ai-prod/milvus
      property: password

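This way the source of truth for the secret lives in Vault, and K8s only holds a "shadow Secret" that is synced over automatically, which also makes periodic rotation easier. Keep in mind that the synced copy is still an ordinary K8s Secret, so RBAC and etcd encryption at rest still matter.

Handling API key rotation in Java

AI services have a special scenario: an LLM API key expires or hits its quota and has to be switched. If a key change required restarting Pods, the impact on production would be too large.

I wrote an OpenAI client with retries and key rotation: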

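// Imports assume the com.theokanning.openai-gpt3-java client, whose class names match
// the ones used here. SecretsManager and createService() are our own code, not shown.
import com.theokanning.openai.OpenAiHttpException;
import com.theokanning.openai.completion.chat.ChatCompletionRequest;
import com.theokanning.openai.completion.chat.ChatCompletionResult;
import com.theokanning.openai.service.OpenAiService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

@Component
@Slf4j
public class ResilientOpenAIClient {

    private final Queue<String> apiKeyPool;
    private final Map<String, Integer> keyFailureCount = new ConcurrentHashMap<>();

    public ResilientOpenAIClient(SecretsManager secretsManager) {
        // Load several API keys from the secret manager (primary key plus backups)
        List<String> keys = secretsManager.getApiKeys("openai");
        this.apiKeyPool = new ConcurrentLinkedQueue<>(keys);
        log.info("OpenAI client initialized, available keys: {}", keys.size());
    }

    public ChatCompletionResult chat(ChatCompletionRequest request) {
        // Try each key currently in the pool at most once
        int maxAttempts = apiKeyPool.size();
        Exception lastException = null;

        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String currentKey = apiKeyPool.peek();
            if (currentKey == null) {
                throw new RuntimeException("No API key available");
            }

            try {
                OpenAiService service = createService(currentKey);
                ChatCompletionResult result = service.createChatCompletion(request);
                // Reset the failure count after a success
                keyFailureCount.put(currentKey, 0);
                return result;
            } catch (OpenAiHttpException e) {
                if (e.statusCode == 401 || e.statusCode == 429) {
                    // Key is invalid or rate-limited: rotate to the next one
                    log.warn("API key rejected (status={}), switching to the next key", e.statusCode);
                    rotateKey(currentKey);
                    lastException = e;
                } else {
                    throw e;
                }
            }
        }

        throw new RuntimeException("All API keys are unavailable", lastException);
    }

    private void rotateKey(String failedKey) {
        // Remove this specific key instead of poll(): under concurrency the head of the
        // queue may already be a different key that another thread rotated in.
        if (!apiKeyPool.remove(failedKey)) {
            return; // another thread has already rotated this key
        }
        int failCount = keyFailureCount.merge(failedKey, 1, Integer::sum);
        if (failCount < 3) {
            // Move the key to the tail rather than dropping it:
            // it may only be rate-limited temporarily and usable again later
            apiKeyPool.offer(failedKey);
            log.info("Key moved to the tail, failure count: {}", failCount);
        } else {
            // This class never re-adds the key; bringing it back needs a restart or extra logic
            log.error("Key failed {} times in a row, removed from the pool", failCount);
            // An alert could be triggered here
        }
    }
}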

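HPA: autoscaling AI services takes real skill

This is the most complex part, and the one where we hit the most pitfalls.

Why you can't just use the CPU metric

For an ordinary web service, CPU utilization is a perfectly reasonable HPA metric. For an AI inference service it is not, for three reasons:

During GPU inference, CPU utilization can be low while the GPU is saturated, so CPU tells you nothing reliable
CPU spikes briefly while the model loads, which triggers spurious scale-ups
Inference requests vary enormously in size (short text vs. long text), so the same QPS can mean wildly different resource consumption

We measured it: for a 7B model, one long inference request (2000 tokens) consumes roughly as much GPU as 20 short ones (50 tokens). A CPU metric cannot tell these two cases apart at all.

HPA based on request queue depth

The most practical option is a custom metric: monitor the depth of the inference request queue and scale up when the backlog exceeds a threshold.

The Java service exposes the custom metrics: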

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.io.BufferedReader;
import java.io.InputStreamReader;

@Component
@Slf4j
public class InferenceMetricsExporter {

    private final InferenceQueue inferenceQueue;
    private final MeterRegistry meterRegistry;

    public InferenceMetricsExporter(InferenceQueue inferenceQueue,
                                     MeterRegistry meterRegistry) {
        this.inferenceQueue = inferenceQueue;
        this.meterRegistry = meterRegistry;
        registerMetrics();
    }

    private void registerMetrics() {
        // Gauge.builder takes the observed object and value function up front;
        // register() only takes the registry.

        // Queue depth
        Gauge.builder("ai.inference.queue.depth", inferenceQueue, InferenceQueue::getPendingCount)
            .description("Number of requests waiting in the inference queue")
            .register(meterRegistry);

        // P99 inference latency
        Gauge.builder("ai.inference.latency.p99", this, InferenceMetricsExporter::getP99Latency)
            .description("P99 inference latency (milliseconds)")
            .register(meterRegistry);

        // GPU memory usage (read via nvidia-smi)
        Gauge.builder("ai.gpu.memory.usage.percent", this, InferenceMetricsExporter::getGpuMemoryUsage)
            .description("GPU memory usage percentage")
            .register(meterRegistry);
    }

    private double getP99Latency() {
        // P99 latency from locally collected statistics
        return inferenceQueue.getLatencyPercentile(0.99);
    }

    private double getGpuMemoryUsage() {
        // Note: this spawns nvidia-smi on every scrape and reads only the first GPU's line
        try {
            Process process = Runtime.getRuntime().exec(
                new String[]{"nvidia-smi", "--query-gpu=memory.used,memory.total",
                             "--format=csv,noheader,nounits"});
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line = reader.readLine();
                if (line != null) {
                    String[] parts = line.trim().split(",\\s*");
                    double used = Double.parseDouble(parts[0]);
                    double total = Double.parseDouble(parts[1]);
                    return (used / total) * 100;
                }
            }
        } catch (Exception e) {
            log.warn("Failed to read GPU memory info", e);
        }
        return 0.0;
    }
}
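The GPU gauge above reads only the first line of the nvidia-smi output, which covers one card. With --format=csv,noheader,nounits the tool prints one "used, total" line per GPU, so on multi-GPU nodes the parsing can be pulled into a small pure function that is also easy to unit test. This is a sketch; NvidiaSmiParser is a hypothetical name, and reporting the maximum across cards is one choice among several.

```java
import java.util.List;

/** Parses "memory.used, memory.total" lines as printed by nvidia-smi in CSV mode. */
final class NvidiaSmiParser {
    /** Converts one line such as "1024, 16384" into a usage percentage. */
    static double usagePercent(String line) {
        String[] parts = line.trim().split(",\\s*");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected nvidia-smi line: " + line);
        }
        double used = Double.parseDouble(parts[0].trim());
        double total = Double.parseDouble(parts[1].trim());
        if (total <= 0) {
            throw new IllegalArgumentException("total memory must be positive: " + line);
        }
        return used / total * 100.0;
    }

    /** Highest usage across all GPUs; blank lines are skipped, empty input yields 0. */
    static double maxUsagePercent(List<String> lines) {
        double max = 0.0;
        for (String line : lines) {
            if (!line.isBlank()) {
                max = Math.max(max, usagePercent(line));
            }
        }
        return max;
    }
}
```

Using the maximum means the gauge reflects the card closest to running out of memory, which is usually the one that causes the OOM.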

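Then configure Prometheus Adapter to expose the metric to the HPA (Micrometer's ai.inference.queue.depth shows up in Prometheus as ai_inference_queue_depth):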

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'ai_inference_queue_depth{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "ai_inference_queue_depth"
        as: "inference_queue_depth"
      metricsQuery: 'avg(ai_inference_queue_depth{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

The HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
  namespace: ai-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
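        averageValue: "5"   # no scale-up while the average backlog per Pod stays at or below 5 requests
  - type: Resource          # with multiple metrics, the HPA uses the largest replica count any of them proposes
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # 60-second stabilization window before scaling up, to avoid flapping
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60    # add at most 2 Pods every 60 seconds
    scaleDown:
      stabilizationWindowSeconds: 300   # 300-second stabilization window before scaling down; be cautious with AI services
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120   # remove at most 1 Pod every 120 seconds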
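To see what the configuration above does with concrete numbers, it helps to run the HPA's core formula by hand: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within the default 10% tolerance. The sketch below is a simplified estimate under stated assumptions: it ignores unready Pods, missing metrics, and stabilization windows, and HpaEstimate is a hypothetical helper, not a Kubernetes API.

```java
/** Simplified estimate of the HPA replica calculation (ignores unready Pods and stabilization). */
final class HpaEstimate {
    /** Default HPA tolerance: ratios within 10% of 1.0 cause no scaling. */
    static final double TOLERANCE = 0.1;

    /** ceil(currentReplicas * currentAvg / targetAvg), clamped to [minReplicas, maxReplicas]. */
    static int desiredReplicas(int currentReplicas, double currentAvg, double targetAvg,
                               int minReplicas, int maxReplicas) {
        double ratio = currentAvg / targetAvg;
        int desired = Math.abs(ratio - 1.0) <= TOLERANCE
            ? currentReplicas
            : (int) Math.ceil(currentReplicas * ratio);
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    /** Applies a "Pods" scale-up policy: add at most maxAddedPods within one period. */
    static int limitScaleUp(int currentReplicas, int desired, int maxAddedPods) {
        return Math.min(desired, currentReplicas + maxAddedPods);
    }
}
```

With 2 replicas and an average queue depth of 12 against the target of 5, the formula asks for ceil(2 × 2.4) = 5 replicas, but the scale-up policy of 2 Pods per 60 seconds allows only 4 in the first period. When the same HPA also has the CPU metric, this calculation runs per metric and the larger result wins.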

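Graceful termination: the right way for an AI service Pod to exit

This detail is often overlooked. When an ordinary web service Pod receives SIGTERM, it just finishes the in-flight requests, which takes very little time. An AI inference service is different: a single inference request can run for 30 seconds or even several minutes.

If terminationGracePeriodSeconds is too short, the Pod is killed forcibly and the inference results are lost.

Graceful shutdown handling in the Java service: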

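import lombok.extern.slf4j.Slf4j;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

import java.util.concurrent.atomic.AtomicBoolean;

@Component
@Slf4j
public class GracefulShutdownHandler {

    private final InferenceQueue inferenceQueue;
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    // Constructor injection; without it the final field would never be initialized
    public GracefulShutdownHandler(InferenceQueue inferenceQueue) {
        this.inferenceQueue = inferenceQueue;
    }

    @EventListener(ContextClosedEvent.class)
    public void onShutdown(ContextClosedEvent event) {
        log.info("Shutdown signal received, starting graceful shutdown...");
        shuttingDown.set(true);

        // 1. Stop accepting new requests
        inferenceQueue.stopAcceptingNewRequests();
        log.info("Stopped accepting new inference requests");

        // 2. Wait for in-flight requests to finish
        long waitStart = System.currentTimeMillis();
        long maxWaitMs = 180_000; // wait at most 3 minutes

        while (inferenceQueue.getActiveCount() > 0) {
            long elapsed = System.currentTimeMillis() - waitStart;
            if (elapsed > maxWaitMs) {
                log.warn("Wait timed out with {} requests unfinished, forcing exit",
                    inferenceQueue.getActiveCount());
                break;
            }
            log.info("Waiting for {} inference requests to finish, waited {}s so far",
                inferenceQueue.getActiveCount(), elapsed / 1000);
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        log.info("Graceful shutdown complete");
    }

    public boolean isShuttingDown() {
        return shuttingDown.get();
    }
}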
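Both the metrics exporter and the shutdown handler depend on an InferenceQueue that is not shown in this article. The following is a minimal sketch of what such a class could look like, assuming only the four methods those two classes call; everything else (the bounded queue, runNext, the unbounded latency list) is a simplification for illustration, and a real service would use a worker pool and a sliding latency window.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal stand-in for the InferenceQueue used by the exporter and the shutdown handler. */
final class InferenceQueue {
    private final BlockingQueue<Runnable> pending;
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicBoolean accepting = new AtomicBoolean(true);
    // Grows without bound; kept simple on purpose
    private final List<Long> latenciesMs = Collections.synchronizedList(new ArrayList<>());

    InferenceQueue(int capacity) {
        this.pending = new LinkedBlockingQueue<>(capacity);
    }

    /** Returns false when shutting down or when the queue is full. */
    boolean submit(Runnable task) {
        return accepting.get() && pending.offer(task);
    }

    /** Runs the next pending task on the caller's thread; returns false if none was waiting. */
    boolean runNext() {
        // Count as active before polling so a task is never invisible to both counters
        active.incrementAndGet();
        try {
            Runnable task = pending.poll();
            if (task == null) {
                return false;
            }
            long start = System.nanoTime();
            try {
                task.run();
            } finally {
                latenciesMs.add((System.nanoTime() - start) / 1_000_000);
            }
            return true;
        } finally {
            active.decrementAndGet();
        }
    }

    void stopAcceptingNewRequests() { accepting.set(false); }

    int getPendingCount() { return pending.size(); }

    int getActiveCount() { return active.get(); }

    /** Nearest-rank percentile over all recorded latencies; 0 when nothing was recorded. */
    double getLatencyPercentile(double p) {
        List<Long> sorted;
        synchronized (latenciesMs) {
            sorted = new ArrayList<>(latenciesMs);
        }
        if (sorted.isEmpty()) {
            return 0.0;
        }
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Note that the shutdown handler only waits on getActiveCount(), so requests still pending in the queue after stopAcceptingNewRequests() need their own policy: either drain them as well or reject them with a retryable error.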

The corresponding Deployment configuration:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 300  # allow a full 5 minutes for graceful shutdown
      containers:
      - name: inference-server
        lifecycle:
          preStop:
            exec:
              # preStop runs before SIGTERM is sent; one more layer of insurance
              command: ["/bin/sh", "-c", "sleep 5"]
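The numbers in the handler and the Deployment have to agree: the grace period starts counting when termination begins, so the preStop sleep and the in-app drain wait both come out of terminationGracePeriodSeconds. A tiny check like the one below makes that budget explicit; ShutdownBudget and the 30-second margin are assumptions for illustration.

```java
/** Checks that preStop time plus the in-app drain wait fits inside the Pod's grace period. */
final class ShutdownBudget {
    /** True when preStop + drain wait + safety margin is within the grace period (all in seconds). */
    static boolean fits(int gracePeriodSec, int preStopSec, int drainWaitSec, int marginSec) {
        return preStopSec + drainWaitSec + marginSec <= gracePeriodSec;
    }
}
```

With the values used here, 5 s of preStop plus the 180 s maximum wait plus a 30 s margin is 215 s, which fits in the 300 s grace period. If maxWaitMs were raised toward 5 minutes without also raising terminationGracePeriodSeconds, the kubelet would send SIGKILL before the handler finished waiting.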

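Putting it all together into a complete deployment architecture

Our current K8s deployment architecture for AI services looks roughly like this:

[Figure: K8s deployment architecture for the AI service]

The core ideas behind the whole setup:

ConfigMap manages inference parameters, with hot reload
External Secrets integrates with Vault as the source of truth for secrets (pair it with etcd encryption at rest so the synced copy is not stored in plaintext)
Custom metrics drive the HPA, using queue depth rather than CPU
Scale up fast and scale down slowly, giving the AI service enough time to stabilize
Set a generous graceful shutdown period so inference requests are not lost

Some practical lessons

On ConfigMap hot-reload latency: updates to a ConfigMap mounted as a volume arrive with a delay, up to the kubelet sync period (1 minute by default) plus the kubelet's ConfigMap cache TTL (also 1 minute by default), so a hot reload can lag by about 2 minutes. ConfigMaps consumed as environment variables or mounted via subPath are not updated at all. If you need faster updates, the Watch API (the approach in the code earlier) beats mounted files.

On node selectors for GPU nodes: an AI service Pod must have a nodeSelector or nodeAffinity that restricts scheduling to GPU nodes. Don't let it land on a CPU node and then fail to load the model. The label key and value below depend on how your cluster labels its nodes.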

spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

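On resource limits: always declare the GPU under limits. Extended resources such as nvidia.com/gpu are granted through limits, and each GPU is allocated whole to one container. Without the declaration the scheduler has no record of which service uses which card, and services on the same node can end up contending for one GPU.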

resources:
  requests:
    cpu: "2"
    memory: "8Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"  # for GPU resources, requests must equal limits

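One last thing: the hardest part of deploying AI services on K8s is not the technology itself. It is understanding how AI services behave at runtime and translating that behavior into the right K8s configuration. Ops engineers may not know AI inference well, and AI engineers may not know K8s well; this middle ground is the skill AI engineers need to fill in.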