Kubernetes Probe Design: Using readinessProbe for Truly Zero-Downtime Rolling Updates

Lao Zhang · 2026/4/30 · about 6 minutes

Audience: developers who deploy Java microservices on K8s and care about service availability | Reading time: about 18 minutes

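Opening story

For a while, every release brought 1 to 2 minutes of 502 errors. The window was short, but users noticed.

The root cause: without a readinessProbe, K8s marks a new Pod Ready as soon as its container is running and starts routing traffic to it. At that point Spring Boot had not finished initializing (the connection pool was not built and the caches were not warmed), so incoming requests failed immediately.

The fix is simple: configure a correct readinessProbe that points at the Spring Boot Actuator readiness endpoint, so the Pod receives traffic only once the application is actually ready.

Since adding this configuration, we have had no more 502s during releases. This post lays out the complete design for all three probes.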

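1. What each of the three probes does

The key distinctions:

startupProbe fails → the container is restarted; liveness and readiness checks are held off until it succeeds
livenessProbe fails → the container is restarted (fairly aggressive)
readinessProbe fails → the Pod is removed from load balancing but keeps running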

2. Spring Boot Actuator health endpoints

2.1 Add the dependency

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

2.2 Configuration

management:
  endpoint:
    health:
      show-details: always    # show full details (can be set per environment)
      probes:
        enabled: true         # enable the liveness and readiness endpoints
  endpoints:
    web:
      exposure:
        include: health, info, metrics
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

Once enabled, two dedicated endpoints are exposed:

GET /actuator/health/liveness → {"status": "UP"}
GET /actuator/health/readiness → {"status": "UP"} or {"status": "OUT_OF_SERVICE"} (returned with HTTP 503)
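
To make the link between these response bodies and the probe result concrete, here is a minimal Java sketch. It assumes Spring Boot's default status mapping (UP and UNKNOWN answer with HTTP 200, DOWN and OUT_OF_SERVICE with 503) and the kubelet rule that an httpGet probe succeeds for any status code from 200 to 399. The class and method names are hypothetical and exist only for illustration.

```java
/** Illustrative model only: how an Actuator health status turns into a probe result. */
final class HttpProbeRule {

    /** HTTP status code the health endpoint answers with, assuming Spring Boot's default mapping. */
    static int httpStatusFor(String healthStatus) {
        switch (healthStatus) {
            case "UP":
            case "UNKNOWN":
                return 200;
            case "DOWN":
            case "OUT_OF_SERVICE":
                return 503;
            default:
                throw new IllegalArgumentException("Status not modeled in this sketch: " + healthStatus);
        }
    }

    /** The kubelet counts an httpGet probe as successful when 200 <= code < 400. */
    static boolean isProbeSuccess(int httpStatusCode) {
        return httpStatusCode >= 200 && httpStatusCode < 400;
    }

    public static void main(String[] args) {
        for (String status : new String[] {"UP", "OUT_OF_SERVICE"}) {
            int code = httpStatusFor(status);
            System.out.println(status + " -> HTTP " + code + " -> probe success: " + isProbeSuccess(code));
        }
    }
}
```

The point of the sketch is that the probe never parses the JSON body; only the HTTP status code decides whether the Pod stays in rotation.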

3. Complete K8s configuration

3.1 Complete Deployment manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # start at most 1 extra new Pod
      maxUnavailable: 0   # no Pod may be unavailable during the update (zero downtime)
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      # Graceful-shutdown budget (must exceed preStop time + application shutdown time)
      terminationGracePeriodSeconds: 90

      containers:
        - name: order-service
          image: your-registry/order-service:1.2.0
          ports:
            - containerPort: 8080

          # Resource limits (always set these!)
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"

          # Environment variables
          env:
            - name: JAVA_OPTS
              value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
            - name: SPRING_PROFILES_ACTIVE
              value: "production"

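          # ===== Startup probe =====
          # Spring Boot can take 30-60 seconds to start
          # Liveness and readiness checks are held off until startupProbe succeeds, so Spring is not killed mid-initialization
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 10   # start probing 10 seconds after the container starts
            periodSeconds: 10         # probe every 10 seconds
            failureThreshold: 12      # up to 12 failures = 120 seconds of probing (enough for Spring Boot to start)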
            successThreshold: 1
            timeoutSeconds: 5

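          # ===== Liveness probe =====
          # Checks that the application is alive (JVM still running, core logic not deadlocked)
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 0    # starts as soon as startupProbe succeeds
            periodSeconds: 15
            failureThreshold: 3       # restart only after 3 consecutive failures (45 seconds), to avoid flapping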
            successThreshold: 1
            timeoutSeconds: 5

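          # ===== Readiness probe =====
          # Checks whether the application can accept traffic
          # On failure, K8s removes the Pod from the Service's Endpoints
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5          # probe every 5 seconds (more often than liveness)
            failureThreshold: 3       # remove from endpoints only after 3 consecutive failures (15 seconds)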
            successThreshold: 1
            timeoutSeconds: 3

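          # Graceful shutdown
          lifecycle:
            preStop:
              exec:
                # preStop runs before SIGTERM is sent to the container. Sleeping 15 seconds
                # gives the endpoint removal time to propagate, so new requests stop arriving
                # before the application begins shutting down
                # Note: preStop time counts against terminationGracePeriodSeconds
                command: ["/bin/sh", "-c", "sleep 15"]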
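
The failureThreshold and successThreshold values in this manifest count consecutive results, not totals. The Java sketch below models that counting for the readinessProbe settings (failureThreshold: 3, successThreshold: 1). It is a simplified model of the behavior, not kubelet code, and the class name ProbeThresholdTracker is hypothetical.

```java
/** Simplified model of how consecutive probe results flip a container's ready state. */
final class ProbeThresholdTracker {
    private final int failureThreshold;
    private final int successThreshold;
    private int consecutiveFailures;
    private int consecutiveSuccesses;
    private boolean ready; // starts as not ready until the first success streak

    ProbeThresholdTracker(int failureThreshold, int successThreshold) {
        if (failureThreshold < 1 || successThreshold < 1) {
            throw new IllegalArgumentException("thresholds must be at least 1");
        }
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    /** Records one probe result and returns the resulting ready state. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            consecutiveSuccesses++;
            consecutiveFailures = 0; // a success breaks any failure streak
            if (consecutiveSuccesses >= successThreshold) {
                ready = true;
            }
        } else {
            consecutiveFailures++;
            consecutiveSuccesses = 0; // a failure breaks any success streak
            if (consecutiveFailures >= failureThreshold) {
                ready = false;
            }
        }
        return ready;
    }

    boolean isReady() {
        return ready;
    }

    public static void main(String[] args) {
        ProbeThresholdTracker readiness = new ProbeThresholdTracker(3, 1);
        boolean[] results = {true, false, false, true, false, false, false};
        for (boolean result : results) {
            System.out.println((result ? "ok" : "fail") + " -> ready=" + readiness.record(result));
        }
    }
}
```

Two isolated failures followed by a success leave the Pod in rotation; only the third failure in a row removes it, which is why a short periodSeconds matters more for readiness than for liveness.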

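3.2 Spring Boot graceful shutdown configuration

# application.yml
server:
  shutdown: graceful  # graceful shutdown (wait for in-flight requests to finish)

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s  # maximum wait per shutdown phase; budget it together with the K8s preStop time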

3.3 Custom health indicators

package com.laozhang.health;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

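/**
 * Custom database health check.
 * It is aggregated into /actuator/health/readiness only if "database" is added to the
 * readiness group (management.endpoint.health.group.readiness.include); by default that
 * group contains only readinessState.
 */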
@Slf4j
@Component("database")
@RequiredArgsConstructor
public class DatabaseHealthIndicator implements HealthIndicator {

    private final JdbcTemplate jdbcTemplate;

    @Override
    public Health health() {
        try {
            // Run a trivial query to verify the database is reachable
            Integer result = jdbcTemplate.queryForObject("SELECT 1", Integer.class);
            if (Integer.valueOf(1).equals(result)) {
                return Health.up()
                    .withDetail("database", "MySQL")
                    .withDetail("status", "reachable")
                    .build();
            }
        } catch (Exception e) {
            log.error("[Health] Database check failed", e);
            return Health.down()
                .withDetail("database", "MySQL")
                .withDetail("error", e.getMessage())
                .build();
        }
        return Health.unknown().build();
    }
}

package com.laozhang.health;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisConnection;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.stereotype.Component;

/**
 * Redis health check
 */
@Slf4j
@Component("redis")
@RequiredArgsConstructor
public class RedisHealthIndicator implements HealthIndicator {

    private final RedisConnectionFactory connectionFactory;

    @Override
    public Health health() {
        // try-with-resources closes the connection, so each check does not leak one
        try (RedisConnection connection = connectionFactory.getConnection()) {
            connection.ping();
            return Health.up()
                .withDetail("redis", "connected")
                .build();
        } catch (Exception e) {
            log.error("[Health] Redis check failed", e);
            return Health.down()
                .withDetail("redis", "unreachable")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

4. The complete flow of a zero-downtime rolling update
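
As a rough way to see the flow, the Java sketch below steps through a rollout with the settings from the Deployment in section 3.1 (replicas: 3, maxSurge: 1, maxUnavailable: 0). It is a toy model under stated assumptions: every new Pod eventually passes its readinessProbe, old Pods leave the endpoints the moment they are retired, and timing is ignored. The class RollingUpdateModel is hypothetical and is not how the Deployment controller is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a RollingUpdate; ignores timing, failing probes, and termination delay. */
final class RollingUpdateModel {

    /** Returns the number of ready Pods observed after every step of the rollout. */
    static List<Integer> simulate(int replicas, int maxSurge, int maxUnavailable) {
        if (replicas < 1 || maxSurge < 0 || maxUnavailable < 0
                || (maxSurge == 0 && maxUnavailable == 0)) {
            throw new IllegalArgumentException("invalid rollout settings");
        }
        int oldReady = replicas; // Pods of the old version, all serving traffic
        int newPending = 0;      // new Pods whose readinessProbe has not passed yet
        int newReady = 0;        // new Pods already in the Service endpoints
        List<Integer> readyCounts = new ArrayList<>();

        while (oldReady > 0 || newReady < replicas) {
            // Step 1: start new Pods while staying within replicas + maxSurge.
            while (oldReady + newPending + newReady < replicas + maxSurge
                    && newPending + newReady < replicas) {
                newPending++;
            }
            readyCounts.add(oldReady + newReady);

            // Step 2: readinessProbe passes, so the pending Pods start receiving traffic.
            newReady += newPending;
            newPending = 0;
            readyCounts.add(oldReady + newReady);

            // Step 3: retire old Pods only while replicas - maxUnavailable stay ready.
            while (oldReady > 0 && oldReady + newReady - 1 >= replicas - maxUnavailable) {
                oldReady--;
            }
            readyCounts.add(oldReady + newReady);
        }
        return readyCounts;
    }

    public static void main(String[] args) {
        // Settings from the Deployment: replicas 3, maxSurge 1, maxUnavailable 0.
        System.out.println(simulate(3, 1, 0));
    }
}
```

In this model the ready count never drops below 3 with maxUnavailable: 0, because an old Pod is retired only after a new one has passed readiness. That guarantee is only as good as the readinessProbe itself: if the probe reports ready too early, step 2 happens before the application can serve requests.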

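5. Pitfalls from production

Pitfall 1: the startupProbe window is too short, and K8s restarts a healthy Pod

Symptom: the Java application needs 60 seconds to start, but startupProbe is configured with failureThreshold: 3 and periodSeconds: 10. The probe gives up after 30 seconds, so the Pod is restarted over and over.

Fix: size the window from the measured startup time and leave at least 50% headroom: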

startupProbe:
  periodSeconds: 10
  failureThreshold: 18   # 18 * 10 = 180 seconds, well above the application's startup time
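
The arithmetic behind this pitfall can be written down as a small helper. The Java sketch below assumes the startup window is roughly initialDelaySeconds + periodSeconds × failureThreshold (probe timeouts are ignored), and the class and method names are hypothetical.

```java
/** Helper arithmetic for sizing a startupProbe; an approximation that ignores probe timeouts. */
final class StartupProbeBudget {

    /** Approximate time in seconds before the kubelet gives up and restarts the container. */
    static int maxStartupSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    /** Smallest failureThreshold that covers the measured startup time plus a safety margin. */
    static int failureThresholdFor(int measuredStartupSeconds, int periodSeconds, double margin) {
        double targetSeconds = measuredStartupSeconds * (1.0 + margin);
        return (int) Math.ceil(targetSeconds / periodSeconds);
    }

    public static void main(String[] args) {
        // The failing configuration from the symptom: 3 failures at a 10-second period.
        System.out.println("broken window: " + maxStartupSeconds(0, 10, 3) + "s");
        // Minimum threshold for a 60-second startup with 50% headroom.
        System.out.println("minimum failureThreshold: " + failureThresholdFor(60, 10, 0.5));
        // The fixed configuration: 18 failures at a 10-second period.
        System.out.println("fixed window: " + maxStartupSeconds(0, 10, 18) + "s");
    }
}
```

For a 60-second startup and a 10-second period, 50% headroom works out to a failureThreshold of at least 9; the value of 18 used above is simply a more generous choice.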

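Pitfall 2: the readinessProbe path returns 503

Symptom: /actuator/health/readiness returns 503 and the Pod never becomes ready.

Root cause: a custom HealthIndicator that is part of the readiness group returned DOWN, which pulled the aggregated readiness status down.

Diagnosis: open /actuator/health, read the details, and find which indicator is DOWN.

Exclude non-critical dependencies in the configuration: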

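management:
  health:
    # If the message queue does not affect core functionality, its indicator can be disabled
    rabbit:
      enabled: false
  # Or keep the indicator but list only the critical ones in the readiness group
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,database,redis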

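Pitfall 3: preStop time and terminationGracePeriodSeconds do not match

The correct relationship:

terminationGracePeriodSeconds >= preStop duration + application graceful shutdown time

Example:
preStop: sleep 15         # 15 seconds
Application graceful shutdown: at most 30 seconds   # spring.lifecycle.timeout-per-shutdown-phase
terminationGracePeriodSeconds: 90  # 15 + 30 + 45 seconds of headroom

If terminationGracePeriodSeconds is too small, K8s force-kills the container with SIGKILL and in-flight requests are cut off.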
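
The relationship above is easy to check mechanically, for example in a deployment review script. A minimal Java sketch, with hypothetical class and method names, using the numbers from the example:

```java
/** Checks the shutdown budget: grace period versus preStop plus application shutdown time. */
final class TerminationBudget {

    /** Seconds left over after preStop and graceful shutdown; negative means SIGKILL risk. */
    static int headroomSeconds(int terminationGracePeriodSeconds,
                               int preStopSeconds,
                               int gracefulShutdownSeconds) {
        return terminationGracePeriodSeconds - (preStopSeconds + gracefulShutdownSeconds);
    }

    /** True when the grace period covers preStop plus the application's shutdown timeout. */
    static boolean isSafe(int terminationGracePeriodSeconds,
                          int preStopSeconds,
                          int gracefulShutdownSeconds) {
        return headroomSeconds(terminationGracePeriodSeconds, preStopSeconds, gracefulShutdownSeconds) >= 0;
    }

    public static void main(String[] args) {
        // Values from the example: grace period 90, preStop sleep 15, shutdown timeout 30.
        System.out.println("headroom: " + headroomSeconds(90, 15, 30) + "s");
        // The Kubernetes default grace period of 30 seconds would not cover 15 + 30.
        System.out.println("safe with a 30s grace period: " + isSafe(30, 15, 30));
    }
}
```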

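6. Summary

Guidelines for configuring the three probes correctly:

startupProbe: runs only during startup; give it a large failureThreshold (enough time to start), and liveness checks begin only after it succeeds
livenessProbe: detects whether the application is alive; do not set the failure threshold too low (network jitter would cause false kills); a failure triggers a restart
readinessProbe: detects whether the application is ready; use a short period (unhealthy instances are removed quickly); a failure only removes traffic and never restarts the Pod

The sufficient conditions for a zero-downtime rolling update: maxUnavailable=0 + a correct readinessProbe + preStop combined with graceful shutdown.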