Post #2337: Health Check Design for Java AI Services, or How to Define an AI System's Ready State

Lao Zhang (老张) · 2026/4/30 · about 5 minutes

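Intended audience: Java engineers deploying AI services on Kubernetes or in the cloud, and DevOps engineers who care about the high availability of AI services | Reading time: about 14 minutes | Key takeaway: design health checks that fit how AI services actually fail, and avoid production incidents caused by "false health"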

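We once had a very strange outage: monitoring showed the service as healthy (every health check was green), yet users were complaining that the AI features were completely unusable.

The root cause turned out to be this: the default Spring Boot Actuator /health endpoint only checked the database connection and disk space. It did not check connectivity to the LLM API. The LLM provider had an incident on their side, and our service was "healthily" returning 500 to every AI request.

That incident made me rethink a basic question: for an AI service, what does "healthy" actually mean?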

Three layers of AI service health

A traditional service only needs to check layer A; an AI service needs to check all three layers.

Custom health checks with Spring Boot Actuator

Spring AI provides some built-in health indicators, but their coverage is incomplete, so we need to add our own:

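// LLM API health check
import java.time.Duration;
import java.util.Locale;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("llmApi")
@Slf4j
public class LlmApiHealthIndicator implements HealthIndicator {
    
    private final ChatClient chatClient;
    // Cache of the health result (avoids spending tokens on an LLM call for every probe)
    private final Cache<String, Health> healthCache;
    
    public LlmApiHealthIndicator(ChatClient.Builder builder) {
        this.chatClient = builder.build();
        this.healthCache = Caffeine.newBuilder()
                .expireAfterWrite(Duration.ofSeconds(30))  // reuse the result for 30 seconds
                .build();
    }
    
    @Override
    public Health health() {
        // Both UP and DOWN results are cached, so a recovery can take up to 30 seconds to show
        return healthCache.get("llm_health", key -> checkLlmHealth());
    }
    
    private Health checkLlmHealth() {
        long start = System.currentTimeMillis();
        try {
            // Verify connectivity with the cheapest possible request (it still costs a few tokens)
            String response = chatClient.prompt()
                    .user("Hi")
                    .call()
                    .content();
            
            long duration = System.currentTimeMillis() - start;
            
            if (response == null || response.isBlank()) {
                return Health.down()
                        .withDetail("reason", "LLM returned an empty response")
                        .withDetail("duration_ms", duration)
                        .build();
            }
            
            // Healthy; report the latency as a detail
            Health.Builder builder = Health.up()
                    .withDetail("duration_ms", duration);
            
            if (duration > 5000) {
                // High latency, but still usable: report the custom status SLOW
                builder.status("SLOW").withDetail("warning", "Slow response: " + duration + "ms");
            }
            
            return builder.build();
            
        } catch (Exception e) {
            long duration = System.currentTimeMillis() - start;
            log.warn("LLM health check failed", e);
            
            String errorType = classifyError(e);
            return Health.down()
                    .withDetail("error_type", errorType)
                    .withDetail("error", e.getMessage())
                    .withDetail("duration_ms", duration)
                    .build();
        }
    }
    
    // Heuristic classification based on the exception message (case-insensitive)
    private String classifyError(Exception e) {
        String msg = e.getMessage();
        if (msg == null) return "UNKNOWN";
        String m = msg.toLowerCase(Locale.ROOT);
        if (m.contains("429")) return "RATE_LIMITED";
        if (m.contains("401") || m.contains("403")) return "AUTH_FAILED";
        if (m.contains("timeout") || m.contains("timed out")) return "TIMEOUT";
        if (m.contains("connection refused") || m.contains("connect")) return "CONNECTION_FAILED";
        return "SERVICE_ERROR";
    }
}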
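
One weakness of caching inside health() is that the first probe after the cache expires still waits for the LLM call, and the Kubernetes default probe timeout (timeoutSeconds) is 1 second. A possible alternative is to refresh the status in the background and let probes read the last stored value. The following is a minimal plain-Java sketch of that idea under stated assumptions: BackgroundHealthCache, its method names, and the string status codes are hypothetical and not part of Spring or of the code above.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical helper: stores the latest dependency status so probes never wait on the LLM. */
final class BackgroundHealthCache {

    private final Supplier<String> check;   // returns a status code such as "UP" or "DOWN"
    private final Duration timeout;         // hard limit for a single check
    private final AtomicReference<String> latest = new AtomicReference<>("UNKNOWN");

    BackgroundHealthCache(Supplier<String> check, Duration timeout) {
        this.check = check;
        this.timeout = timeout;
    }

    /** Runs the check once with a time limit and stores the outcome. */
    void refreshOnce() {
        String status;
        try {
            status = CompletableFuture.supplyAsync(check)
                    .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            status = "DOWN";
        } catch (ExecutionException | TimeoutException e) {
            // A failure inside the check or a timeout both count as DOWN
            status = "DOWN";
        }
        latest.set(status);
    }

    /** Starts periodic refreshes on a daemon thread; the caller owns the returned scheduler. */
    ScheduledExecutorService start(Duration period) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "health-refresh");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, period.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }

    /** Returns immediately with the last stored status; this is what a probe would read. */
    String current() {
        return latest.get();
    }
}
```

In a setup like this, the health indicator would only call current(), so the probe latency no longer depends on the LLM provider. The trade-off is that the status is reported as UNKNOWN until the first refresh completes.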

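// Vector database health check
import lombok.RequiredArgsConstructor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// The bean name deliberately differs from "vectorStore", which may clash with the
// auto-configured VectorStore bean. Actuator strips the "HealthIndicator" suffix,
// so this still appears as "vectorStore" under /actuator/health.
@Component("vectorStoreHealthIndicator")
@RequiredArgsConstructor
public class VectorStoreHealthIndicator implements HealthIndicator {
    
    private final VectorStore vectorStore;
    
    @Override
    public Health health() {
        try {
            // Run one tiny similarity query (1 result, lowest similarity threshold).
            // This is the fluent API of earlier Spring AI milestones; newer releases
            // use SearchRequest.builder().query(...).topK(1).similarityThreshold(0.0).build()
            vectorStore.similaritySearch(
                    SearchRequest.query("health check")
                            .withTopK(1)
                            .withSimilarityThreshold(0.0)
            );
            
            return Health.up()
                    .withDetail("total_documents", "unknown")  // some VectorStore implementations do not support count
                    .withDetail("query_successful", true)
                    .build();
                    
        } catch (Exception e) {
            return Health.down()
                    .withDetail("error", e.getMessage())
                    .build();
        }
    }
}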

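Distinguishing Liveness from Readiness

Kubernetes has two core probes, and they mean different things for an AI service:

Liveness: is the service process still alive? A failure causes the kubelet to restart the container.

For an AI service: is the JVM working normally, has it run out of memory, are critical threads deadlocked
It should not include the LLM API check (an LLM API outage should not trigger restarts, because a restart cannot fix the provider)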
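
The JVM-level conditions listed above are not inspected by Spring Boot's livenessState on its own, which only reflects the application's availability state. A minimal sketch of what such a check could look like with the standard java.lang.management API follows. JvmLivenessCheck and the heap-ratio threshold are hypothetical, and heap usage is only a crude proxy for memory trouble, since a nearly full heap may simply not have been collected yet.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

/** Hypothetical JVM-only liveness check: no network calls, no external dependencies. */
final class JvmLivenessCheck {

    record Result(boolean alive, String reason) {}

    static Result check(double maxHeapRatio) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Both methods return null when no deadlock is detected
        long[] deadlocked = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        if (deadlocked != null && deadlocked.length > 0) {
            return new Result(false, deadlocked.length + " deadlocked threads");
        }

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();   // -1 when the maximum is undefined
        if (max > 0 && (double) heap.getUsed() / max > maxHeapRatio) {
            return new Result(false, "heap usage above " + maxHeapRatio);
        }

        return new Result(true, "ok");
    }
}
```

A check like this could back a custom HealthIndicator added to the liveness group. The important property is that nothing in it depends on the LLM provider or the vector database.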

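Readiness: can the service handle requests? A failure removes the Pod from load balancing.

For an AI service: is the LLM API reachable, is the vector database responding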

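import lombok.RequiredArgsConstructor;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.actuate.health.Status;
import org.springframework.stereotype.Component;

@Component("ai-readiness")
@RequiredArgsConstructor
public class AiReadinessIndicator implements HealthIndicator {
    
    private final LlmApiHealthIndicator llmHealth;
    private final VectorStoreHealthIndicator vectorStoreHealth;
    
    @Override
    public Health health() {
        Health llm = llmHealth.health();
        Health vectorStore = vectorStoreHealth.health();
        
        // SLOW is the custom status set by LlmApiHealthIndicator; it still counts as usable
        boolean llmOk = llm.getStatus().equals(Status.UP) || 
                        llm.getStatus().getCode().equals("SLOW");
        boolean vsOk = vectorStore.getStatus().equals(Status.UP);
        
        if (llmOk && vsOk) {
            return Health.up()
                    .withDetail("llm", llm.getStatus())
                    .withDetail("vectorStore", vectorStore.getStatus())
                    .build();
        }
        
        // If any core dependency is unavailable, mark the service as not ready
        return Health.down()
                .withDetail("llm", llm.getStatus())
                .withDetail("llm_detail", llm.getDetails())
                .withDetail("vectorStore", vectorStore.getStatus())
                .build();
    }
}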

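# application.yml: configure health check groups
management:
  health:
    # Expose the application availability states as health indicators
    readinessstate:
      enabled: true
    livenessstate:
      enabled: true
  endpoint:
    health:
      group:
        # Readiness probe: includes the AI dependency checks
        readiness:
          include:
            - readinessState
            - ai-readiness      # custom AI readiness check
            - db                # database
        # Liveness probe: application liveness state only, no external dependencies
        liveness:
          include:
            - livenessState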

# Kubernetes deployment configuration
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 2   # removed after 2 consecutive failures (more sensitive than liveness)
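
To get a feel for how quickly these settings react, here is a rough worked estimate in Java. It assumes the failing check returns quickly, ignores the probe timeout and the time Kubernetes needs to propagate endpoint changes, and treats the 30-second result cache in LlmApiHealthIndicator as extra delay. ProbeTiming is a hypothetical helper, and the result is an approximate upper bound, not a guarantee.

```java
/** Hypothetical helper: rough upper bound on how long an outage stays unnoticed by a probe. */
final class ProbeTiming {

    /**
     * Worst case: the dependency fails right after a healthy result was cached, so the
     * cached UP is served for a full TTL, and then failureThreshold consecutive probes
     * (one per period) must fail before Kubernetes acts.
     */
    static int worstCaseSeconds(int cacheTtlSeconds, int periodSeconds, int failureThreshold) {
        return cacheTtlSeconds + periodSeconds * failureThreshold;
    }

    public static void main(String[] args) {
        // Readiness: 30 s cache, probe every 15 s, 2 failures
        System.out.println("readiness: " + worstCaseSeconds(30, 15, 2) + " s");
        // Liveness: no cached dependency check, probe every 30 s, 3 failures
        System.out.println("liveness: " + worstCaseSeconds(0, 30, 3) + " s");
    }
}
```

Under these assumptions a Pod can keep receiving traffic for about a minute after the LLM API goes down, which is one reason the request path still needs its own degradation logic.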

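Graceful degradation: handling partial health

Different features of an AI service depend on different things:

Order lookup (tool calling): does not need the vector database
RAG question answering: needs the vector database
Basic chat: only needs the LLM API

This makes fine-grained, per-feature degradation possible, instead of declaring the whole service unavailable: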

import lombok.RequiredArgsConstructor;
import org.springframework.boot.actuate.health.Status;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class AiFeatureFlags {
    
    private final LlmApiHealthIndicator llmHealth;
    private final VectorStoreHealthIndicator vectorStoreHealth;
    
    // Basic chat only needs the LLM API; a SLOW LLM still counts as available
    public boolean isBasicChatAvailable() {
        Status status = llmHealth.health().getStatus();
        return status.equals(Status.UP) || status.getCode().equals("SLOW");
    }
    
    // RAG needs both the LLM API and the vector database
    public boolean isRagAvailable() {
        return isBasicChatAvailable() && vectorStoreHealth.health().getStatus().equals(Status.UP);
    }
    
    public AvailableFeatures getAvailableFeatures() {
        boolean chat = isBasicChatAvailable();
        boolean rag = isRagAvailable();
        return new AvailableFeatures(chat, rag);
    }
    
    public record AvailableFeatures(boolean basicChat, boolean rag) {}
}

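// Usage in a controller. The class name is illustrative; AiResponse, AskRequest,
// ChatService and RagService are the project's own types.
@RestController
@RequiredArgsConstructor
public class AiController {

    private final AiFeatureFlags featureFlags;
    private final ChatService chatService;
    private final RagService ragService;

    @PostMapping("/ask")
    public ResponseEntity<AiResponse> ask(@RequestBody AskRequest request) {
        AiFeatureFlags.AvailableFeatures features = featureFlags.getAvailableFeatures();
        
        if (!features.basicChat()) {
            // ResponseEntity has no serviceUnavailable() shortcut, so set the 503 status explicitly
            return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                    .body(AiResponse.degraded("The AI service is temporarily unavailable, please try again later"));
        }
        
        if (request.needsKnowledgeBase() && !features.rag()) {
            // Knowledge base unavailable: degrade to plain chat and tell the user
            return ResponseEntity.ok(
                    chatService.chatWithoutKnowledge(request.message(), "(Note: the knowledge base is temporarily unavailable; this answer is based on the model's built-in knowledge)"));
        }
        
        return ResponseEntity.ok(ragService.ask(request.message()));
    }
}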
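
The feature-to-dependency list from this section can also be written down as data, which keeps the rules in one place as more features are added. The sketch below is plain Java with hypothetical enum names. It assumes that order lookup still goes through the LLM because it is a tool call; the text only states that it does not need the vector database.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical names for the two external dependencies discussed in this article. */
enum Dependency { LLM_API, VECTOR_STORE }

/** Each feature declares the dependencies it cannot work without. */
enum Feature {
    BASIC_CHAT(EnumSet.of(Dependency.LLM_API)),
    // Assumption: a tool call is still driven by the LLM, so it needs the LLM API
    ORDER_QUERY(EnumSet.of(Dependency.LLM_API)),
    RAG_QA(EnumSet.of(Dependency.LLM_API, Dependency.VECTOR_STORE));

    private final Set<Dependency> required;

    Feature(Set<Dependency> required) {
        this.required = required;
    }

    boolean availableWith(Set<Dependency> healthy) {
        return healthy.containsAll(required);
    }

    /** Returns the features that can be served given the currently healthy dependencies. */
    static Set<Feature> available(Set<Dependency> healthy) {
        EnumSet<Feature> result = EnumSet.noneOf(Feature.class);
        for (Feature feature : values()) {
            if (feature.availableWith(healthy)) {
                result.add(feature);
            }
        }
        return result;
    }
}
```

With only the LLM API healthy, Feature.available(EnumSet.of(Dependency.LLM_API)) contains BASIC_CHAT and ORDER_QUERY but not RAG_QA, which matches the degradation path in the controller.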

Monitoring and alerting on health checks

A change in health status should trigger an alert:

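import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.Status;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Simplified sketch. As far as I know, Spring Boot Actuator does not publish an event
// when an indicator's status changes, so this version polls the readiness indicator
// instead of listening for a HealthChangedEvent. AlertService is the project's own type.
@Component
@Slf4j
@RequiredArgsConstructor
public class HealthChangeAlertListener {
    
    private final AlertService alertService;
    private final AiReadinessIndicator aiReadiness;
    private volatile Status lastStatus = Status.UP;
    
    @Scheduled(fixedDelay = 30000)  // check every 30 seconds (requires @EnableScheduling)
    public void checkHealthChange() {
        onHealthChanged(aiReadiness.health());
    }
    
    void onHealthChanged(Health health) {
        Status newStatus = health.getStatus();
        
        if (!newStatus.equals(lastStatus)) {
            log.info("AI service health changed: {} -> {}", lastStatus, newStatus);
            if (Status.UP.equals(lastStatus) && !Status.UP.equals(newStatus)) {
                // Went from healthy to unhealthy
                alertService.sendAlert(String.format(
                        "[ALERT] AI service health status changed: %s -> %s\nDetails: %s",
                        lastStatus, newStatus, health.getDetails()));
            } else if (Status.UP.equals(newStatus)) {
                // Back to healthy
                alertService.sendAlert("[RECOVERED] AI service is back to normal");
            }
            lastStatus = newStatus;
        }
    }
}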

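The core principle of health check design for AI services: do not let healthy infrastructure mask unavailable AI functionality. The health that users perceive is the health that counts.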