No. 2337: Health Check Design for Java AI Services (How to Define an AI System's Ready State)
2026/4/30 · about 5 minutes
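Intended readers: Java engineers deploying AI services on Kubernetes or in the cloud, and ops developers who care about AI service high availability | Reading time: about 14 minutes | Core value: design health checks that fit how AI services actually fail, and avoid production incidents caused by "false health"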
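We once had a very strange outage: monitoring showed the service as healthy (every health check was green), yet users were complaining that the AI features were completely unusable.
The root cause: Spring Boot Actuator's default /health endpoint only checked the database connection and disk space. It did not check connectivity to the LLM API. The LLM provider had a problem on their side, and our service was "healthily" returning 500 to every AI request.
That incident made me rethink the question: for an AI service, what does "healthy" actually mean?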
Three layers of AI service health
A traditional service only needs to check layer A; an AI service needs to check all three layers.
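Custom health checks with Spring Boot Actuator
Spring AI provides some built-in health indicators, but their coverage is incomplete, so they need to be supplemented: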
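// LLM API health check
// Imports assume Spring Boot 3.x Actuator, Spring AI, Caffeine, and Lombok on the classpath
import java.time.Duration;
import java.util.Locale;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("llmApi")
@Slf4j
public class LlmApiHealthIndicator implements HealthIndicator {
    private final ChatClient chatClient;
    // Cache the result so that not every probe calls the LLM and burns tokens
    private final Cache<String, Health> healthCache;

    public LlmApiHealthIndicator(ChatClient.Builder builder) {
        this.chatClient = builder.build();
        this.healthCache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(30)) // reuse the result for 30 seconds
            .build();
    }

    @Override
    public Health health() {
        return healthCache.get("llm_health", key -> checkLlmHealth());
    }

    private Health checkLlmHealth() {
        long start = System.currentTimeMillis();
        try {
            // Verify connectivity with the cheapest possible request
            String response = chatClient.prompt()
                .user("Hi")
                .call()
                .content();
            long duration = System.currentTimeMillis() - start;
            if (response == null || response.isBlank()) {
                return Health.down()
                    .withDetail("reason", "LLM returned an empty response")
                    .withDetail("duration_ms", duration)
                    .build();
            }
            // Healthy; include latency information
            Health.Builder builder = Health.up()
                .withDetail("duration_ms", duration);
            if (duration > 5000) {
                // High latency but still usable. SLOW is a custom status: it is not in
                // Actuator's default status order, so configure
                // management.endpoint.health.status.order if it should affect aggregation.
                builder.status("SLOW").withDetail("warning", "Slow response: " + duration + "ms");
            }
            return builder.build();
        } catch (Exception e) {
            long duration = System.currentTimeMillis() - start;
            log.warn("LLM health check failed", e);
            String errorType = classifyError(e);
            return Health.down()
                .withDetail("error_type", errorType)
                .withDetail("error", String.valueOf(e.getMessage())) // getMessage() may be null
                .withDetail("duration_ms", duration)
                .build();
        }
    }

    // Best-effort classification by message text; brittle, since messages vary by provider
    private String classifyError(Exception e) {
        String msg = e.getMessage();
        if (msg == null) return "UNKNOWN";
        String lower = msg.toLowerCase(Locale.ROOT);
        if (lower.contains("429")) return "RATE_LIMITED";
        if (lower.contains("401") || lower.contains("403")) return "AUTH_FAILED";
        if (lower.contains("timeout") || lower.contains("timed out")) return "TIMEOUT";
        if (lower.contains("connection refused") || lower.contains("connect")) return "CONNECTION_FAILED";
        return "SERVICE_ERROR";
    }
}
// Vector database health check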
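// Bean named vectorStoreHealthIndicator to avoid a possible name clash with a VectorStore
// bean called "vectorStore"; Actuator strips the HealthIndicator suffix, so the health
// component is still reported as "vectorStore".
@Component("vectorStoreHealthIndicator")
@RequiredArgsConstructor // Lombok: generates the constructor for the final field
public class VectorStoreHealthIndicator implements HealthIndicator {
    private final VectorStore vectorStore;

    @Override
    public Health health() {
        try {
            // Run one minimal similarity query (1 result, lowest similarity threshold).
            // This is the pre-1.0 milestone SearchRequest API; later Spring AI releases
            // use SearchRequest.builder() instead.
            vectorStore.similaritySearch(
                SearchRequest.query("health check")
                    .withTopK(1)
                    .withSimilarityThreshold(0.0)
            );
            return Health.up()
                .withDetail("total_documents", "unknown") // some VectorStores do not support count
                .withDetail("query_successful", true)
                .build();
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", String.valueOf(e.getMessage()))
                .build();
        }
    }
}
Distinguishing Liveness and Readiness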
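Kubernetes has two core probes, and they mean different things for an AI service:
Liveness: is the service process still alive? A failure restarts the Pod.
- AI service: is the JVM working, has memory hit OOM, are critical threads deadlocked
- It should not include the LLM API check (an LLM API outage should not cause Pod restarts)
Readiness: can the service handle requests? A failure removes the Pod from load balancing.
- AI service: is the LLM API reachable, is the vector database responding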
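The split described above can be sketched without any framework. This is only an illustration under my own assumptions: the `Snapshot` record and its three flags are hypothetical names, not part of Spring Boot or Kubernetes. The point it shows is that a provider outage should flip readiness but leave liveness alone, because restarting the Pod cannot fix someone else's API.

```java
/** Framework-free sketch: which probe should each kind of failure affect? */
final class ProbeModel {

    /** Hypothetical snapshot of what the service can observe about itself. */
    record Snapshot(boolean jvmResponsive, boolean llmReachable, boolean vectorStoreResponding) {}

    /** Liveness looks only at in-process state; a restart cannot fix a provider outage. */
    static boolean isLive(Snapshot s) {
        return s.jvmResponsive();
    }

    /** Readiness adds every external dependency needed to serve requests. */
    static boolean isReady(Snapshot s) {
        return s.jvmResponsive() && s.llmReachable() && s.vectorStoreResponding();
    }

    public static void main(String[] args) {
        // LLM provider is down, everything else is fine
        Snapshot llmOutage = new Snapshot(true, false, true);
        // prints "live=true ready=false"
        System.out.println("live=" + isLive(llmOutage) + " ready=" + isReady(llmOutage));
    }
}
```

With this shape, an LLM outage takes the Pod out of rotation without triggering a restart loop.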
// Readiness indicator that combines the AI dependencies
@Component("ai-readiness")
@RequiredArgsConstructor // Lombok: generates the constructor for the final fields
public class AiReadinessIndicator implements HealthIndicator {
    private final LlmApiHealthIndicator llmHealth;
    private final VectorStoreHealthIndicator vectorStoreHealth;

    @Override
    public Health health() {
        Health llm = llmHealth.health();
        Health vectorStore = vectorStoreHealth.health();
        // SLOW still counts as usable for readiness
        boolean llmOk = Status.UP.equals(llm.getStatus()) ||
            "SLOW".equals(llm.getStatus().getCode());
        boolean vsOk = Status.UP.equals(vectorStore.getStatus());
        if (llmOk && vsOk) {
            return Health.up()
                .withDetail("llm", llm.getStatus())
                .withDetail("vectorStore", vectorStore.getStatus())
                .build();
        }
        // Any core dependency unavailable: mark as not ready
        return Health.down()
            .withDetail("llm", llm.getStatus())
            .withDetail("llm_detail", llm.getDetails())
            .withDetail("vectorStore", vectorStore.getStatus())
            .build();
    }
}
# application.yml: configure health check groups
management:
  health:
    # Readiness probe: includes the AI dependency checks
    readinessState:
      enabled: true
    livenessState:
      enabled: true
  endpoint:
    health:
      group:
        readiness:
          include:
            - readinessState
            - ai-readiness   # custom AI readiness check
            - db             # database
        liveness:
          include:
            - livenessState  # JVM state only
# Kubernetes deployment configuration
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 2  # removed after 2 failures (more sensitive than liveness)
Graceful degradation: handling partial health
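Different features of an AI service depend on different things:
- Order lookup (tool calling): does not need the vector database
- RAG Q&A: needs the vector database
- Basic chat: only needs the LLM API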
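One way to keep this list from drifting away from the code is to write it down as data. The sketch below is framework-free and uses hypothetical names (`Dependency`, `Feature`, `REQUIRES`). Note one assumption of mine: I list the LLM API as a requirement for order lookup because tool calling still goes through the model, while the list above only says it does not need the vector database.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Sketch: derive the available features from the set of healthy dependencies. */
final class FeatureAvailability {

    enum Dependency { LLM_API, VECTOR_STORE }

    enum Feature { ORDER_LOOKUP, RAG_QA, BASIC_CHAT }

    // Requirements per feature. ORDER_LOOKUP -> LLM_API is an assumption (see lead-in).
    static final Map<Feature, Set<Dependency>> REQUIRES = Map.of(
        Feature.ORDER_LOOKUP, Set.of(Dependency.LLM_API),
        Feature.RAG_QA, Set.of(Dependency.LLM_API, Dependency.VECTOR_STORE),
        Feature.BASIC_CHAT, Set.of(Dependency.LLM_API));

    /** A feature is available only when all of its dependencies are healthy. */
    static Set<Feature> available(Set<Dependency> healthy) {
        EnumSet<Feature> result = EnumSet.noneOf(Feature.class);
        for (Map.Entry<Feature, Set<Dependency>> entry : REQUIRES.entrySet()) {
            if (healthy.containsAll(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

For example, `FeatureAvailability.available(EnumSet.of(Dependency.LLM_API))` yields order lookup and basic chat but not RAG Q&A, which is exactly the partial-health case worth handling.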
So instead of taking the whole service down, we can degrade feature by feature:
@Service
@RequiredArgsConstructor
public class AiFeatureFlags {
    private final LlmApiHealthIndicator llmHealth;
    private final VectorStoreHealthIndicator vectorStoreHealth;

    public boolean isBasicChatAvailable() {
        Status status = llmHealth.health().getStatus();
        return Status.UP.equals(status) || "SLOW".equals(status.getCode());
    }

    public boolean isRagAvailable() {
        return isBasicChatAvailable() && Status.UP.equals(vectorStoreHealth.health().getStatus());
    }

    public AvailableFeatures getAvailableFeatures() {
        boolean chat = isBasicChatAvailable();
        boolean rag = isRagAvailable();
        return new AvailableFeatures(chat, rag);
    }

    public record AvailableFeatures(boolean basicChat, boolean rag) {}
}

// Usage in a controller (featureFlags, chatService and ragService are injected fields; declarations omitted)
@PostMapping("/ask")
public ResponseEntity<AiResponse> ask(@RequestBody AskRequest request) {
    AiFeatureFlags.AvailableFeatures features = featureFlags.getAvailableFeatures();
    if (!features.basicChat()) {
        // ResponseEntity has no serviceUnavailable() shortcut, so set the status explicitly
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
            .body(AiResponse.degraded("The AI service is temporarily unavailable, please try again later"));
    }
    if (request.needsKnowledgeBase() && !features.rag()) {
        // Knowledge base unavailable: fall back to plain chat and tell the user
        return ResponseEntity.ok(
            chatService.chatWithoutKnowledge(request.message(), "(Note: the knowledge base is temporarily unavailable; this answer is based on the model's built-in knowledge)"));
    }
    return ResponseEntity.ok(ragService.ask(request.message()));
}
Monitoring and alerting on health checks
A change in health status should trigger an alert.
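As far as I know, Actuator does not publish an event when a health indicator's status changes, so something has to poll and compare. Here is a minimal framework-free sketch of that polling step. `HealthTransitionPoller` and `Transition` are hypothetical names, statuses are plain strings to keep it self-contained, and in a Spring application `poll()` would be driven by a scheduler and `onChange` would publish an application event.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch: poll a status source and report only the transitions. */
final class HealthTransitionPoller {

    /** A change from one status code to another, e.g. UP -> DOWN. */
    record Transition(String from, String to) {}

    private final Supplier<String> currentStatus;
    private final Consumer<Transition> onChange;
    private String last;

    HealthTransitionPoller(String initial, Supplier<String> currentStatus, Consumer<Transition> onChange) {
        this.last = initial;
        this.currentStatus = currentStatus;
        this.onChange = onChange;
    }

    /** Call on a fixed schedule; notifies the callback only when the status differs from the last poll. */
    synchronized void poll() {
        String now = currentStatus.get();
        if (!now.equals(last)) {
            Transition transition = new Transition(last, now);
            last = now;
            onChange.accept(transition);
        }
    }
}
```

Reporting transitions rather than states keeps a long outage from producing one alert per poll.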
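// Alert listener: turns health status transitions into alerts
@Component
@Slf4j
@RequiredArgsConstructor // Lombok: generates the constructor for alertService
public class HealthChangeAlertListener {
    private final AlertService alertService;
    private volatile Status lastStatus = Status.UP;

    @Scheduled(fixedDelay = 30000) // check every 30 seconds
    public void checkHealthChange() {
        // Get the current health status through Actuator.
        // Simplified sketch: in practice, inject the individual HealthIndicators
        // (and, presumably, publish a HealthChangedEvent when the status differs).
    }

    // HealthChangedEvent is assumed to be an application-defined event;
    // as far as I know, Actuator does not ship one.
    @EventListener
    public void onHealthChanged(HealthChangedEvent event) {
        Status newStatus = event.getHealth().getStatus();
        if (!newStatus.equals(lastStatus)) {
            if (Status.UP.equals(lastStatus) && !Status.UP.equals(newStatus)) {
                // Went from healthy to unhealthy
                alertService.sendAlert(String.format(
                    "[ALERT] AI service health status changed: %s -> %s\nDetails: %s",
                    lastStatus, newStatus, event.getHealth().getDetails()));
            } else if (Status.UP.equals(newStatus)) {
                // Recovered
                alertService.sendAlert("[RECOVERED] AI service is back to normal");
            }
            lastStatus = newStatus;
        }
    }
}
The core principle of health check design for AI services: do not let healthy infrastructure mask unavailable AI functionality. Health as the user perceives it is the health that counts.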
