Disaster Recovery Design for AI Applications: What Happens to Your System When OpenAI Goes Down

老张 · 2026/4/30 · about 8 min read

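In the early hours of a day in December 2024, our AI writing assistant suddenly started throwing errors at scale, and user reports came flooding in. I opened the monitoring dashboard and saw that every endpoint calling the OpenAI API was returning 503.

I checked OpenAI's status page and, sure enough, their API service was having a large-scale outage during business hours in Europe and the US. It lasted nearly two hours.

Two hours of downtime cost us several hundred user complaints and a wave of cancellations.

After that incident I spent two weeks redesigning our entire AI service call architecture around a single principle: if any one external AI service goes down, our system must not surface the error directly to users.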

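1. How Risky Is Depending on a Single Cloud API?

Many teams have an intuitive sense of the risk of a single point of dependency but no feel for the numbers. Here is a rough estimate.

There is no clearly stated official SLA for the OpenAI API. Judging from the historical incident record, availability most likely falls between 99.5% and 99.9%. Let's take the midpoint, 99.7%.

99.7% availability means about 26 hours of downtime per year (0.3% of 8,760 hours). If AI is a core feature of your product, users are completely unable to use it for about 26 hours a year.

Worse, those 26 hours are not spread evenly. They cluster at exactly the moment you least want them.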
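The estimate above is simple arithmetic, and it can be useful to rerun it with your own numbers. The following is a minimal sketch; the class name `DowntimeEstimate` and its method are made up for illustration, and a year is taken as 365 days.

```java
// Minimal sketch: converts an availability fraction into expected downtime per year.
// DowntimeEstimate is a hypothetical helper, not part of any library.
final class DowntimeEstimate {

    // 365 days * 24 hours = 8760 hours, ignoring leap years
    static final double HOURS_PER_YEAR = 365 * 24;

    /** @param availability fraction between 0 and 1, e.g. 0.997 for 99.7% */
    static double annualDowntimeHours(double availability) {
        if (availability < 0.0 || availability > 1.0) {
            throw new IllegalArgumentException("availability must be between 0 and 1");
        }
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        // The low end, midpoint and high end of the range estimated in the text
        for (double availability : new double[] {0.995, 0.997, 0.999}) {
            System.out.printf("%.1f%% available: about %.1f hours down per year%n",
                    availability * 100, annualDowntimeHours(availability));
        }
    }
}
```

Across the 99.5% to 99.9% range this works out to roughly 44, 26 and 9 hours a year, so even the optimistic end is more than a full working day.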

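Outages are not the only source of risk:

Model versions change without warning and behave differently (the behavior differences when moving from GPT-3.5 to GPT-4 were a case of this)
API price increases change your cost structure
Rate limits tighten and 429 errors spike at peak times
Network access is unstable in certain regions (especially common in mainland China)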

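2. The Three-Layer Fallback Architecture

To address the risks above, I designed a three-layer fallback architecture:

Layer 1: primary model. Choose the best-performing cloud model (usually GPT-4o or Claude Sonnet). This is the first choice under normal conditions.

Layer 2: backup model. Choose a model from a different vendor (if the primary is OpenAI, pick Anthropic or Google Gemini as the backup). Two models from different vendors are very unlikely to be down at the same time. If failures were fully independent and each vendor sat at 99.7%, both would be down together about 0.0009% of the time, or under five minutes a year. Real outages can correlate, so treat that as a best case.

Layer 3: local fallback model. Deploy a lightweight open-source model (Ollama + Qwen2.5-7B or LLaMA3-8B). This layer has limited capability, but it still beats returning an error. It mainly handles simple Q&A, formatting, and other tasks that don't need strong reasoning.

Last resort: static fallback response. If even the local model is down (very unlikely), return a predefined static response that points users to another channel or asks them to wait for recovery.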
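Before the Spring and Resilience4j version later in this article, the layering itself can be sketched in plain Java. This is only an illustration of the control flow: `FallbackChain`, `Outcome` and the layer names are hypothetical, and each "model" is just a function from prompt to answer that throws when it is unavailable.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of an ordered fallback chain. All names here are hypothetical.
final class FallbackChain {

    /** The answer plus the name of the layer that produced it. */
    record Outcome(String layer, String content) {}

    // LinkedHashMap keeps insertion order, which is the order layers are tried in
    private final Map<String, Function<String, String>> layers = new LinkedHashMap<>();
    private final String staticMessage;

    FallbackChain(String staticMessage) {
        this.staticMessage = staticMessage;
    }

    FallbackChain addLayer(String name, Function<String, String> model) {
        layers.put(name, model);
        return this;
    }

    Outcome call(String prompt) {
        for (Map.Entry<String, Function<String, String>> layer : layers.entrySet()) {
            try {
                return new Outcome(layer.getKey(), layer.getValue().apply(prompt));
            } catch (RuntimeException e) {
                // This layer failed; fall through to the next one
            }
        }
        // Every layer failed: return the predefined static response
        return new Outcome("static", staticMessage);
    }

    public static void main(String[] args) {
        FallbackChain chain = new FallbackChain("AI service is temporarily unavailable.")
                .addLayer("primary", p -> { throw new IllegalStateException("503"); })
                .addLayer("fallback", p -> { throw new IllegalStateException("429"); })
                .addLayer("local", p -> "local answer to: " + p);
        System.out.println(chain.call("hello"));
    }
}
```

Returning the layer name alongside the content is the important design choice: the caller needs it to decide what, if anything, to tell the user.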

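3. Designing the User Experience During Degradation

Many engineers treat fallback purely as a technical problem and ignore the user experience. The worst fallback strategy I have seen is: primary model fails → return a 500 error.

Slightly better, but still not good enough: primary model fails → backup model → if the backup also fails, return "Service temporarily unavailable".

A good degraded experience means either the user doesn't notice, or the degradation is transparent to them.

In practice the fallback model is usually weaker than the primary, and users will notice. The right strategy then is honesty plus expectation setting.

Specifically:

When switching to the backup cloud model, users usually can't perceive a quality difference, so the switch can be seamless
When switching to the small local model, quality drops, so the UI should say something like "Currently using a reduced AI mode; answer quality may be lower"
When fully degraded to a static response, tell users explicitly that "The AI service is under maintenance and is expected to recover in X minutes", and offer an alternative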

4. Wrapping AI Calls with a Fallback Strategy

4.1 Model Configuration

@Data
@ConfigurationProperties(prefix = "ai.models")
@Configuration
public class AiModelsConfig {

    private ModelConfig primary;
    private ModelConfig fallback;
    private LocalModelConfig local;
    private StaticFallbackConfig staticFallback;

    @Data
    public static class ModelConfig {
        private String provider;   // openai / anthropic / gemini
        private String modelId;
        private String apiKey;
        private String baseUrl;
        private int timeoutSeconds = 30;
        private int maxRetries = 2;
    }

    @Data
    public static class LocalModelConfig {
        private boolean enabled = false;
        private String ollamaBaseUrl = "http://localhost:11434";
        private String modelId = "qwen2.5:7b";
        private int timeoutSeconds = 60;
    }

    @Data
    public static class StaticFallbackConfig {
        private String message = "The AI service is temporarily under maintenance. Please try again later.";
        private boolean enabled = true;
    }
}

The corresponding application.yml:

ai:
  models:
    primary:
      provider: openai
      model-id: gpt-4o
      api-key: ${OPENAI_API_KEY}
      base-url: https://api.openai.com
      timeout-seconds: 30
      max-retries: 2
    fallback:
      provider: anthropic
      model-id: claude-3-5-sonnet-20241022
      api-key: ${ANTHROPIC_API_KEY}
      base-url: https://api.anthropic.com
      timeout-seconds: 30
      max-retries: 2
    local:
      enabled: true
      ollama-base-url: http://localhost:11434
      model-id: qwen2.5:7b
      timeout-seconds: 60
    static-fallback:
      enabled: true
      message: "The AI service is temporarily under maintenance. Please try again later. If you need help, contact customer support."

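4.2 Circuit Breaker Configuration

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;
import java.util.Map;

@Configuration
public class CircuitBreakerConfig {

    @Bean
    public CircuitBreakerRegistry circuitBreakerRegistry() {
        // Primary model breaker: opens at a 50% failure rate, allows trial calls after 30 seconds.
        // Fully qualified name because this class is also called CircuitBreakerConfig.
        io.github.resilience4j.circuitbreaker.CircuitBreakerConfig primaryConfig =
            io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.custom()
                .failureRateThreshold(50)           // open when the failure rate reaches 50%
                .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open for 30 seconds
                .slidingWindowSize(10)              // sliding window of the last 10 calls
                .minimumNumberOfCalls(5)            // need at least 5 calls before computing the rate
                .permittedNumberOfCallsInHalfOpenState(3)  // 3 trial calls while half-open
                .build();

        // Backup model breaker: more tolerant threshold, smaller window, shorter wait
        io.github.resilience4j.circuitbreaker.CircuitBreakerConfig fallbackConfig =
            io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.custom()
                .failureRateThreshold(70)
                .waitDurationInOpenState(Duration.ofSeconds(20))
                .slidingWindowSize(5)
                .minimumNumberOfCalls(3)
                .build();

        // These are registered as named configurations. A breaker only uses one of them
        // when it is created with circuitBreaker(name, configName).
        return CircuitBreakerRegistry.of(Map.of(
            "primary-model", primaryConfig,
            "fallback-model", fallbackConfig
        ));
    }
}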
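To make the parameters above concrete, here is a hand-written toy state machine that mimics what they control. It is a simplified illustration and not how Resilience4j is implemented: `ToyCircuitBreaker` is a made-up class, the clock is injected so the behavior can be checked without sleeping, and a single trial call decides the half-open state (Resilience4j evaluates `permittedNumberOfCallsInHalfOpenState` calls instead).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Toy count-based circuit breaker for illustration only (hypothetical class).
final class ToyCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;               // like slidingWindowSize
    private final int minimumCalls;             // like minimumNumberOfCalls
    private final double failureRateThreshold;  // percent, like failureRateThreshold
    private final long waitMillis;              // like waitDurationInOpenState
    private final LongSupplier clock;           // current time in milliseconds
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    ToyCircuitBreaker(int windowSize, int minimumCalls, double failureRateThreshold,
                      long waitMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.waitMillis = waitMillis;
        this.clock = clock;
    }

    /** Returns false while the breaker is open, so the caller skips this model. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN; // wait elapsed: let a trial call through
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    synchronized void record(boolean success) {
        if (state == State.OPEN) {
            return; // calls are not supposed to run while open
        }
        if (state == State.HALF_OPEN) {
            if (success) {
                state = State.CLOSED;
                window.clear();
            } else {
                open();
            }
            return;
        }
        window.addLast(!success);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent calls
        }
        long failures = window.stream().filter(failed -> failed).count();
        boolean enoughCalls = window.size() >= minimumCalls;
        if (enoughCalls && failures * 100.0 / window.size() >= failureRateThreshold) {
            open();
        }
    }

    synchronized State state() {
        return state;
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }

    public static void main(String[] args) {
        long[] now = {0};
        // Same numbers as the primary model configuration above
        ToyCircuitBreaker cb = new ToyCircuitBreaker(10, 5, 50.0, 30_000, () -> now[0]);
        for (int i = 0; i < 5; i++) {
            cb.record(false);
        }
        System.out.println("after 5 failures: " + cb.state());
        now[0] = 30_000;
        System.out.println("allowed after wait: " + cb.allowRequest() + ", state: " + cb.state());
    }
}
```

The point to take away is that `minimumNumberOfCalls` delays the first decision: with the primary settings, four straight failures do nothing and the fifth opens the breaker.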

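4.3 The Three-Layer Fallback AI Call Service

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

// AiRequest, AiResponse, ModelLayer, OllamaClient and getClientForProvider(...) are
// application-specific types and helpers that are not shown in this article.
@Service
@Slf4j
public class ResilientAiService {

    // Constructor injection, so every final field is initialized in one place
    private final AiModelsConfig modelsConfig;
    private final MeterRegistry meterRegistry;
    private final CircuitBreaker primaryCb;
    private final CircuitBreaker fallbackCb;

    public ResilientAiService(AiModelsConfig modelsConfig,
                              CircuitBreakerRegistry circuitBreakerRegistry,
                              MeterRegistry meterRegistry) {
        this.modelsConfig = modelsConfig;
        this.meterRegistry = meterRegistry;
        // The second argument selects the named configuration registered in 4.2.
        // circuitBreaker(name) alone would silently use the registry's default config.
        this.primaryCb = circuitBreakerRegistry.circuitBreaker("primary-model", "primary-model");
        this.fallbackCb = circuitBreakerRegistry.circuitBreaker("fallback-model", "fallback-model");
    }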

    /**
     * AI call with three layers of fallback.
     * The return value records which model layer actually served the request,
     * so the caller can decide what to tell the user.
     */
    public AiResponse callWithFallback(AiRequest request) {
        // Layer 1: primary model
        try {
            return Decorators.ofSupplier(() -> callPrimaryModel(request))
                    .withCircuitBreaker(primaryCb)
                    .withRetry(buildRetry("primary"))
                    .decorate()
                    .get()
                    .withModelLayer(ModelLayer.PRIMARY);  // tag it so callers never see a null layer
        } catch (Exception primaryEx) {
            log.warn("Primary model failed: {}", primaryEx.getMessage());
            meterRegistry.counter("ai.fallback.trigger",
                    "layer", "primary", "reason", classifyError(primaryEx)).increment();
        }

        // Layer 2: backup cloud model
        try {
            return Decorators.ofSupplier(() -> callFallbackModel(request))
                    .withCircuitBreaker(fallbackCb)
                    .withRetry(buildRetry("fallback"))
                    .decorate()
                    .get()
                    .withModelLayer(ModelLayer.FALLBACK);  // mark that the backup model was used
        } catch (Exception fallbackEx) {
            log.warn("Fallback model failed: {}", fallbackEx.getMessage());
            meterRegistry.counter("ai.fallback.trigger",
                    "layer", "fallback", "reason", classifyError(fallbackEx)).increment();
        }

        // Layer 3: local model
        if (modelsConfig.getLocal().isEnabled()) {
            try {
                AiResponse localResponse = callLocalModel(request);
                log.info("Using local model for request: {}", request.getRequestId());
                return localResponse.withModelLayer(ModelLayer.LOCAL);
            } catch (Exception localEx) {
                log.error("Local model also failed: {}", localEx.getMessage());
                meterRegistry.counter("ai.fallback.trigger",
                        "layer", "local", "reason", classifyError(localEx)).increment();
            }
        }

        // Last resort: static response
        log.error("All AI models failed, returning static fallback for request: {}",
                request.getRequestId());
        meterRegistry.counter("ai.fallback.trigger", "layer", "static", "reason", "all_failed").increment();
        return buildStaticFallback(request);
    }

    private AiResponse callPrimaryModel(AiRequest request) {
        AiModelsConfig.ModelConfig config = modelsConfig.getPrimary();
        // The actual call: pick a client based on the configured provider
        return getClientForProvider(config).chat(request, config);
    }

    /**
     * Calls the primary model without the circuit breaker or retry.
     * Used by AiModelHealthChecker to probe whether the primary model has recovered.
     */
    public AiResponse callPrimaryModelDirect(AiRequest request) {
        return callPrimaryModel(request);
    }

    private AiResponse callFallbackModel(AiRequest request) {
        AiModelsConfig.ModelConfig config = modelsConfig.getFallback();
        return getClientForProvider(config).chat(request, config);
    }

    private AiResponse callLocalModel(AiRequest request) {
        // Call the local Ollama service
        AiModelsConfig.LocalModelConfig localConfig = modelsConfig.getLocal();
        OllamaClient ollamaClient = new OllamaClient(localConfig.getOllamaBaseUrl());

        // The local model has limited capacity, so truncate overly long requests
        AiRequest simplifiedRequest = request.truncateContextTo(2000);
        return ollamaClient.chat(simplifiedRequest, localConfig.getModelId());
    }

    private AiResponse buildStaticFallback(AiRequest request) {
        String message = modelsConfig.getStaticFallback().getMessage();
        return AiResponse.builder()
                .content(message)
                .modelLayer(ModelLayer.STATIC)
                .requestId(request.getRequestId())
                .build();
    }

    private String classifyError(Exception e) {
        if (e instanceof TimeoutException) return "timeout";
        if (e instanceof CallNotPermittedException) return "circuit_open";
        if (e.getMessage() != null && e.getMessage().contains("429")) return "rate_limited";
        if (e.getMessage() != null && e.getMessage().contains("503")) return "service_unavailable";
        return "unknown";
    }

    private Retry buildRetry(String name) {
        return Retry.of(name, RetryConfig.custom()
                .maxAttempts(2)
                .waitDuration(Duration.ofMillis(500))
                .retryExceptions(IOException.class, TimeoutException.class)
                .ignoreExceptions(CallNotPermittedException.class)  // don't retry while the circuit breaker is open
                .build());
    }
}
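The service above relies on `AiResponse` and `ModelLayer`, which this article does not show. A minimal sketch of what they might look like follows; the field names are guesses based on how the service uses them, and a record is used for brevity, so the accessor is `modelLayer()` where the article's code calls a Lombok-style `getModelLayer()`.

```java
// Hypothetical shapes for the types used by ResilientAiService; the real ones are not shown
// in this article. AiResponseSketch only exists to make the example runnable.
final class AiResponseSketch {
    public static void main(String[] args) {
        // A client returns a response without knowing which layer it belongs to
        AiResponse fromClient = new AiResponse("hello", null, "req-1");
        // The service then tags it with the layer that actually served the request
        AiResponse tagged = fromClient.withModelLayer(ModelLayer.LOCAL);
        System.out.println(tagged.modelLayer() + " degraded=" + tagged.isDegraded());
    }
}

/** Which layer of the fallback chain produced a response. */
enum ModelLayer { PRIMARY, FALLBACK, LOCAL, STATIC }

/** Immutable response; withModelLayer returns a tagged copy instead of mutating. */
record AiResponse(String content, ModelLayer modelLayer, String requestId) {

    AiResponse withModelLayer(ModelLayer layer) {
        return new AiResponse(content, layer, requestId);
    }

    /** True when the user should be told that quality may be reduced. */
    boolean isDegraded() {
        return modelLayer == ModelLayer.LOCAL || modelLayer == ModelLayer.STATIC;
    }
}
```

Keeping the response immutable means a response object cached or logged before tagging can never change layer underneath you.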

4.4 Handling User Experience in the Controller Layer

@Slf4j  // needed for the log calls below
@RestController
@RequestMapping("/api/chat")
public class ChatController {

    @Autowired
    private ResilientAiService aiService;

    @PostMapping
    public ChatResponse chat(@RequestBody ChatRequest request) {
        AiResponse aiResponse = aiService.callWithFallback(
            AiRequest.from(request)
        );

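        // Decide whether to attach a notice based on the model layer that was used
        ChatResponse response = ChatResponse.from(aiResponse);

        switch (aiResponse.getModelLayer()) {
            case PRIMARY:
                // Normal case, no extra notice needed
                break;
            case FALLBACK:
                // Served by the backup model. Quality is close to the primary, so no notice,
                // but log it for later analysis
                log.info("Request {} served by fallback model", request.getRequestId());
                break;
            case LOCAL:
                // Served by the small local model. Quality may drop, so tell the user
                response.setSystemNotice("The AI service is currently unstable. You are using a reduced mode, and answer quality may be affected.");
                response.setDegraded(true);
                break;
            case STATIC:
                // Static fallback: tell the user explicitly
                response.setSystemNotice("The AI service is temporarily under maintenance. Please try again later.");
                response.setDegraded(true);
                response.setFailed(true);
                break;
        }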

        return response;
    }
}

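5. Health Checks and Automatic Recovery

Fallback alone is not enough. You also need a mechanism that switches back automatically once the primary model recovers: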

@Component
@Slf4j
public class AiModelHealthChecker {

    @Autowired
    private ResilientAiService aiService;

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

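    // Probe the primary model every 30 seconds (requires @EnableScheduling on a configuration class)
    @Scheduled(fixedDelay = 30000)
    public void probeModelHealth() {
        // Looks the breaker up by name; the second argument names the configuration
        // to use if the breaker has not been created yet
        CircuitBreaker primaryCb = circuitBreakerRegistry.circuitBreaker("primary-model", "primary-model");

        // Only probe actively while the circuit breaker is OPEN
        if (primaryCb.getState() != CircuitBreaker.State.OPEN) {
            return;
        }

        log.info("Primary model circuit breaker is OPEN, probing health...");

        try {
            // Send a simple probe request (short prompt, low cost)
            AiRequest healthCheck = AiRequest.builder()
                    .prompt("Say 'OK'")
                    .maxTokens(5)
                    .requestId("health-probe-" + System.currentTimeMillis())
                    .build();

            // Call the primary model directly, bypassing the circuit breaker.
            // callPrimaryModelDirect is assumed to be a public method on ResilientAiService
            // that calls the primary model without the circuit breaker or retry.
            aiService.callPrimaryModelDirect(healthCheck);

            // On success, manually move the circuit breaker to half-open
            log.info("Primary model health probe succeeded, transitioning to HALF_OPEN");
            primaryCb.transitionToHalfOpenState();

        } catch (Exception e) {
            log.warn("Primary model health probe failed: {}", e.getMessage());
        }
    }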
}

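6. Cost and Benefit Analysis

What this architecture costs:

API keys and usage cost for the backup model (it is normally on standby, so actual spend is small)
Server resources for the local model (a machine with 16 GB of memory can run a 7B model)
Extra code complexity (the code above comes to roughly 500 lines)

What you get:

The single point of dependency is eliminated
Users don't notice when the primary model goes down (traffic switches straight to the backup cloud model)
In the extreme case where every cloud model is down, the local model still covers you

From an ROI perspective, the user churn and reputation damage from a single two-hour outage far outweigh the cost of building this architecture.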

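Summary

The core of disaster recovery design for AI applications is three layers of fallback:

Multiple cloud models backing each other up (from different vendors)
A local model as the last substitute for the cloud
A static response as the last resort, so users know what happened

On the implementation side, Resilience4j's circuit breaker and retry mechanisms are the core tools. On the user experience side, distinguish between fallback layers and tell users only when quality drops noticeably, so you don't cause unnecessary anxiety.

Don't wait until the day of the outage to do this.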