Article 2169: Post-Incident Reviews for AI Systems: Extracting Engineering Improvements from Every Incident

Lao Zhang · 2026/4/30 · about 7 min read

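Intended audience: engineers responsible for operating AI systems and improving their quality | Reading time: about 17 minutes | Key takeaway: build a systematic AI incident postmortem process that turns every failure into an opportunity for engineering improvement instead of an exercise in finger-pointing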

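At 1 a.m., the AI customer-service system started returning garbled text and repeated content to users. This went on for 40 minutes and affected several thousand users before the operations team noticed, raised the alarm, and got it fixed.

The postmortem meeting afterwards ran for an hour. The team spent 50 minutes arguing over "whose fault was it" and the last 10 minutes hastily writing an improvement plan that said "strengthen monitoring". Two months later, a similar incident happened again.

This is the typical failure mode of AI system postmortems: the review turns into blame assignment, the improvement plan never lands, and the same class of problem keeps recurring.

Postmortems for AI systems need a different approach, because an AI system failure is often not one person's mistake but a blind spot in the system's design.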

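What Makes AI Incidents Different

AI system failures differ from traditional software failures:

Traditional software failures:
- Usually come with an explicit error (NPE, OOM, SQL exception)
- Have a relatively clear cause that is easy to locate
- Have a relatively well-defined fix

Characteristics of AI system failures:
1. Ambiguity: no error is raised; the system "runs normally" but the output quality is poor
2. Non-determinism: the same operation may or may not trigger the failure
3. Intertwined factors: the model, the prompt, the data, and the infrastructure may all have problems at the same time
4. Lag: a quality drop may only be discovered hours after it starts
5. Hard to reproduce: the problem may depend on specific inputs and is difficult to reproduce reliably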
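
Because these failures raise no errors and surface late, one option is to alert on output quality itself rather than on exceptions. The following is a minimal sketch, not part of the original system: `QualityDriftMonitor`, `windowSize`, and `minPassRate` are hypothetical names, and it assumes each response already receives a pass/fail evaluation result.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window monitor over pass/fail evaluation results. */
final class QualityDriftMonitor {
    private final int windowSize;
    private final double minPassRate;
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int passedInWindow = 0;

    QualityDriftMonitor(int windowSize, double minPassRate) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.minPassRate = minPassRate;
    }

    /**
     * Records one evaluation result. Returns true only when the window is full
     * and its pass rate has fallen below the configured minimum.
     */
    boolean record(boolean passed) {
        window.addLast(passed);
        if (passed) {
            passedInWindow++;
        }
        if (window.size() > windowSize) {
            // Evict the oldest result and keep the running pass count in sync.
            if (window.removeFirst()) {
                passedInWindow--;
            }
        }
        return window.size() == windowSize && passRate() < minPassRate;
    }

    /** Pass rate over the current window; 1.0 when no results have been recorded. */
    double passRate() {
        return window.isEmpty() ? 1.0 : (double) passedInWindow / window.size();
    }
}
```

Waiting for a full window before alerting trades a little detection delay for fewer false alarms on the first few samples; both parameters would need tuning against real traffic.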

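Incident Response and Data Collection

The foundation of a postmortem is complete incident data, which means the data has to be collected while the incident is happening.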

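import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

// Domain types (InteractionLog, EvaluationReport, IncidentContext, the repositories, ...)
// are assumed to live in the same package.

/**
 * Incident data collection service.
 *
 * Automatically gathers the key data needed for diagnosis when an incident occurs.
 */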
@Service
@RequiredArgsConstructor
@Slf4j
public class IncidentDataCollector {

    private final InteractionLogRepository logRepository;
    private final EvaluationResultRepository evalRepository;
    private final SystemMetricsRepository metricsRepository;
    private final ModelConfigRepository configRepository;

    /**
     * Automatically collects the incident context when an alert fires.
     *
     * @param incidentId incident ID
     * @param startTime  time the incident started
     * @param endTime    time the incident ended
     */
    public IncidentContext collectContext(String incidentId, Instant startTime, Instant endTime) {
        log.info("Collecting incident data: incidentId={}, period={} to {}", incidentId, startTime, endTime);
        
        // 1. Collect the affected interaction logs
        List<InteractionLog> affectedInteractions = logRepository.findBetween(startTime, endTime);

        // 2. Collect quality evaluation data
        List<EvaluationReport> qualityReports = evalRepository.findBetween(startTime, endTime);

        // 3. Collect system metrics (latency, error rate, token consumption, etc.)
        SystemMetricsSnapshot metricsSnapshot = metricsRepository.snapshot(startTime, endTime);

        // 4. Collect configuration changes from 24 hours before the incident until it ended
        List<ConfigChange> recentConfigChanges = configRepository.findChanges(
            startTime.minus(Duration.ofHours(24)), endTime
        );

        // 5. Assess the scope of the impact
        ImpactAssessment impact = assessImpact(affectedInteractions, qualityReports);

        // 6. Identify anomaly patterns
        List<AnomalyPattern> anomalies = detectAnomalies(affectedInteractions, qualityReports, metricsSnapshot);
        
        IncidentContext context = IncidentContext.builder()
            .incidentId(incidentId)
            .startTime(startTime)
            .endTime(endTime)
            .durationMinutes(Duration.between(startTime, endTime).toMinutes())
            .affectedInteractionCount(affectedInteractions.size())
            .sampleAffectedInteractions(sampleRepresentative(affectedInteractions, 20))
            .qualityMetrics(summarizeQuality(qualityReports))
            .systemMetrics(metricsSnapshot)
            .recentConfigChanges(recentConfigChanges)
            .impact(impact)
            .anomalyPatterns(anomalies)
            .build();
        
        // Persist the incident context
        logRepository.saveIncidentContext(context);

        log.info("Incident data collected: {} affected interactions, {} anomaly patterns",
            affectedInteractions.size(), anomalies.size());
        
        return context;
    }

    private ImpactAssessment assessImpact(List<InteractionLog> interactions, 
                                           List<EvaluationReport> reports) {
        long negativeInteractions = interactions.stream()
            .filter(i -> i.getUserFeedback() != null && i.getUserFeedback().isNegative())
            .count();
        
        double avgQualityDuringIncident = reports.stream()
            .mapToDouble(EvaluationReport::getOverallScore)
            .average().orElse(0);
        
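        double passingRate = reports.stream()
            .mapToDouble(r -> r.isPassed() ? 1.0 : 0.0)
            .average().orElse(0);

        // With no evaluation reports the pass rate above is a placeholder 0, which would
        // otherwise always classify as P1; in that case grade on negative feedback alone.
        double passRateForSeverity = reports.isEmpty() ? 1.0 : passingRate;

        return ImpactAssessment.builder()
            .totalAffectedUsers(interactions.stream().map(InteractionLog::getUserId).distinct().count())
            .negativeInteractions(negativeInteractions)
            .avgQualityScore(avgQualityDuringIncident)
            .qualityPassRate(passingRate)
            .severityLevel(classifySeverity(negativeInteractions, passRateForSeverity))
            .build();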
    }

    private List<AnomalyPattern> detectAnomalies(List<InteractionLog> interactions,
                                                   List<EvaluationReport> reports,
                                                   SystemMetricsSnapshot metrics) {
        List<AnomalyPattern> anomalies = new ArrayList<>();
        
        // Detect time patterns (which hour of the day was hit hardest)
        if (!reports.isEmpty()) {
            Map<Integer, Double> scoreByHour = reports.stream()
                .collect(Collectors.groupingBy(
                    r -> r.getTimestamp().atZone(ZoneId.systemDefault()).getHour(),
                    Collectors.averagingDouble(EvaluationReport::getOverallScore)
                ));
            
            scoreByHour.forEach((hour, score) -> {
                if (score < 0.5) {
                    anomalies.add(AnomalyPattern.builder()
                        .type("TIME_PATTERN")
                        .description(String.format("Around %d:00, the quality score dropped to %.2f", hour, score))
                        .severity("HIGH")
                        .build());
                }
            });
        }
        
        // Detect problems concentrated in specific intents
        Map<String, Long> failuresByIntent = reports.stream()
            .filter(r -> !r.isPassed())
            .filter(r -> r.getIntentLabel() != null)
            .collect(Collectors.groupingBy(EvaluationReport::getIntentLabel, Collectors.counting()));
        
        failuresByIntent.forEach((intent, count) -> {
            if (count > 10) {
                anomalies.add(AnomalyPattern.builder()
                    .type("INTENT_PATTERN")
                    .description(String.format("Intent [%s] has %d failed cases; the problem may be specific to this scenario", intent, count))
                    .severity("MEDIUM")
                    .affectedIntent(intent)
                    .build());
            }
        });
        
        // Detect abnormal outputs (repeated output, garbled text, etc.)
        long repetitiveOutputs = interactions.stream()
            .filter(i -> i.getOutput() != null && isRepetitive(i.getOutput()))
            .count();
        
        if (repetitiveOutputs > interactions.size() * 0.1) {
            anomalies.add(AnomalyPattern.builder()
                .type("OUTPUT_REPETITION")
                .description(String.format("%.1f%% of outputs contain repeated content, possibly a model decoding anomaly",
                    (double) repetitiveOutputs / interactions.size() * 100))
                .severity("HIGH")
                .build());
        }
        
        return anomalies;
    }

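    /**
     * Returns true if any sentence longer than 10 characters appears more than once.
     * Sentences are split on the full-width terminators 。！？ only.
     */
    private boolean isRepetitive(String output) {
        String[] parts = output.split("。|！|？");
        Set<String> seen = new HashSet<>();
        for (String p : parts) {
            String t = p.trim();
            // Short fragments repeat naturally; Set.add returns false on a duplicate.
            if (t.length() > 10 && !seen.add(t)) return true;
        }
        return false;
    }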

    private IncidentSeverity classifySeverity(long negativeCount, double passRate) {
        if (passRate < 0.3 || negativeCount > 100) return IncidentSeverity.P1;
        if (passRate < 0.5 || negativeCount > 20) return IncidentSeverity.P2;
        return IncidentSeverity.P3;
    }

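    private List<InteractionLog> sampleRepresentative(List<InteractionLog> logs, int count) {
        // Shuffle a copy so the caller's list (which may be immutable) is left untouched;
        // the fixed seed keeps the sample reproducible across reruns.
        List<InteractionLog> shuffled = new ArrayList<>(logs);
        Collections.shuffle(shuffled, new Random(42));
        return new ArrayList<>(shuffled.subList(0, Math.min(count, shuffled.size())));
    }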

    private QualityMetricsSummary summarizeQuality(List<EvaluationReport> reports) {
        if (reports.isEmpty()) return QualityMetricsSummary.empty();
        return QualityMetricsSummary.builder()
            .avgScore(reports.stream().mapToDouble(EvaluationReport::getOverallScore).average().orElse(0))
            .passRate(reports.stream().mapToDouble(r -> r.isPassed() ? 1.0 : 0.0).average().orElse(0))
            .sampleCount(reports.size())
            .build();
    }
}
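
The anomaly detection in the collector names garbled output as a target, yet the only concrete check it implements is for repetition. As a hedged sketch of what a garbled-text check might look like: the class `GarbledTextDetector` and the threshold parameter `maxSuspectRatio` are hypothetical, and the heuristic (counting replacement, control, unassigned, private-use, and unpaired-surrogate characters) is only one possible signal, not the author's method.

```java
/** Hypothetical heuristic: flags text with many replacement or control characters. */
final class GarbledTextDetector {
    private GarbledTextDetector() {}

    /** Fraction of code points that look like decoding debris; 0.0 for null or empty text. */
    static double suspectRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        long total = text.codePoints().count();
        long suspect = text.codePoints().filter(GarbledTextDetector::isSuspect).count();
        return (double) suspect / total;
    }

    /** True when the suspect fraction is strictly above the given threshold. */
    static boolean looksGarbled(String text, double maxSuspectRatio) {
        return suspectRatio(text) > maxSuspectRatio;
    }

    private static boolean isSuspect(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // U+FFFD is what decoders emit for undecodable bytes
        }
        if (Character.isWhitespace(codePoint)) {
            return false; // tabs and newlines are ordinary content
        }
        int type = Character.getType(codePoint);
        return type == Character.CONTROL
            || type == Character.UNASSIGNED
            || type == Character.PRIVATE_USE
            || type == Character.SURROGATE; // an unpaired surrogate
    }
}
```

A check like this could feed an additional `AnomalyPattern` alongside the repetition check. It will not catch mojibake made of valid but wrong characters, so it complements rather than replaces sampling outputs by hand.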

A Systematic Postmortem Template

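import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

/**
 * Postmortem report generator.
 *
 * Produces a structured postmortem report based on the 5-Why method.
 */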
@Service
@RequiredArgsConstructor
public class PostmortemReportGenerator {

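    /**
     * Generates the postmortem report template.
     *
     * It contains a standard postmortem framework so that every review
     * produces improvements of real value.
     */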
    public String generateTemplate(IncidentContext context) {
        StringBuilder sb = new StringBuilder();
        
        sb.append("# Incident Postmortem Report\n\n");
        sb.append("## 1. Incident Summary\n");
        sb.append(String.format("- Incident ID: %s\n", context.getIncidentId()));
        sb.append(String.format("- Start time: %s\n", context.getStartTime()));
        sb.append(String.format("- Duration: %d minutes\n", context.getDurationMinutes()));
        sb.append(String.format("- Affected users: %d\n\n", context.getImpact().getTotalAffectedUsers()));

        sb.append("## 2. Impact Analysis\n");
        sb.append(String.format("- Quality pass rate: %.1f%% (normal level ~85%%)\n",
            context.getImpact().getQualityPassRate() * 100));
        sb.append(String.format("- Negative user feedback: %d items\n\n", context.getImpact().getNegativeInteractions()));

        sb.append("## 3. Incident Timeline\n");
        sb.append("(Fill in: detection time, response time, recovery time, and the action taken at each key point)\n\n");

        sb.append("## 4. Root Cause Analysis (5-Why)\n");
        sb.append("**Symptom**: degraded output quality for users\n");
        sb.append("**Why 1**: Why did the output quality degrade?\n→ \n");
        sb.append("**Why 2**: Why did [answer to Why 1] happen?\n→ \n");
        sb.append("**Why 3**: Why did [answer to Why 2] happen?\n→ \n");
        sb.append("**Why 4**: Why did [answer to Why 3] happen?\n→ \n");
        sb.append("**Why 5**: Why did [answer to Why 4] happen?\n→ \n");
        sb.append("**Root cause**: \n\n");
        
        if (!context.getRecentConfigChanges().isEmpty()) {
            sb.append("**Note**: the following configuration changes were made in the 24 hours before the incident:\n");
            context.getRecentConfigChanges().forEach(c ->
                sb.append(String.format("- %s: %s (%s)\n", c.getTimestamp(), c.getDescription(), c.getChangedBy()))
            );
            sb.append("\n");
        }
        
        sb.append("## 5. Issues Found (Technical)\n");
        context.getAnomalyPatterns().forEach(a ->
            sb.append(String.format("- [%s] %s\n", a.getSeverity(), a.getDescription()))
        );
        sb.append("\n");
        
        sb.append("## 6. Improvement Plan (every item must have an owner and a due date)\n");
        sb.append("| Improvement | Type | Owner | Due date | Status |\n");
        sb.append("|-------------|------|-------|----------|--------|\n");
        sb.append("| Add [X] monitoring | Monitoring | | | Pending |\n");
        sb.append("| Create test cases for [X] | Testing | | | Pending |\n");
        sb.append("| Fix [X] | Fix | | | Pending |\n\n");

        sb.append("## 7. Lessons Learned\n");
        sb.append("**What went well**:\n- \n\n");
        sb.append("**What we can improve**:\n- \n\n");
        sb.append("**What other teams can take away**:\n- \n");
        
        return sb.toString();
    }
}
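
The template requires every improvement item to carry an owner and a due date, but a Markdown table cannot enforce that. One way to make the rule checkable is to validate the filled-in items before a postmortem is accepted. This is a minimal sketch under assumptions: `ActionItem` and `ActionItemValidator` are hypothetical types introduced here, not classes from the code above, and records require Java 16 or later.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory shape of one row in the improvement-plan table. */
record ActionItem(String description, String owner, LocalDate dueDate) {}

final class ActionItemValidator {
    private ActionItemValidator() {}

    /** Returns one message per problem found; an empty list means the plan is complete. */
    static List<String> validate(List<ActionItem> items, LocalDate today) {
        List<String> problems = new ArrayList<>();
        if (items.isEmpty()) {
            problems.add("plan has no improvement items");
        }
        for (ActionItem item : items) {
            boolean noDescription = isBlank(item.description());
            String label = noDescription ? "(no description)" : item.description();
            if (noDescription) {
                problems.add("item is missing a description");
            }
            if (isBlank(item.owner())) {
                problems.add(label + ": missing owner");
            }
            if (item.dueDate() == null) {
                problems.add(label + ": missing due date");
            } else if (item.dueDate().isBefore(today)) {
                problems.add(label + ": due date " + item.dueDate() + " is already in the past");
            }
        }
        return problems;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

Passing `today` in as a parameter, rather than calling `LocalDate.now()` inside, keeps the check deterministic and easy to test.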

Tracking Improvement Items Through to Completion

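import java.time.LocalDate;
import java.util.List;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

/**
 * Improvement item tracking service.
 *
 * Makes sure the improvement plan from a postmortem actually gets carried out
 * instead of being written down and forgotten.
 */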
@Service
@RequiredArgsConstructor
@Slf4j
public class ImprovementTrackingService {

    private final ImprovementItemRepository repository;
    private final NotificationService notificationService;

    @Scheduled(cron = "0 0 9 * * MON") // every Monday at 09:00
    public void sendWeeklyReminder() {
        List<ImprovementItem> overdueItems = repository.findOverdue();
        List<ImprovementItem> dueSoonItems = repository.findDueSoon(7); // due within 7 days
        
        if (!overdueItems.isEmpty()) {
            StringBuilder alert = new StringBuilder("The following improvement items are overdue:\n");
            overdueItems.forEach(item ->
                alert.append(String.format("- [%s] %s (owner: %s, due: %s)\n",
                    item.getIncidentId(), item.getDescription(),
                    item.getOwner(), item.getDueDate()))
            );
            notificationService.sendUrgentAlert("Overdue improvement items", alert.toString());
        }
        
        if (!dueSoonItems.isEmpty()) {
            StringBuilder reminder = new StringBuilder("The following improvement items are due within 7 days:\n");
            dueSoonItems.forEach(item ->
                reminder.append(String.format("- [%s] %s (owner: %s, due: %s)\n",
                    item.getIncidentId(), item.getDescription(),
                    item.getOwner(), item.getDueDate()))
            );
            notificationService.sendReminder("Improvement items due soon", reminder.toString());
        }
    }

    @Scheduled(cron = "0 0 10 1 * *") // at 10:00 on the 1st of every month
    public void generateMonthlyReport() {
        LocalDate lastMonth = LocalDate.now().minusMonths(1);
        
        List<ImprovementItem> allItems = repository.findByMonth(lastMonth);
        long completedCount = allItems.stream()
            .filter(i -> i.getStatus() == ImprovementStatus.COMPLETED).count();
        long cancelledCount = allItems.stream()
            .filter(i -> i.getStatus() == ImprovementStatus.CANCELLED).count();
        
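        // Guard against a month with no items: 0 / 0 in double arithmetic yields NaN.
        double completionRate = allItems.isEmpty()
            ? 0.0
            : (double) completedCount / allItems.size() * 100;

        String report = String.format(
            "Last month's improvement items: %d in total, %d completed (%.1f%%), %d cancelled\n",
            allItems.size(), completedCount, completionRate, cancelledCount
        );

        // Count the similar incidents that the improvements are estimated to have prevented
        long preventedIncidents = repository.countPotentiallyPreventedIncidents(lastMonth);
        report += String.format("Similar incidents estimated to have been prevented by improvements: %d\n", preventedIncidents);

        notificationService.sendReport("AI system improvement monthly report", report);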
    }
}
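
The service relies on `repository.findOverdue()` and `repository.findDueSoon(7)`, whose implementations are not shown. The sketch below illustrates one plausible reading of those two queries as a pure function over dates. `DueState` and `DueDateClassifier` are hypothetical names, and treating an item due today as "due soon" rather than overdue is an assumption.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical due-date states for an improvement item. */
enum DueState { CLOSED, OVERDUE, DUE_SOON, ON_TRACK }

final class DueDateClassifier {
    private DueDateClassifier() {}

    /**
     * Classifies one item. "closed" stands for a completed or cancelled item;
     * dueSoonDays is the reminder window (7 in the weekly reminder).
     */
    static DueState classify(LocalDate dueDate, boolean closed, LocalDate today, int dueSoonDays) {
        if (closed) {
            return DueState.CLOSED; // closed items never trigger reminders
        }
        if (dueDate.isBefore(today)) {
            return DueState.OVERDUE;
        }
        long daysLeft = ChronoUnit.DAYS.between(today, dueDate);
        return daysLeft <= dueSoonDays ? DueState.DUE_SOON : DueState.ON_TRACK;
    }
}
```

Keeping the overdue and due-soon sets disjoint, as this classification does, avoids sending both an urgent alert and a reminder for the same item.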

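A good postmortem culture rests on one core principle: it is blameless. The root cause of an incident is usually a system design problem, not an individual's mistake. Only when people do not have to worry about being blamed will they say honestly what actually happened, and only then can the real improvements be found.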