No. 2472: Intelligent Monitoring and Alerting, Upgrading from Rule-Driven to AI-Driven Alerting

Lao Zhang | 2026/4/30 | about 6 min read


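Intended readers: SREs, operations engineers, backend engineers | Reading time: about 16 minutes | Key takeaway: upgrade a static, rule-based alerting system into an intelligent one that adapts its baselines and suppresses noise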

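Once, while I was on call, I received 37 alert text messages at two in the morning.

All 37 were about a single problem: replication lag on a database replica had increased. That one condition tripped 7 different alert rules, and each rule then sent several notifications through different channels. It took me 5 minutes after getting out of bed to read all 37 messages and realize they described the same thing.

Another time, the success rate of a critical API dropped from 99.9% to 98.5%. That was still above the static alert threshold (95%), so nothing fired, yet the error rate had grown fifteenfold (from 0.1% to 1.5%) and we lost a large number of requests.

These two extremes, alert storms and alert blind spots, are the fundamental problems of rule-driven alerting.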

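Three fundamental flaws of rule-based alerting

Flaw 1: static thresholds do not follow business cycles. The "normal" value of a metric is not fixed. Request volume at 9:00 is naturally about 10 times higher than at 3:00, so one shared threshold leaves the morning peak alerting constantly while nighttime problems go unreported.

Flaw 2: one failure triggers many alerts. A single database failure can affect 10 services that depend on that database at the same time, and every related metric on every service may fire. The result is an alert storm that keeps the on-call engineer from quickly identifying the root cause.

Flaw 3: alerts carry no context. The text message says "API success rate below 95%", but the on-call engineer cannot tell whether this just started or has been going on for an hour, how many users are affected, or whether anything was deployed recently. All of that has to be looked up by hand across several systems.

An AI-driven alerting system can address all three.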
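
To make the first flaw concrete, here is a minimal sketch comparing one static threshold with a per-hour threshold on request volume. The class name, the traffic figures, and the 1.5x tolerance are all made up for illustration; only the "9:00 is about 10x 3:00" ratio comes from the text above.

```java
import java.util.Map;

// Illustrative only: names, thresholds, and traffic figures are assumptions for this sketch.
class StaticVsSlotThreshold {

    // One fixed upper bound for the whole day.
    static boolean staticAlert(double requestsPerSec, double threshold) {
        return requestsPerSec > threshold;
    }

    // Upper bound derived from the typical value for that hour, times a tolerance factor.
    static boolean slotAlert(double requestsPerSec, int hourOfDay,
                             Map<Integer, Double> typicalByHour, double tolerance) {
        double typical = typicalByHour.getOrDefault(hourOfDay, Double.NaN);
        if (Double.isNaN(typical)) {
            return false; // no baseline for this hour: do not guess
        }
        return requestsPerSec > typical * tolerance;
    }

    public static void main(String[] args) {
        // Typical requests/sec per hour: 9:00 is 10x the 3:00 level.
        Map<Integer, Double> typicalByHour = Map.of(3, 100.0, 9, 1000.0);
        double staticThreshold = 500.0;

        // Normal morning peak (1000 req/s at 9:00).
        System.out.println(staticAlert(1000.0, staticThreshold));      // true  (false alarm)
        System.out.println(slotAlert(1000.0, 9, typicalByHour, 1.5));  // false (1000 <= 1500)

        // Night traffic quadruples (400 req/s at 3:00).
        System.out.println(staticAlert(400.0, staticThreshold));       // false (missed)
        System.out.println(slotAlert(400.0, 3, typicalByHour, 1.5));   // true  (400 > 150)
    }
}
```

The same static value is wrong in both directions at once, which is why the engine described later keys its baseline on the time slot rather than on the metric alone.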

Intelligent alerting system architecture

Core implementation

1. Adaptive baseline engine

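import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.stereotype.Component;

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

// TimeSeriesRepository, DynamicBaseline, and AnomalyScore are project types not shown in this article.
@Component
public class AdaptiveBaselineEngine {

    private final TimeSeriesRepository tsRepository;
    private final Cache<String, DynamicBaseline> baselineCache;

    // Constructor injection: the final repository field must be assigned here.
    public AdaptiveBaselineEngine(TimeSeriesRepository tsRepository) {
        this.tsRepository = tsRepository;
        this.baselineCache = Caffeine.newBuilder()
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .maximumSize(50000)
            .build();
    }

    /**
     * Computes a dynamic baseline that accounts for periodicity (hourly, daily, weekly).
     * The baseline is built from the same time slot in each of the last 4 weeks,
     * not from a plain historical average.
     */
    public DynamicBaseline computeBaseline(String metricKey, Instant now) {
        return baselineCache.get(metricKey + "_" + getTimeSlot(now), key -> {
            // Collect data from the same weekday and hour (+/- 30 minutes) in each of the past 4 weeks
            List<Double> historicalValues = new ArrayList<>();
            for (int weekBack = 1; weekBack <= 4; weekBack++) {
                Instant historicalTime = now.minus(weekBack * 7L, ChronoUnit.DAYS);
                List<Double> weekValues = tsRepository.getValuesInWindow(
                    metricKey,
                    historicalTime.minus(30, ChronoUnit.MINUTES),
                    historicalTime.plus(30, ChronoUnit.MINUTES)
                );
                historicalValues.addAll(weekValues);
            }

            if (historicalValues.size() < 10) {
                return DynamicBaseline.insufficient();
            }

            // Remove outliers by cutting the highest 5% and lowest 5% of samples
            List<Double> cleanedValues = winsorize(historicalValues, 0.05);

            double mean = computeMean(cleanedValues);
            double stdDev = computeStdDev(cleanedValues, mean);

            // Use a looser dynamic threshold (4-sigma instead of 3-sigma)
            // because normal business fluctuation is fairly large
            return DynamicBaseline.of(
                mean,
                stdDev,
                mean - 4 * stdDev,  // lower bound
                mean + 4 * stdDev,  // upper bound
                cleanedValues.size()
            );
        });
    }

    public AnomalyScore score(String metricKey, double currentValue, Instant now) {
        DynamicBaseline baseline = computeBaseline(metricKey, now);

        if (baseline.isInsufficient()) {
            return AnomalyScore.uncertain(metricKey);
        }

        // Floor the denominator at 1% of |mean| so a near-zero stdDev cannot inflate the z-score.
        // Caveat: for a metric near 100 (e.g. a success rate in percent) this floor is about 1.0,
        // so a drop from 99.9 to 98.5 scores at most about 1.4 in magnitude and is not flagged.
        double spread = Math.max(baseline.getStdDev(), Math.abs(baseline.getMean()) * 0.01);
        if (spread == 0.0) {
            return AnomalyScore.uncertain(metricKey); // mean and stdDev are both zero
        }
        double zScore = (currentValue - baseline.getMean()) / spread;

        // Direction matters in practice: for some metrics only a high value is a problem
        // (error rate), for others only a low value (success rate). This version does not
        // distinguish; it flags a deviation in either direction.
        boolean isAnomaly = Math.abs(zScore) > 4.0;
        // Severity is 0 at the 4-sigma boundary and saturates at 1 from 8-sigma onward
        double severity = Math.max(0.0, Math.min(1.0, (Math.abs(zScore) - 4.0) / 4.0));

        return AnomalyScore.of(metricKey, currentValue, baseline, zScore, isAnomaly, severity);
    }

    // Note: despite the name, this drops the tails (a trimmed sample). Strict Winsorization
    // would clamp the tail values to the 5th/95th percentile instead of removing them.
    private List<Double> winsorize(List<Double> values, double percentile) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int trimCount = (int) (sorted.size() * percentile);
        return sorted.subList(trimCount, sorted.size() - trimCount);
    }

    // Assumption: getTimeSlot was not shown in the original; this version buckets by weekday and hour.
    private String getTimeSlot(Instant now) {
        ZonedDateTime local = now.atZone(ZoneId.systemDefault());
        return local.getDayOfWeek() + "_" + local.getHour();
    }

    // Assumption: the two helpers below were not shown; plain mean and sample standard deviation.
    private double computeMean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    private double computeStdDev(List<Double> values, double mean) {
        if (values.size() < 2) {
            return 0.0;
        }
        double sumSq = values.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum();
        return Math.sqrt(sumSq / (values.size() - 1));
    }
}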
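
One detail of the z-score denominator in score() is worth working through with numbers. The sketch below reuses the same formula shape (deviation divided by the larger of the standard deviation and a fraction of the mean) on the success-rate drop from the opening anecdote. The class name, the assumed historical stdDev of 0.05, and the alternative 0.01% floor are illustrative assumptions, not values from the article.

```java
// Illustrative sketch: names and sample numbers are assumptions, not production code.
class ZScoreFloorDemo {

    // Deviation from the mean divided by max(stdDev, |mean| * relativeFloor).
    static double zScore(double current, double mean, double stdDev, double relativeFloor) {
        double denominator = Math.max(stdDev, Math.abs(mean) * relativeFloor);
        return (current - mean) / denominator;
    }

    public static void main(String[] args) {
        double mean = 99.9;    // historical success rate, in percent
        double stdDev = 0.05;  // assumed spread of the historical window
        double current = 98.5; // the drop described at the start of the article

        // With a 1% floor the denominator is about 0.999, so z is roughly -1.4.
        double withOnePercentFloor = zScore(current, mean, stdDev, 0.01);
        // With a 0.01% floor the denominator is stdDev (0.05), so z is roughly -28.
        double withSmallFloor = zScore(current, mean, stdDev, 0.0001);

        System.out.printf("1%% floor:    z = %.2f, anomaly = %b%n",
            withOnePercentFloor, Math.abs(withOnePercentFloor) > 4.0);
        System.out.printf("0.01%% floor: z = %.2f, anomaly = %b%n",
            withSmallFloor, Math.abs(withSmallFloor) > 4.0);
    }
}
```

With the 1% floor, the 99.9% to 98.5% drop stays well inside 4-sigma, so ratio metrics that sit close to 100 probably need a smaller floor, or scoring on the error rate instead of the success rate. Treat this as something to check against your own data rather than a fixed rule.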

2. Alert aggregation engine

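import org.springframework.stereotype.Service;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Alert, AlertGroup, and ServiceDependencyGraph are project types not shown in this article.
@Service
public class AlertAggregationEngine {

    private final ServiceDependencyGraph dependencyGraph;
    private final Map<String, AlertGroup> activeGroups = new ConcurrentHashMap<>();

    // Constructor injection: the final dependency graph field must be assigned here.
    public AlertAggregationEngine(ServiceDependencyGraph dependencyGraph) {
        this.dependencyGraph = dependencyGraph;
    }

    /**
     * Alert aggregation: merges multiple alerts that share a root cause into one event.
     * Synchronized so that two concurrent alerts cannot both miss the lookup and
     * create duplicate groups for the same incident.
     */
    public synchronized AlertGroup aggregate(Alert newAlert) {
        // Look for an active alert group related to this alert
        Optional<AlertGroup> existingGroup = findRelatedGroup(newAlert);

        if (existingGroup.isPresent()) {
            // Join the existing group
            AlertGroup group = existingGroup.get();
            group.addAlert(newAlert);
            return group;
        } else {
            // Create a new group
            // Note: groups are never removed from activeGroups here; closing or expiring
            // old groups has to happen elsewhere.
            AlertGroup newGroup = AlertGroup.create(newAlert);
            activeGroups.put(newGroup.getId(), newGroup);
            return newGroup;
        }
    }

    private Optional<AlertGroup> findRelatedGroup(Alert newAlert) {
        // Correlation logic:
        // 1. Time: the group was created within the last 10 minutes (filter)
        // 2. Service: the services are connected in the dependency graph (filter)
        // 3. Metric: same metric family, e.g. error rate across services or latency
        //    across databases (used for ranking in computeRelatedness, not as a filter)

        Instant windowStart = Instant.now().minus(10, ChronoUnit.MINUTES);

        return activeGroups.values().stream()
            .filter(g -> g.getCreatedAt().isAfter(windowStart))
            .filter(g -> isServiceRelated(newAlert.getServiceId(), g))
            .max(Comparator.comparingDouble(g -> computeRelatedness(newAlert, g)));
    }

    private boolean isServiceRelated(String serviceId, AlertGroup group) {
        // Check service dependency relationships
        Set<String> groupServices = group.getAlerts().stream()
            .map(Alert::getServiceId)
            .collect(Collectors.toSet());

        for (String groupService : groupServices) {
            if (dependencyGraph.areDirectlyConnected(serviceId, groupService)) {
                return true;
            }
        }

        return false;
    }

    private double computeRelatedness(Alert alert, AlertGroup group) {
        // Score how strongly the alert relates to the group:
        // count the group's alerts that fired within 5 minutes of this one
        long sharedTimeWindow = group.getAlerts().stream()
            .filter(a -> Math.abs(a.getTriggeredAt().toEpochMilli() -
                                  alert.getTriggeredAt().toEpochMilli()) < 300000) // within 5 minutes
            .count();

        // isSameMetricFamily is a helper that the article does not show
        boolean sameMetricFamily = group.getAlerts().stream()
            .anyMatch(a -> isSameMetricFamily(a.getMetricType(), alert.getMetricType()));

        // The score is not normalized: it grows with the number of time-adjacent alerts
        return sharedTimeWindow * 0.5 + (sameMetricFamily ? 0.5 : 0);
    }
}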
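
The aggregation code relies on dependencyGraph.areDirectlyConnected, which the article does not show. The sketch below is a hypothetical in-memory stand-in, assuming an undirected graph where an edge means "one service calls the other" and a service counts as connected to itself. The class name and the service names are invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for ServiceDependencyGraph; the real implementation is not shown in the article.
class InMemoryDependencyGraph {

    // Undirected adjacency: an edge means one of the two services calls the other.
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    void addDependency(String caller, String callee) {
        neighbors.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
        neighbors.computeIfAbsent(callee, k -> new HashSet<>()).add(caller);
    }

    // True when the two services share an edge, or are the same service.
    boolean areDirectlyConnected(String a, String b) {
        if (a.equals(b)) {
            return true;
        }
        return neighbors.getOrDefault(a, Set.of()).contains(b);
    }

    public static void main(String[] args) {
        InMemoryDependencyGraph graph = new InMemoryDependencyGraph();
        graph.addDependency("order-service", "mysql-replica");
        graph.addDependency("payment-service", "mysql-replica");

        // Each service is adjacent to the database it depends on.
        System.out.println(graph.areDirectlyConnected("order-service", "mysql-replica"));   // true
        // Two services that only share a database are NOT directly connected.
        System.out.println(graph.areDirectlyConnected("order-service", "payment-service")); // false
    }
}
```

With direct-edge matching, two services that share only a database are grouped together only once an alert from the database itself is already in the group. If the database alert arrives late or not at all, a check for a shared upstream dependency might be needed; whether that is worth it depends on how the real graph is modeled.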

3. LLM context enrichment

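import org.springframework.stereotype.Service;

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Spring AI imports (ChatClient, ChatResponse, Prompt, SystemMessage, UserMessage) are omitted
// because their packages differ between Spring AI releases. The chatClient.call(Prompt) style
// below follows the early ChatClient API; newer releases may expose it differently.
// The repositories, AlertGroup, and EnrichedAlert are project types not shown in this article.
@Service
public class AlertContextEnricher {

    private final ChatClient chatClient;
    private final MetricsRepository metricsRepository;
    private final DeploymentHistoryRepository deploymentHistory;
    private final IncidentRepository incidentRepo;

    // Constructor injection: all final fields must be assigned here.
    public AlertContextEnricher(ChatClient chatClient,
                                MetricsRepository metricsRepository,
                                DeploymentHistoryRepository deploymentHistory,
                                IncidentRepository incidentRepo) {
        this.chatClient = chatClient;
        this.metricsRepository = metricsRepository;
        this.deploymentHistory = deploymentHistory;
        this.incidentRepo = incidentRepo;
    }

    /**
     * Collects related context automatically before the alert is sent, so the
     * on-call engineer already has enough information to judge quickly on receipt.
     */
    public EnrichedAlert enrich(AlertGroup alertGroup) {
        String context = gatherContext(alertGroup);

        String analysisPrompt = """
            Below is a group of related alerts. Analyze them and produce a concise alert summary:

            %s

            Please provide:
            1. One-sentence summary: what is happening (written for the on-call engineer)
            2. Impact assessment: roughly how many users/requests are affected
            3. Most likely causes (1-3 possibilities)
            4. Recommended first action
            5. Urgency (CRITICAL/HIGH/MEDIUM/LOW)

            Keep the output brief: the on-call engineer must be able to read it and act within 30 seconds.
            """.formatted(context);

        ChatResponse response = chatClient.call(new Prompt(
            List.of(
                new SystemMessage("You are an experienced SRE who specializes in alert analysis and rapid response."),
                new UserMessage(analysisPrompt)
            )
        ));

        String aiAnalysis = response.getResult().getOutput().getContent();

        return EnrichedAlert.builder()
            .alertGroup(alertGroup)
            .aiSummary(aiAnalysis)
            .rawContext(context)
            .enrichedAt(Instant.now())
            .build();
    }

    private String gatherContext(AlertGroup alertGroup) {
        StringBuilder ctx = new StringBuilder();

        // Alert list
        ctx.append("## Alerts\n");
        alertGroup.getAlerts().forEach(a ->
            ctx.append(String.format("- [%s] %s: current=%.2f, threshold=%.2f\n",
                a.getSeverity(), a.getMetricName(), a.getCurrentValue(), a.getThreshold()))
        );

        // Recent deployments
        ctx.append("\n## Deployments in the last 6 hours\n");
        List<DeploymentRecord> recentDeployments = deploymentHistory.getRecent(
            alertGroup.getServiceIds(), Duration.ofHours(6)
        );
        if (recentDeployments.isEmpty()) {
            ctx.append("No deployments\n");
        } else {
            recentDeployments.forEach(d ->
                ctx.append(String.format("- [%s] %s deployed %s\n",
                    d.getTimestamp(), d.getServiceName(), d.getVersion()))
            );
        }

        // Similar past incidents
        ctx.append("\n## Similar past incidents (last 3 months)\n");
        List<IncidentRecord> similar = incidentRepo.findSimilar(
            alertGroup.getServiceIds(), alertGroup.getAlertTypes(), 3
        );
        if (similar.isEmpty()) {
            // Say so explicitly, so the model does not see an empty section
            ctx.append("No similar incidents\n");
        } else {
            similar.forEach(inc ->
                ctx.append(String.format("- [%s] root cause: %s, time to resolve: %d min\n",
                    inc.getTimestamp(), inc.getRootCause(), inc.getMttrMinutes()))
            );
        }

        return ctx.toString();
    }
}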
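
Because enrich() calls the model synchronously before the alert goes out, a slow or failing LLM call would delay the page itself. One possible guard is a time-boxed call with a plain-text fallback. This is a minimal sketch assuming Java 9+ (for CompletableFuture.completeOnTimeout); the class name, the summarize helper, and the timeout values are hypothetical and not part of the article's code.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: wraps any summary-producing call with a timeout and a fallback.
class SummaryWithFallback {

    // Runs the (possibly slow) call on the common pool and returns the fallback text
    // if it does not finish within the timeout or if it throws.
    static String summarize(Supplier<String> llmCall, String fallback, Duration timeout) {
        return CompletableFuture.supplyAsync(llmCall)
            .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
            .exceptionally(ex -> fallback)
            .join();
    }

    public static void main(String[] args) {
        String fallback = "AI summary unavailable; see raw alert context.";

        // A call that fails: the alert still goes out with the fallback text.
        String failed = summarize(() -> {
            throw new IllegalStateException("model endpoint unreachable");
        }, fallback, Duration.ofSeconds(5));
        System.out.println(failed);

        // A call that is too slow: the fallback is used once the timeout elapses.
        String slow = summarize(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "late summary";
        }, fallback, Duration.ofMillis(200));
        System.out.println(slow);
    }
}
```

Note that the timeout only stops waiting; it does not cancel the underlying call, which keeps running on its pool thread. In enrich(), the supplier would wrap the chatClient call and the fallback could be built from the raw context.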

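Impact on alert fatigue

Here is how our numbers changed after rolling out intelligent alerting (compared after 6 months):

Alert notifications: from a daily average of 320 down to 41 (-87%)
Time from alert to response (MTTA): from an average of 12 minutes down to 5 minutes
False positive rate: from 45% down to 8%
With AI summaries in the alerts, engineers located problems about 60% faster

The most important metric, though, is how well the engineers sleep at night. They get woken up in the early hours less often, and when they are woken up they can make a call faster, so the overall pressure is much lower. That is hard to quantify, but it matters a great deal for the health of the team.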