Post 1769: Smart Alert Noise Reduction by Aggregating Duplicate Alerts with Semantic Similarity

Lao Zhang · 2026/4/30 · about 9 minutes

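Alert fatigue is one of the most common pain points for operations teams. Many teams know they have the problem but do not know where to start fixing it.

This post covers one specific approach: semantic similarity aggregation. Instead of string-matching on alert names, it works from the meaning of each alert and groups together the alerts that actually describe the same problem.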

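Layers of alert noise reduction

Before getting to semantic aggregation, here is the full picture. The layers below are meant to be applied in order, and semantic aggregation is the most advanced of them.

Layer 1: Fix the source

Check whether the alert rules themselves are reasonable. Thresholds that are too sensitive, rules that ignore seasonal patterns, the same metric monitored by several overlapping rules: all of these can be fixed at the rule level, with no AI involved.

Layer 2: Time-window suppression

When the same alert fires repeatedly within a short period, notify only once. Most monitoring stacks support this out of the box (in the Prometheus ecosystem it is the repeat_interval setting in Alertmanager).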

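Layer 3: Topology-based suppression

When the root-cause service is alerting, suppress the cascading alerts from all downstream services and keep only the root-cause alert. This only works if you have an accurate service topology graph.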
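To make the idea concrete, here is a minimal sketch of topology-based suppression. It is not taken from any particular monitoring system: the class name `TopologySuppressor`, the `dependsOn` map, and the example topology are all assumptions for illustration. The map records which services each service calls, the graph is assumed to be acyclic, and an alerting service counts as a root-cause candidate only if none of its direct or transitive dependencies is alerting too.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: keep only alerts from services with no alerting dependency. */
final class TopologySuppressor {
    // service -> the services it calls (assumed to form an acyclic graph)
    private final Map<String, Set<String>> dependsOn;

    TopologySuppressor(Map<String, Set<String>> dependsOn) {
        this.dependsOn = dependsOn;
    }

    /** Returns the alerting services whose dependencies are all healthy. */
    Set<String> rootCauses(Set<String> alerting) {
        Set<String> roots = new TreeSet<>();
        for (String service : alerting) {
            if (!hasAlertingDependency(service, alerting)) {
                roots.add(service);
            }
        }
        return roots;
    }

    // Depth-first walk over direct and transitive dependencies.
    private boolean hasAlertingDependency(String service, Set<String> alerting) {
        Deque<String> stack = new ArrayDeque<>(dependsOn.getOrDefault(service, Set.of()));
        Set<String> seen = new HashSet<>();
        seen.add(service);
        while (!stack.isEmpty()) {
            String dep = stack.pop();
            if (!seen.add(dep)) {
                continue; // already visited
            }
            if (alerting.contains(dep)) {
                return true;
            }
            stack.addAll(dependsOn.getOrDefault(dep, Set.of()));
        }
        return false;
    }
}
```

With a topology where api-gateway and payment-gateway both call order-service, and all three are alerting, `rootCauses` keeps only order-service; the other two alerts are the ones to suppress.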

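Layer 4: Semantic similarity aggregation

Group alerts that span different services and metrics but describe the same business problem. Rule-based methods cannot do this; it requires semantic understanding.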

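Why string matching is not enough

Traditional alert aggregation is usually based on:

The same alertname
The same combination of labels
Keyword matching on strings

But real alert messages tend to look like this:

Alert A: "order-service [CRITICAL] HTTP 5xx error rate exceeds 5% - current: 12.3%"
Alert B: "payment-gateway [CRITICAL] High error rate detected, 503 responses increasing"
Alert C: "api-gateway [WARNING] Upstream server errors, downstream order-service returning 500"

These three alerts describe the same problem (cascading errors caused by an order-service failure), but string matching cannot link them, because the wording and format are completely different.

Semantic similarity can: the three alerts are closely related in meaning and should be placed in one group.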
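"Semantic similarity" here usually comes down to comparing embedding vectors by cosine similarity. The sketch below shows only that comparison; the class name `CosineSimilarity` is made up for illustration, and the real vectors would come from an embedding model rather than being written by hand.

```java
/** Illustrative helper: cosine similarity between two embedding vectors. */
final class CosineSimilarity {
    private CosineSimilarity() {}

    /**
     * Returns a value in [-1, 1]; values near 1 mean the vectors point in
     * almost the same direction. Returns 0 if either vector is all zeros.
     */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two alerts are treated as "the same problem" when this score is above a chosen threshold. The right threshold depends on the embedding model and on how the alert text is normalized, so it needs tuning against real alerts.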

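Overall architecture

The pipeline has four parts, each covered below: alert embedding, a vector store (Milvus), the clustering engine, and the notification decider.

Alert embedding

The core step is converting each alert message into a semantic vector.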

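import java.time.Instant;

import lombok.Builder;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// NormalizedAlert and OpenAiEmbeddingService are not shown in this post.
@Service
@Slf4j
public class AlertEmbeddingService {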
    
    @Autowired
    private OpenAiEmbeddingService embeddingService;
    
    @Data
    @Builder
    public static class EmbeddedAlert {
        private String alertId;
        private NormalizedAlert alert;
        private float[] embedding;          // semantic vector (1536 dimensions, text-embedding-3-small)
        private String normalizedText;      // normalized text that was fed to the embedding model
        private Instant embeddedAt;
    }
    
    public EmbeddedAlert embed(NormalizedAlert alert) {
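        // Key step: build the normalized text that gets embedded.
        // Keep the semantic signal and strip volatile numbers, so that "12.3%" and "15.7%"
        // do not make two instances of the same alert look different.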
        String normalizedText = buildNormalizedText(alert);
        
        float[] embedding = embeddingService.embed(normalizedText);
        
        return EmbeddedAlert.builder()
            .alertId(alert.getAlertId())
            .alert(alert)
            .embedding(embedding)
            .normalizedText(normalizedText)
            .embeddedAt(Instant.now())
            .build();
    }
    
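    private String buildNormalizedText(NormalizedAlert alert) {
        // Goals of normalization:
        // 1. Keep the semantic keywords (service, error type, impact)
        // 2. Drop or normalize numeric values (placeholders instead of concrete numbers)
        // 3. Use one consistent format
        
        StringBuilder sb = new StringBuilder();
        
        // The service name is a strong semantic feature, but we want to aggregate across services,
        // so limit its weight: use the service's functional category rather than its exact name.
        String serviceCategory = classifyService(alert.getServiceName());
        sb.append("service_category: ").append(serviceCategory).append(" ");
        
        // Alert type
        sb.append("alert_type: ").append(normalizeAlertName(alert.getMetricName())).append(" ");
        
        // Severity
        sb.append("severity: ").append(alert.getSeverity().name()).append(" ");
        
        // Alert description with numbers replaced by placeholders.
        // Order matters: the specific patterns must run before the catch-all.
        String descNormalized = alert.getRawMessage()
            .replaceAll("\\d+(\\.\\d+)?\\s*%", "HIGH_PERCENTAGE")   // percentages, both "5%" and "12.3%"
            .replaceAll("\\d+(\\.\\d+)?\\s*ms\\b", "HIGH_LATENCY")  // latency values, both "250 ms" and "250ms"
            .replaceAll("\\b\\d+(\\.\\d+)?\\b", "NUM");             // other standalone numbers; tokens like "5xx" are kept
        sb.append("description: ").append(descNormalized);
        
        return sb.toString();
    }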
    
    private String classifyService(String serviceName) {
        // Map a concrete service name to a functional category,
        // so that "order-service" and "order-api" both become "order_service".
        if (serviceName.contains("order")) return "order_service";
        if (serviceName.contains("payment") || serviceName.contains("pay")) return "payment_service";
        if (serviceName.contains("user") || serviceName.contains("account")) return "user_service";
        if (serviceName.contains("gateway") || serviceName.contains("proxy")) return "gateway";
        if (serviceName.contains("db") || serviceName.contains("mysql") || 
            serviceName.contains("redis")) return "database";
        return "other_service";
    }
    
    private String normalizeAlertName(String alertName) {
        // Normalize the alert name into a coarse alert type
        String lower = alertName.toLowerCase();
        if (lower.contains("error_rate") || lower.contains("5xx") || 
            lower.contains("error rate")) return "error_rate_anomaly";
        if (lower.contains("latency") || lower.contains("response_time") ||
            lower.contains("slow")) return "latency_anomaly";
        if (lower.contains("cpu")) return "cpu_resource";
        if (lower.contains("memory") || lower.contains("oom")) return "memory_resource";
        if (lower.contains("connection") || lower.contains("pool")) return "connection_resource";
        if (lower.contains("disk")) return "disk_resource";
        return lower.replaceAll("[^a-z_]", "_");
    }
}
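The numeric normalization is the part of this class that is easiest to get subtly wrong, so it is worth checking in isolation. The sketch below is a standalone version of that step only. The class name `AlertTextNormalizer` is hypothetical, the placeholder names follow the code above, and the patterns are one possible choice: they cover integer as well as decimal percentages and leave tokens such as "5xx" and "p99" untouched.

```java
import java.util.regex.Pattern;

/** Hypothetical standalone version of the number-normalization step. */
final class AlertTextNormalizer {
    // Specific patterns first, catch-all last.
    private static final Pattern PERCENT = Pattern.compile("\\d+(?:\\.\\d+)?\\s*%");
    private static final Pattern LATENCY = Pattern.compile("\\d+(?:\\.\\d+)?\\s*ms\\b");
    // Word boundaries keep mixed tokens such as "5xx" or "p99" intact.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    private AlertTextNormalizer() {}

    static String normalize(String rawMessage) {
        String s = PERCENT.matcher(rawMessage).replaceAll("HIGH_PERCENTAGE");
        s = LATENCY.matcher(s).replaceAll("HIGH_LATENCY");
        return NUMBER.matcher(s).replaceAll("NUM");
    }
}
```

With these patterns, "exceeds 5% - current: 12.3%" and "exceeds 5% - current: 15.7%" normalize to the same text, which is the property the embedding step relies on. One trade-off to be aware of: standalone status codes such as "503" also collapse to NUM. If status codes matter for grouping, they would need their own pattern ahead of the catch-all.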

Vector database integration (Milvus)

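import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import io.milvus.client.MilvusServiceClient;
import io.milvus.param.MetricType;
import io.milvus.param.dml.InsertParam;
import io.milvus.param.dml.SearchParam;
import io.milvus.response.SearchResultsWrapper;
import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2.x
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// collectionExists(), createCollection(), createIndex(), toFloatList() and the
// SimilarAlert type are helpers that are not shown in this post.
@Service
@Slf4j
public class AlertVectorStore {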
    
    @Autowired
    private MilvusServiceClient milvusClient;
    
    private static final String COLLECTION_NAME = "alert_embeddings";
    private static final int VECTOR_DIM = 1536;
    
    @PostConstruct
    public void initCollection() {
        // Create the collection and its index if they do not exist yet
        if (!collectionExists()) {
            createCollection();
            createIndex();
        }
    }
    
    public void store(EmbeddedAlert embeddedAlert) {
        List<InsertParam.Field> fields = Arrays.asList(
            new InsertParam.Field("alert_id", 
                List.of(embeddedAlert.getAlertId())),
            new InsertParam.Field("embedding", 
                // one row per entity; toFloatList is assumed to return List<Float>
                List.of(toFloatList(embeddedAlert.getEmbedding()))),
            new InsertParam.Field("service_name", 
                List.of(embeddedAlert.getAlert().getServiceName())),
            new InsertParam.Field("timestamp", 
                List.of(embeddedAlert.getEmbeddedAt().toEpochMilli())),
            new InsertParam.Field("severity",
                List.of(embeddedAlert.getAlert().getSeverity().name()))
        );
        
        InsertParam insertParam = InsertParam.newBuilder()
            .withCollectionName(COLLECTION_NAME)
            .withFields(fields)
            .build();
        
        milvusClient.insert(insertParam);
    }
    
    public List<SimilarAlert> findSimilar(float[] queryEmbedding, 
                                           float similarityThreshold,
                                           int topK,
                                           Duration timeWindow) {
        long timeWindowStart = Instant.now().minus(timeWindow).toEpochMilli();
        
        SearchParam searchParam = SearchParam.newBuilder()
            .withCollectionName(COLLECTION_NAME)
            .withVectorFieldName("embedding")   // the vector field to search
            .withMetricType(MetricType.COSINE)  // cosine similarity: higher score means more similar
            .withOutFields(List.of("alert_id", "service_name", "severity", "timestamp"))
            .withTopK(topK)
            // a single query vector; toFloatList is assumed to return List<Float>
            .withFloatVectors(List.of(toFloatList(queryEmbedding)))
            .withExpr(String.format("timestamp >= %d", timeWindowStart))  // only search inside the time window
            .build();
        
        SearchResultsWrapper results = new SearchResultsWrapper(
            milvusClient.search(searchParam).getData().getResults());
        
        return parseSearchResults(results, similarityThreshold);
    }
    
    private List<SimilarAlert> parseSearchResults(SearchResultsWrapper results, 
                                                    float threshold) {
        List<SimilarAlert> similar = new ArrayList<>();
        
        // Scores and output fields for query vector 0 share the same ordering,
        // so index the field data by the hit's position, not by how many hits were kept.
        List<SearchResultsWrapper.IDScore> scores = results.getIDScore(0);
        List<?> alertIds = results.getFieldData("alert_id", 0);
        for (int i = 0; i < scores.size(); i++) {
            float score = scores.get(i).getScore();
            if (score >= threshold) {
                similar.add(SimilarAlert.builder()
                    .alertId(alertIds.get(i).toString())
                    .similarity(score)
                    .build());
            }
        }
        
        return similar;
    }
}
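For unit tests or a first experiment, the same lookup contract (time window, similarity threshold, top K) can be reproduced without a vector database. The sketch below is a brute-force, in-memory stand-in and not a replacement for Milvus. The names `InMemoryAlertVectorStore`, `Entry`, and `Hit` are made up for this example, `now` is passed in explicitly so tests are deterministic, and records require Java 16 or later.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical brute-force stand-in for the vector store, for tests only. */
final class InMemoryAlertVectorStore {
    record Entry(String alertId, float[] embedding, Instant timestamp) {}
    record Hit(String alertId, double similarity) {}

    private final List<Entry> entries = new ArrayList<>();

    void store(String alertId, float[] embedding, Instant timestamp) {
        entries.add(new Entry(alertId, embedding, timestamp));
    }

    /** Hits inside the window with similarity >= threshold, best first, at most topK. */
    List<Hit> findSimilar(float[] query, double threshold, int topK,
                          Duration window, Instant now) {
        Instant windowStart = now.minus(window);
        return entries.stream()
            .filter(e -> !e.timestamp().isBefore(windowStart))
            .map(e -> new Hit(e.alertId(), cosine(query, e.embedding())))
            .filter(h -> h.similarity() >= threshold)
            .sorted(Comparator.comparingDouble(Hit::similarity).reversed())
            .limit(topK)
            .collect(Collectors.toList());
    }

    // Cosine similarity; returns 0 when either vector is all zeros.
    private static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

A linear scan like this is fine for a few thousand alerts inside a 30-minute window. An index-backed store starts to pay off when the volume or the retention window grows.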

Alert clustering engine

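import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import lombok.Builder;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// AlertGroupRepository, GroupStatus, SimilarAlert, findGroupIdByAlertId() and
// normalizeMetricForDisplay() are not shown in this post.
@Service
@Slf4j
public class AlertClusteringEngine {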
    
    @Autowired
    private AlertEmbeddingService embeddingService;
    
    @Autowired
    private AlertVectorStore vectorStore;
    
    @Autowired
    private AlertGroupRepository groupRepository;
    
    private static final float SIMILARITY_THRESHOLD = 0.85f;
    private static final Duration CLUSTERING_TIME_WINDOW = Duration.ofMinutes(30);
    
    @Data
    @Builder
    public static class AlertGroup {
        private String groupId;
        private String groupName;      // display name; meant to be LLM-generated, generateGroupName() below uses a simple template
        private List<String> alertIds;
        private NormalizedAlert representativeAlert; // representative alert for the group
        private Instant firstSeen;
        private Instant lastUpdated;
        private int totalCount;
        private List<String> affectedServices;
        private GroupStatus status;  // ACTIVE/RESOLVED
    }
    
    public AlertGroup processAlert(NormalizedAlert alert) {
        // 1. Embed the alert
        EmbeddedAlert embedded = embeddingService.embed(alert);
        
        // 2. Look up similar alerts (before storing, so the alert cannot match itself)
        List<SimilarAlert> similar = vectorStore.findSimilar(
            embedded.getEmbedding(),
            SIMILARITY_THRESHOLD,
            10,
            CLUSTERING_TIME_WINDOW
        );
        
        // 3. Store the vector
        vectorStore.store(embedded);
        
        if (similar.isEmpty()) {
            // No similar alert: start a new group
            return createNewGroup(alert);
        } else {
            // Similar alerts found: join the group of the closest match
            String groupId = findGroupIdByAlertId(similar.get(0).getAlertId());
            if (groupId != null) {
                return addToExistingGroup(alert, groupId);
            } else {
                // The closest match has no group on record: fall back to a new group
                return createNewGroup(alert);
            }
        }
    }
    
    private AlertGroup createNewGroup(NormalizedAlert alert) {
        AlertGroup group = AlertGroup.builder()
            .groupId(UUID.randomUUID().toString())
            .alertIds(new ArrayList<>(List.of(alert.getAlertId())))
            .representativeAlert(alert)
            .firstSeen(alert.getTriggeredAt())
            .lastUpdated(alert.getTriggeredAt())
            .totalCount(1)
            .affectedServices(new ArrayList<>(List.of(alert.getServiceName())))
            .status(GroupStatus.ACTIVE)
            .build();
        
        // Build a human-readable name for the group
        group.setGroupName(generateGroupName(group));
        
        groupRepository.save(group);
        
        log.info("Created new alert group: groupId={}, name={}", 
            group.getGroupId(), group.getGroupName());
        
        return group;
    }
    
    private AlertGroup addToExistingGroup(NormalizedAlert alert, String groupId) {
        AlertGroup group = groupRepository.findById(groupId)
            .orElseThrow(() -> new IllegalStateException("Group not found: " + groupId));
        
        group.getAlertIds().add(alert.getAlertId());
        group.setLastUpdated(alert.getTriggeredAt());
        group.setTotalCount(group.getTotalCount() + 1);
        
        if (!group.getAffectedServices().contains(alert.getServiceName())) {
            group.getAffectedServices().add(alert.getServiceName());
        }
        
        groupRepository.save(group);
        
        log.info("Alert added to existing group: groupId={}, totalCount={}", 
            groupId, group.getTotalCount());
        
        return group;
    }
    
    private String generateGroupName(AlertGroup group) {
        // Build a readable group name from the representative alert
        NormalizedAlert rep = group.getRepresentativeAlert();
        String metric = normalizeMetricForDisplay(rep.getMetricName());
        String service = rep.getServiceName();
        
        return String.format("%s - %s", service, metric);
    }
}
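The grouping rule in processAlert is a greedy, nearest-neighbour assignment: a new alert joins the group of its closest earlier match, or starts a new group. The sketch below isolates just that bookkeeping so its behaviour is easy to reason about. `GreedyAlertGrouper` and its sequential group ids are made up for illustration; the `nearestAlertId` argument stands for whatever the similarity search returned as the best match above the threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the greedy "join the nearest neighbour's group" rule. */
final class GreedyAlertGrouper {
    private final Map<String, String> groupIdByAlertId = new HashMap<>();
    private final Map<String, List<String>> alertIdsByGroupId = new LinkedHashMap<>();
    private int nextGroupNumber = 1;

    /**
     * Assigns alertId to the group of nearestAlertId when that alert is known;
     * otherwise opens a new group. Returns the group id.
     */
    String assign(String alertId, Optional<String> nearestAlertId) {
        // Optional.map yields empty when the nearest alert has no group on record.
        String groupId = nearestAlertId.map(groupIdByAlertId::get).orElse(null);
        if (groupId == null) {
            groupId = "group-" + nextGroupNumber++;
            alertIdsByGroupId.put(groupId, new ArrayList<>());
        }
        groupIdByAlertId.put(alertId, groupId);
        alertIdsByGroupId.get(groupId).add(alertId);
        return groupId;
    }

    List<String> members(String groupId) {
        return List.copyOf(alertIdsByGroupId.getOrDefault(groupId, List.of()));
    }
}
```

One property of this rule is chaining. If B is similar to A and C is similar to B, all three land in one group even when A and C would not pass the threshold against each other. Over a long incident a group can drift this way, which is one reason to bound the lookup with a time window.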

Notification decider

After aggregation, decide when to notify and what to send.

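import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// DingtalkService, OpenAiService, formatTime() and formatDuration() are not shown in this post.
@Service
@Slf4j
public class AlertNotificationDecider {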
    
    @Autowired
    private DingtalkService dingtalkService;
    
    @Autowired
    private OpenAiService openAiService;
    
    // Last notification time per group, used to enforce a minimum interval (avoids frequent interruptions)
    private final Map<String, Instant> lastNotificationTime = new ConcurrentHashMap<>();
    
    private static final Duration MIN_NOTIFICATION_INTERVAL = Duration.ofMinutes(15);
    
    public void decide(AlertGroup group, NormalizedAlert newAlert) {
        boolean isFirstAlert = group.getTotalCount() == 1;
        boolean shouldNotify = false;
        String notificationReason = "";
        
        if (isFirstAlert) {
            // New group: notify immediately
            shouldNotify = true;
            notificationReason = "new alert";
        } else {
            Instant lastNotify = lastNotificationTime.get(group.getGroupId());
            
            if (lastNotify == null || 
                Duration.between(lastNotify, Instant.now()).compareTo(MIN_NOTIFICATION_INTERVAL) > 0) {
                // Minimum interval exceeded: send an update notification
                shouldNotify = true;
                notificationReason = String.format("group keeps growing (%d alerts so far)", group.getTotalCount());
            }
            // Otherwise: suppress, no notification
        }
        
        if (shouldNotify) {
            String message = buildNotificationMessage(group, newAlert, notificationReason);
            dingtalkService.sendMarkdown("Alert notification", message);
            lastNotificationTime.put(group.getGroupId(), Instant.now());
        } else {
            log.debug("Alert suppressed: groupId={}, reason=within_min_interval", group.getGroupId());
        }
    }
    
    private String buildNotificationMessage(AlertGroup group, 
                                             NormalizedAlert trigger,
                                             String reason) {
        boolean isFirstAlert = group.getTotalCount() == 1;
        
        StringBuilder sb = new StringBuilder();
        
        if (isFirstAlert) {
            sb.append(String.format("## 🚨 New alert: %s\n\n", group.getGroupName()));
        } else {
            sb.append(String.format("## ⚡ Alert update: %s\n\n", group.getGroupName()));
        }
        
        sb.append(String.format("**Severity:** %s\n", trigger.getSeverity()));
        sb.append(String.format("**Affected services:** %s\n", 
            String.join(", ", group.getAffectedServices())));
        sb.append(String.format("**Alerts in group:** %d\n", group.getTotalCount()));
        sb.append(String.format("**First seen:** %s\n", formatTime(group.getFirstSeen())));
        
        if (!isFirstAlert) {
            sb.append(String.format("**Duration:** %s\n", 
                formatDuration(group.getFirstSeen(), Instant.now())));
            sb.append(String.format("**Notification reason:** %s\n", reason));
        }
        
        sb.append("\n**Alert details:**\n");
        sb.append(trigger.getRawMessage());
        
        return sb.toString();
    }
}
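The interval check above reads the last notification time and writes it back in two separate steps, so two alerts for the same group arriving at the same moment could both pass. A minimal sketch of an atomic variant follows, assuming a single application instance. `GroupNotificationThrottle` is a hypothetical name, and the current time is passed in as a parameter so the behaviour can be tested without waiting.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-group throttle with an atomic check-and-update. */
final class GroupNotificationThrottle {
    private final Map<String, Instant> lastNotified = new ConcurrentHashMap<>();
    private final Duration minInterval;

    GroupNotificationThrottle(Duration minInterval) {
        this.minInterval = minInterval;
    }

    /**
     * Returns true and records `now` if the group was never notified, or if at
     * least minInterval has passed since its last notification.
     */
    boolean shouldNotify(String groupId, Instant now) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent callers cannot both pass.
        lastNotified.compute(groupId, (id, last) -> {
            if (last == null || Duration.between(last, now).compareTo(minInterval) >= 0) {
                allowed[0] = true;
                return now;
            }
            return last;
        });
        return allowed[0];
    }
}
```

Because the map lives in memory, the state is lost on restart and is not shared between replicas. With more than one instance, the same check would have to go through a shared store.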

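Results and pitfalls

Two months of data collected after the rollout:

Metric	Before rollout	After rollout
Daily alert notifications	147 on average	23 on average
Noise rate (duplicate/cascading alerts)	about 78%	about 18%
Engineer satisfaction with "alerts are meaningful"	2.1/5	4.0/5
Miss rate (should have notified but was suppressed)	-	1.2% (manual review)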