Post 2153: Building Model Evaluation Datasets (Engineering Methods for Mining Real Test Sets from Business Logs)

Lao Zhang | 2026/4/30 | about 9 min read

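Audience: AI engineers who need to build high-quality evaluation datasets | Reading time: about 18 minutes | Key takeaway: mine real, representative test sets from production logs efficiently, instead of hand-writing test cases off the top of your head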

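The quality of the evaluation dataset determines how valid the evaluation is. Many teams overlook this.

I have seen many teams build their evaluation set like this: an engineer sits down, makes up 50 questions, hand-writes 50 reference answers, and then uses those 50 samples to claim that "our model's accuracy is 90%".

Does that 90% mean anything? Very little. The questions engineers come up with tend to be the ones they believe "the system should answer well", not the ones users actually ask. When the test set does not represent the real distribution, the evaluation conclusions are unreliable.

A test set with real value comes from production logs: the questions users actually sent, in their own wording, covering the real distribution of their needs. This post covers how to mine logs and build a test set from them.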

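Why production logs beat hand-written test sets

Problems with hand-written test sets:
1. Engineers lean toward "normal" questions, so edge cases are under-covered
2. The wording is too polished and does not reflect how users actually express themselves
3. The distribution is unrealistic: frequent questions and rare long-tail questions end up with roughly equal weight
4. There is no real intent distribution (you do not know which kinds of questions dominate)

Advantages of production logs:
1. Real user wording, including typos, colloquial phrasing, and omitted subjects
2. Real distribution: frequent questions naturally make up the majority
3. They contain edge cases you never thought of
4. Temporal continuity: you can see how the question distribution shifts over time

Production logs have their own problems, of course: there are no reference answers, so they need annotation; quality is uneven, so they need filtering; and the volume is too large, so they need sampling.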

The full pipeline from logs to test set

Step 1: Log cleaning and filtering

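import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

/**
 * Log cleaning service.
 *
 * Filters out interaction records that are not suitable for a test set.
 */
@Service
@RequiredArgsConstructor
@Slf4j
public class LogCleaningService {

    // ASCII test keywords, matched as whole words so that "latest" or "contest" is not dropped
    private static final Pattern TEST_KEYWORD = Pattern.compile("(?<![a-z])(test|hello)(?![a-z])");

    @Value("${dataset.min-input-length:10}")
    private int minInputLength;

    @Value("${dataset.max-input-length:500}")
    private int maxInputLength;

    /**
     * Keep only the raw log entries that are usable as test samples.
     */
    public List<CleanedInteraction> cleanLogs(List<RawInteraction> rawLogs) {
        return rawLogs.stream()
            .filter(this::isValidInteraction)
            .map(this::normalize)
            .collect(Collectors.toList());
    }

    private boolean isValidInteraction(RawInteraction raw) {
        String input = raw.getUserInput();
        String output = raw.getLlmOutput();
        
        // Length filter
        if (input == null || input.trim().length() < minInputLength) return false;
        if (input.length() > maxInputLength) return false;
        
        // Drop test requests (they usually contain test keywords)
        if (isTestRequest(input)) return false;
        
        // Drop API errors (output produced during an outage says nothing about model quality).
        // The literals match the system's Chinese error messages:
        // "系统异常" = "system error", "服务暂时不可用" = "service temporarily unavailable".
        if (output == null || output.contains("系统异常") || output.contains("服务暂时不可用")) return false;
        
        // Drop timed-out responses
        if (raw.getLatencyMs() > 30000) return false;
        
        // Duplicates are not filtered here; deduplication happens in a later step,
        // where semantic matching works better than plain string matching.
        return true;
    }

    private boolean isTestRequest(String input) {
        String lower = input.trim().toLowerCase();
        // "测试" = "test", "你好吗" = "how are you"; the literals must match the raw Chinese logs
        return TEST_KEYWORD.matcher(lower).find() || lower.contains("测试") ||
               lower.contains("你好吗") ||
               lower.equals("1") || lower.equals("？") || lower.equals("?");
    }

    private CleanedInteraction normalize(RawInteraction raw) {
        String input = raw.getUserInput().trim();
        // Mask sensitive data: phone numbers, ID-card numbers, email addresses
        input = desensitize(input);
        
        return CleanedInteraction.builder()
            .id(raw.getId())
            .userInput(input)
            .llmOutput(raw.getLlmOutput())
            .timestamp(raw.getTimestamp())
            .userId(raw.getUserId())
            .sessionId(raw.getSessionId())
            .metadata(raw.getMetadata())
            .build();
    }

    private String desensitize(String text) {
        // Mask ID-card numbers first: an 18-character ID contains 11-digit runs that the
        // phone pattern would otherwise match. The longer alternative is tried first so
        // that \d{15} cannot swallow the front of an 18-character ID.
        text = text.replaceAll("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)", "***");
        // Mask mobile phone numbers
        text = text.replaceAll("(?<!\\d)1[3-9]\\d{9}(?!\\d)", "1****8888");
        // Mask email addresses
        text = text.replaceAll("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "***@***.com");
        return text;
    }
}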
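
When masking identifiers with regular expressions, both the order of the patterns and the digit boundaries matter: an 18-character ID number contains 11-digit runs that a phone pattern can match, and a 15-digit alternative can swallow the front of an 18-digit ID. The following is a minimal sketch of one way to handle this, not the service's actual implementation. The class name MaskingSketch and the placeholder tokens <ID>, <PHONE> and <EMAIL> are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class MaskingSketch {

    // Longer alternative first; lookarounds stop matches inside longer digit runs
    private static final Pattern ID_CARD =
        Pattern.compile("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    private static final Pattern PHONE =
        Pattern.compile("(?<!\\d)1[3-9]\\d{9}(?!\\d)");
    private static final Pattern EMAIL =
        Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

    /** Masks ID numbers before phone numbers, then email addresses. */
    public static String mask(String text) {
        String out = ID_CARD.matcher(text).replaceAll("<ID>");
        out = PHONE.matcher(out).replaceAll("<PHONE>");
        out = EMAIL.matcher(out).replaceAll("<EMAIL>");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mask("ID 11010519900307123X, phone 13812345678, mail a.b@example.com"));
        // A 16-digit number is neither an ID nor a phone number, so it is left untouched
        System.out.println(mask("card 6222020200112233"));
    }
}
```

Distinct placeholder tokens also keep the type of the masked value visible to annotators, which a single "***" does not.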

Step 2: Intent classification and semantic clustering

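import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

/**
 * Intent clustering service.
 *
 * Groups a large volume of logs into intent categories for stratified sampling.
 * The imports above assume Spring AI provides ChatClient and EmbeddingModel.
 */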
@Service
@RequiredArgsConstructor
@Slf4j
public class IntentClusteringService {

    private final EmbeddingModel embeddingModel;
    private final ChatClient analysisClient;

    /**
     * Cluster the cleaned interactions by intent.
     *
     * Runs K-Means over the embedding vectors.
     */
    public List<IntentCluster> clusterByIntent(List<CleanedInteraction> interactions, int numClusters) {
        log.info("Clustering {} interactions by intent, target cluster count={}", interactions.size(), numClusters);
        
        // 1. Compute embeddings
        List<float[]> embeddings = interactions.parallelStream()
            .map(i -> embeddingModel.embed(i.getUserInput()))
            .collect(Collectors.toList());
        
        // 2. K-Means clustering
        int[][] assignments = kMeans(embeddings, numClusters, 100);
        
        // 3. Group by cluster
        Map<Integer, List<CleanedInteraction>> clusterMap = new HashMap<>();
        for (int i = 0; i < interactions.size(); i++) {
            int cluster = assignments[0][i];
            clusterMap.computeIfAbsent(cluster, k -> new ArrayList<>()).add(interactions.get(i));
        }
        
        // 4. Generate a description for each cluster
        List<IntentCluster> clusters = new ArrayList<>();
        for (Map.Entry<Integer, List<CleanedInteraction>> entry : clusterMap.entrySet()) {
            List<CleanedInteraction> clusterItems = entry.getValue();
            
            // Pick representative samples (ideally the 5 closest to the centroid;
            // selectRepresentatives below is a simplified stand-in)
            List<CleanedInteraction> representatives = selectRepresentatives(
                clusterItems, embeddings, entry.getKey(), 5
            );
            
            // Ask the LLM to name the intent of this cluster
            String intentName = generateIntentName(representatives);
            
            clusters.add(IntentCluster.builder()
                .clusterId(entry.getKey())
                .intentName(intentName)
                .totalCount(clusterItems.size())
                .percentage((double) clusterItems.size() / interactions.size())
                .representatives(representatives)
                .allItems(clusterItems)
                .build());
        }
        
        // Sort by frequency, largest first
        clusters.sort(Comparator.comparingInt(IntentCluster::getTotalCount).reversed());
        
        // SLF4J placeholders do not support format specifiers, so format the percentage explicitly
        log.info("Clustering done, intent distribution:");
        clusters.forEach(c -> log.info("  {}: {} items ({}%)", 
            c.getIntentName(), c.getTotalCount(), String.format("%.1f", c.getPercentage() * 100)));
        
        return clusters;
    }

    private String generateIntentName(List<CleanedInteraction> representatives) {
        String examples = representatives.stream()
            .map(r -> "- " + r.getUserInput())
            .collect(Collectors.joining("\n"));
        
        String prompt = String.format("""
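            Below is a group of similar user questions. Summarize the main intent or topic of the group in a short phrase of at most five words.
            Output only the intent name, with no other explanation.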
            
            %s
            """, examples);
        
        return analysisClient.prompt().user(prompt).call().content().trim();
    }

    // Simplified K-Means (for production, prefer Apache Commons Math or a dedicated ML library)
    private int[][] kMeans(List<float[]> embeddings, int k, int maxIter) {
        int n = embeddings.size();
        int dim = embeddings.get(0).length;
        
        // Initialize the centers from randomly chosen points (fixed seed for reproducibility)
        Random rand = new Random(42);
        float[][] centers = new float[k][dim];
        for (int i = 0; i < k; i++) {
            centers[i] = embeddings.get(rand.nextInt(n)).clone();
        }
        
        int[] assignments = new int[n];
        
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            
            // Assignment step: attach each point to its nearest center
            for (int i = 0; i < n; i++) {
                int bestCluster = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = euclideanDist(embeddings.get(i), centers[j]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        bestCluster = j;
                    }
                }
                if (assignments[i] != bestCluster) {
                    assignments[i] = bestCluster;
                    changed = true;
                }
            }
            
            if (!changed) break;
            
            // Update step: recompute each center as the mean of its members
            float[][] newCenters = new float[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                int c = assignments[i];
                for (int d = 0; d < dim; d++) {
                    newCenters[c][d] += embeddings.get(i)[d];
                }
                counts[c]++;
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centers[j][d] = newCenters[j][d] / counts[j];
                    }
                }
            }
        }
        
        return new int[][]{assignments};
    }

    private double euclideanDist(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
    
    private List<CleanedInteraction> selectRepresentatives(
            List<CleanedInteraction> items, 
            List<float[]> allEmbeddings,
            int clusterId, 
            int k) {
        // Simplified: take the first k items instead of the k nearest to the centroid
        return items.stream().limit(k).collect(Collectors.toList());
    }
}
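
selectRepresentatives above takes the first k items as a simplification. A minimal sketch of a nearest-to-centroid selection could look like the following, assuming the caller passes in only the embeddings that belong to one cluster. The class and method names are hypothetical and not part of the original service.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RepresentativeSketch {

    /** Returns the indices (into members) of the k vectors closest to the mean of members. */
    public static List<Integer> nearestToCentroid(List<float[]> members, int k) {
        if (members.isEmpty() || k <= 0) return new ArrayList<>();
        int dim = members.get(0).length;

        // Centroid = component-wise mean of all member vectors
        double[] centroid = new double[dim];
        for (float[] v : members) {
            for (int d = 0; d < dim; d++) centroid[d] += v[d];
        }
        for (int d = 0; d < dim; d++) centroid[d] /= members.size();

        // Sort member indices by squared distance to the centroid, nearest first
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < members.size(); i++) indices.add(i);
        indices.sort(Comparator.comparingDouble(i -> squaredDist(members.get(i), centroid)));

        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    private static double squaredDist(float[] v, double[] c) {
        double sum = 0;
        for (int d = 0; d < v.length; d++) {
            double diff = v[d] - c[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<float[]> members = List.of(
            new float[]{0, 0}, new float[]{10, 10}, new float[]{1, 1}, new float[]{2, 2});
        System.out.println(nearestToCentroid(members, 2));
    }
}
```

Samples near the centroid tend to be the most typical phrasing in a cluster, which is what the naming prompt needs; squared distance is enough here because only the ordering matters.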

Step 3: Stratified sampling and hard-sample mining

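import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

/**
 * Smart sampling service.
 *
 * Goal: build a representative test set that gives extra coverage to hard samples.
 */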
@Service
@RequiredArgsConstructor
@Slf4j
public class StratifiedSamplingService {

    private final ChatClient analysisClient;
    private final Random random = new Random(42);

    /**
     * Stratified sampling.
     *
     * Samples in proportion to the intent distribution so the test set covers every kind of scenario,
     * and oversamples "hard samples" to make the test set more sensitive to problems.
     */
    public List<SampledInteraction> stratifiedSample(List<IntentCluster> clusters, 
                                                       int targetSize) {
        List<SampledInteraction> result = new ArrayList<>();
        
        // Compute the base quota for each intent (proportional)
        Map<IntentCluster, Integer> quotas = computeQuotas(clusters, targetSize);
        
        for (IntentCluster cluster : clusters) {
            int quota = quotas.getOrDefault(cluster, 0);
            if (quota == 0) continue;
            
            List<CleanedInteraction> items = cluster.getAllItems();
            
            // Identify hard samples (ones the model has handled poorly in the past)
            List<CleanedInteraction> hardSamples = identifyHardSamples(items);
            
            // Hard samples get up to 1/3 of the quota; normal samples get the rest
            int hardQuota = Math.min(quota / 3, hardSamples.size());
            int normalQuota = quota - hardQuota;
            
            // Sample the hard samples
            result.addAll(sampleFrom(hardSamples, hardQuota, cluster.getIntentName(), true));
            
            // Sample the normal samples (excluding everything identified as hard)
            List<CleanedInteraction> normalItems = items.stream()
                .filter(i -> !hardSamples.contains(i))
                .collect(Collectors.toList());
            result.addAll(sampleFrom(normalItems, normalQuota, cluster.getIntentName(), false));
        }
        
        // Shuffle the order
        Collections.shuffle(result, random);
        
        log.info("Sampling done: {} items in total, {} of them hard samples", 
            result.size(), 
            result.stream().filter(SampledInteraction::isHardSample).count());
        
        return result;
    }

    /**
     * Identify hard samples.
     *
     * A sample counts as hard if any of the following holds:
     * 1. Its historical evaluation score is low
     * 2. The user gave negative feedback (clicked "not satisfied")
     * 3. The question is complex (multi-step reasoning, vague wording)
     */
    private List<CleanedInteraction> identifyHardSamples(List<CleanedInteraction> items) {
        return items.stream()
            .filter(item -> {
                // Check for a low score in the evaluation history
                // (assumes metadata values are strings holding a parseable number)
                boolean hadLowScore = item.getMetadata() != null && 
                    item.getMetadata().containsKey("evaluation_score") &&
                    Double.parseDouble(item.getMetadata().get("evaluation_score")) < 0.6;
                
                // Check for negative user feedback
                boolean hadNegativeFeedback = item.getMetadata() != null &&
                    "negative".equals(item.getMetadata().get("user_feedback"));
                
                // Check question complexity (simple heuristic: more than one question mark, or a very long question)
                boolean isComplex = item.getUserInput().chars().filter(c -> c == '？' || c == '?').count() > 1
                    || item.getUserInput().length() > 200;
                
                return hadLowScore || hadNegativeFeedback || isComplex;
            })
            .collect(Collectors.toList());
    }

    private Map<IntentCluster, Integer> computeQuotas(List<IntentCluster> clusters, int total) {
        Map<IntentCluster, Integer> quotas = new HashMap<>();
        int assigned = 0;
        
        // Allocate proportionally first
        for (IntentCluster cluster : clusters) {
            int quota = (int) Math.round(cluster.getPercentage() * total);
            // At least 5 per intent, and at most 30% of the total
            quota = Math.max(5, Math.min(quota, (int)(total * 0.3)));
            quotas.put(cluster, quota);
            assigned += quota;
        }
        
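        // Adjust the total, because the floor, the cap, and rounding can push the sum above or below the target.
        // Simplification: rescale proportionally. After rescaling, the sum can still be off by a few,
        // and small quotas can fall below the floor of 5.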
        final int finalAssigned = assigned;
        if (assigned != total) {
            quotas.replaceAll((k, v) -> (int) Math.round((double) v * total / finalAssigned));
        }
        
        return quotas;
    }

    private List<SampledInteraction> sampleFrom(List<CleanedInteraction> items, 
                                                  int count, 
                                                  String intentName,
                                                  boolean isHard) {
        List<CleanedInteraction> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);
        
        return shuffled.stream()
            .limit(count)
            .map(item -> SampledInteraction.builder()
                .interaction(item)
                .intentLabel(intentName)
                .isHardSample(isHard)
                .build())
            .collect(Collectors.toList());
    }
}
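
The proportional rescale in computeQuotas is a simplification, and rounding can leave the sum slightly off the target. One common alternative is largest-remainder allocation, sketched below under the assumption that the per-intent floor and cap are handled separately. QuotaSketch and allocate are hypothetical names, not part of the original service.

```java
import java.util.Arrays;

public class QuotaSketch {

    /**
     * Largest-remainder allocation: quotas are proportional to the weights
     * and always sum to exactly total.
     */
    public static int[] allocate(double[] weights, int total) {
        int n = weights.length;
        if (n == 0) return new int[0];
        double sum = Arrays.stream(weights).sum();
        if (sum <= 0) throw new IllegalArgumentException("weights must sum to a positive value");

        int[] quotas = new int[n];
        double[] remainders = new double[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            double exact = weights[i] / sum * total;
            quotas[i] = (int) Math.floor(exact);   // integer part is granted up front
            remainders[i] = exact - quotas[i];     // fractional part competes for leftovers
            assigned += quotas[i];
        }

        // Hand the leftover units to the entries with the largest fractional parts
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(remainders[b], remainders[a]));
        for (int j = 0; j < total - assigned; j++) {
            quotas[order[j % n]]++;
        }
        return quotas;
    }

    public static void main(String[] args) {
        // Cluster sizes 5, 3 and 1, target test-set size 20
        System.out.println(Arrays.toString(allocate(new double[]{5, 3, 1}, 20))); // prints [11, 7, 2]
    }
}
```

Passing raw cluster counts as weights avoids depending on pre-computed percentages. A floor such as "at least 5 per intent" could be reserved first and the remaining budget allocated this way.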

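Step 4: An engineering approach to generating reference answers automatically

Manual annotation is expensive, so a semi-automatic approach helps: have a strong model generate candidate answers first, then have humans review and revise them.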

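import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

/**
 * Semi-automatic annotation service.
 *
 * Workflow:
 * 1. Use a strong model (GPT-4o) to generate candidate reference answers
 * 2. Annotators only review and revise; they do not write answers from scratch
 * 3. Cuts annotation cost substantially (in our experience, reviewing takes about 1/4 of the time of writing from scratch)
 */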
@Service
@RequiredArgsConstructor
public class SemiAutomaticAnnotationService {

    private final ChatClient strongModelClient; // GPT-4o or a comparably strong model
    private final KnowledgeBaseService knowledgeBase;

    /**
     * Generate candidate reference answers for the sampled interactions.
     */
    public List<AnnotationTask> generateAnnotationTasks(List<SampledInteraction> samples) {
        return samples.stream()
            .map(sample -> {
                String question = sample.getInteraction().getUserInput();
                
                // Retrieve relevant context from the knowledge base
                List<String> contexts = knowledgeBase.search(question, 5);
                
                // Generate the candidate answer
                String candidateAnswer = generateCandidateAnswer(question, contexts);
                
                return AnnotationTask.builder()
                    .id(UUID.randomUUID().toString())
                    .question(question)
                    .contexts(contexts)
                    .candidateAnswer(candidateAnswer)
                    .intentLabel(sample.getIntentLabel())
                    .isHardSample(sample.isHardSample())
                    .status(AnnotationStatus.PENDING_REVIEW)
                    .build();
            })
            .collect(Collectors.toList());
    }

    private String generateCandidateAnswer(String question, List<String> contexts) {
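        String contextText = contexts.isEmpty() ? "" : 
            "Reference material:\n" + String.join("\n---\n", contexts) + "\n\n";
        
        String prompt = contextText + 
            "Based on the material above, write an accurate and complete reference answer to the question below.\n" +
            "Requirement: rely only on the material provided and do not add uncertain information.\n\n" +
            "Question: " + question;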
        
        return strongModelClient.prompt().user(prompt).call().content();
    }
}

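Pitfalls we hit

Pitfall 1: Get the test set's time distribution right

If your system is iterating continuously, a test set built from logs that are six months old measures the question distribution of six months ago, which may be far from what the business looks like today. Consider replacing 20-30% of the test set with fresh logs every one to two months.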
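
A minimal sketch of such a rolling refresh: drop the oldest share of the test set and fill the gap from newly sampled logs. The Sample record, the refresh method, and the choice to retire the oldest samples first are assumptions for illustration; the text itself only prescribes the 20-30% ratio.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TestSetRefreshSketch {

    /** Hypothetical test-set entry: an id plus the date the underlying log was recorded. */
    public record Sample(String id, LocalDate loggedOn) {}

    /**
     * Replaces the oldest `fraction` of `current` with the same number of
     * samples taken from the front of `fresh`. The test-set size is preserved.
     */
    public static List<Sample> refresh(List<Sample> current, List<Sample> fresh, double fraction) {
        int replaceCount = (int) Math.round(current.size() * fraction);
        // Never replace more than we have on either side
        replaceCount = Math.max(0, Math.min(replaceCount, Math.min(current.size(), fresh.size())));

        List<Sample> sorted = new ArrayList<>(current);
        sorted.sort(Comparator.comparing(Sample::loggedOn)); // oldest first

        List<Sample> result = new ArrayList<>(sorted.subList(replaceCount, sorted.size()));
        result.addAll(fresh.subList(0, replaceCount));
        return result;
    }

    public static void main(String[] args) {
        List<Sample> current = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            current.add(new Sample("old-" + i, LocalDate.of(2025, 10, 1).plusDays(i)));
        }
        List<Sample> fresh = List.of(
            new Sample("new-0", LocalDate.of(2026, 4, 1)),
            new Sample("new-1", LocalDate.of(2026, 4, 2)));
        refresh(current, fresh, 0.2).forEach(s -> System.out.println(s.id()));
    }
}
```

In practice the fresh pool would itself come out of the cleaning, clustering, and stratified sampling steps, so the refreshed portion follows the current intent distribution.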

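Pitfall 2: Deduplicate at the semantic level, not the string level

"How do I get a refund" and "How do I go through the refund procedure" are the same question, yet the strings are completely different. If the test set contains many semantically duplicated questions, those scenarios are over-weighted and the model's performance on them is overstated. Use embeddings for semantic deduplication and keep only one question out of any group whose similarity is above 0.95.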

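A minimal sketch of embedding-based deduplication with the 0.95 threshold, assuming cosine similarity as the similarity measure and embeddings that have already been computed. The greedy keep-first strategy and the names SemanticDedupSketch and dedup are illustrative assumptions, not the author's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SemanticDedupSketch {

    /** Cosine similarity of two vectors of equal length; 0 if either vector is all zeros. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Greedy deduplication: walk the list in order and keep an item only if its
     * similarity to every item kept so far is at most the threshold.
     * Returns the indices of the kept items.
     */
    public static List<Integer> dedup(List<float[]> embeddings, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < embeddings.size(); i++) {
            boolean duplicate = false;
            for (int j : kept) {
                if (cosine(embeddings.get(i), embeddings.get(j)) > threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<float[]> embeddings = List.of(
            new float[]{1f, 0f},        // first question
            new float[]{0.99f, 0.05f},  // near-duplicate of the first
            new float[]{0f, 1f});       // unrelated question
        System.out.println(dedup(embeddings, 0.95));
    }
}
```

The pairwise comparison is quadratic in the number of kept items, which is usually acceptable for a test set of a few thousand samples; larger pools would need an approximate nearest-neighbor index.
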
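Pitfall 3: Watch out for test set contamination

If you build the test set from production logs that have already been used for fine-tuning, the test set has leaked: the model has seen these questions, and the results are inflated. Make sure the test set and the training set are separated in time.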
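
A minimal sketch of time-based isolation: only logs recorded strictly after the training-data cutoff are eligible for the test set. The LogEntry record, the eligibleForTest method, and the idea of a single cutoff timestamp are assumptions for illustration.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class TemporalSplitSketch {

    /** Hypothetical log entry: an id plus the time the interaction was recorded. */
    public record LogEntry(String id, Instant timestamp) {}

    /** Keeps only the entries logged strictly after the training-data cutoff. */
    public static List<LogEntry> eligibleForTest(List<LogEntry> logs, Instant trainingCutoff) {
        return logs.stream()
            .filter(e -> e.timestamp().isAfter(trainingCutoff))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant cutoff = Instant.parse("2026-01-01T00:00:00Z");
        List<LogEntry> logs = List.of(
            new LogEntry("a", Instant.parse("2025-12-31T23:59:59Z")),
            new LogEntry("b", Instant.parse("2026-01-01T00:00:00Z")),
            new LogEntry("c", Instant.parse("2026-02-15T08:30:00Z")));
        eligibleForTest(logs, cutoff).forEach(e -> System.out.println(e.id()));
    }
}
```

A time cutoff alone does not catch recurring questions that appear on both sides of the boundary, so it pairs naturally with a semantic similarity check against the training set.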