Article 2161: Active Learning for LLM Annotation: Getting the Most Valuable Data with the Least Human Effort

Lao Zhang · 2026/4/30 · about 7 min

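Audience: AI engineers who need to build training datasets efficiently | Reading time: about 17 minutes | Core value: use active learning strategies to maximize annotation efficiency, so that a limited labeling budget yields the largest model improvement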

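The annotation budget is only 10,000 samples, but we have 1,000,000 candidates.

Randomly draw 10,000 and label them? That is the simplest approach, but not the best.

The problem: most of the 1,000,000 samples are "easy", meaning the model already handles them correctly, so labeling them barely improves the model. The valuable ones are the samples the model is "uncertain" about or "likely to get wrong".

Active Learning addresses exactly this: instead of sampling at random, it deliberately selects the most valuable samples to label.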

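Core strategies of active learning

Strategy 1: Uncertainty Sampling
→ Select the samples the model is least certain about
→ For an LLM: select the samples whose repeated generations differ the most

Strategy 2: Diversity Sampling
→ Select samples that cover as many "new regions" as possible
→ Avoid labeling near-duplicate samples over and over

Strategy 3: Expected Model Change
→ Select the samples whose labels would change the model the most
→ Expensive to compute, so approximations are normally used

Strategy 4: Error Prediction
→ Select the samples the model is most likely to get wrong
→ Use known error patterns to predict which new samples will fail

In LLM settings we usually combine Strategy 1 and Strategy 2.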
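When the task output is a short label (a class name, yes/no, a category), "repeated generations differ the most" can be approximated without any embedding calls by counting how the sampled outputs split. The sketch below is only an illustration of that idea, not the method used in the service that follows (which compares embeddings); the class name `VoteEntropy` and the trim/lower-case normalization are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: disagreement among sampled outputs, measured as normalized vote entropy. */
final class VoteEntropy {

    /**
     * Returns a score in [0, 1]: 0 when all outputs agree, 1 when every output is different.
     * Outputs are compared after trimming and lower-casing (an assumed normalization).
     */
    static double score(List<String> outputs) {
        int n = outputs.size();
        if (n < 2) return 0.0; // a single output carries no disagreement signal

        // Count how many times each distinct answer was produced
        Map<String, Integer> counts = new HashMap<>();
        for (String o : outputs) {
            counts.merge(o.trim().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }

        // Shannon entropy of the empirical answer distribution
        double entropy = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            entropy -= p * Math.log(p);
        }
        // Divide by log(n), the largest entropy n outputs can produce
        return entropy / Math.log(n);
    }

    public static void main(String[] args) {
        System.out.println(score(List.of("positive", "positive", "positive")));
        System.out.println(score(List.of("positive", "negative", "neutral")));
    }
}
```

For free-form text, exact matching treats every paraphrase as a different answer, which is why a semantic comparison is the safer choice there.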

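Active learning implementation

import java.util.*;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

// LlmSamplingService, EmbeddingModel, AnnotatedDataRepository, CandidateSample, AnnotatedSample,
// ScoredSample and ActiveLearningStrategy are types whose definitions are not shown in this article.

/**
 * LLM active learning service.
 *
 * Picks the most valuable samples for annotation out of a large candidate pool.
 */
@Service
@RequiredArgsConstructor
@Slf4j
public class ActiveLearningService {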

    private final LlmSamplingService samplingService;
    private final EmbeddingModel embeddingModel;
    private final AnnotatedDataRepository annotatedRepository;

    /**
     * Actively select the most valuable samples.
     *
     * @param candidates candidate sample pool
     * @param budgetSize annotation budget (how many samples to select)
     * @param strategy   selection strategy
     * @return the selected samples
     */
    public List<CandidateSample> selectForAnnotation(List<CandidateSample> candidates,
                                                      int budgetSize,
                                                      ActiveLearningStrategy strategy) {
        log.info("Active learning sampling: pool={}, target={}, strategy={}",
            candidates.size(), budgetSize, strategy);
        
        return switch (strategy) {
            case UNCERTAINTY -> selectByUncertainty(candidates, budgetSize);
            case DIVERSITY -> selectByDiversity(candidates, budgetSize);
            case HYBRID -> selectByHybrid(candidates, budgetSize);
            case ERROR_PREDICTION -> selectByErrorPrediction(candidates, budgetSize);
        };
    }

    /**
     * Uncertainty sampling.
     *
     * Run the LLM several times per candidate (temperature above 0) and measure
     * how consistent the outputs are; inconsistent outputs mean the model is "uncertain".
     */
    private List<CandidateSample> selectByUncertainty(List<CandidateSample> candidates, int k) {
        log.info("Computing uncertainty scores...");

        // Scoring is expensive (several LLM calls per sample), so only a random subset is scored:
        // at most 5 * k candidates, from which the top k are kept
        int evalSize = Math.min(candidates.size(), k * 5);
        List<CandidateSample> evalCandidates = randomSample(candidates, evalSize);

        List<ScoredSample> scoredSamples = evalCandidates.parallelStream()
            .map(candidate -> {
                double uncertainty = computeUncertainty(candidate);
                return new ScoredSample(candidate, uncertainty);
            })
            .collect(Collectors.toList());

        // Most uncertain first
        scoredSamples.sort(Comparator.comparingDouble(ScoredSample::getScore).reversed());

        return scoredSamples.stream()
            .limit(k)
            .map(ScoredSample::getSample)
            .collect(Collectors.toList());
    }

    /**
     * Compute the uncertainty of a single sample.
     *
     * Method: sample several times (temperature > 0) and compare the outputs semantically.
     * Lower similarity means higher uncertainty.
     */
    private double computeUncertainty(CandidateSample candidate) {
        int numSamples = 5; // sample 5 times
        List<String> outputs = samplingService.sampleMultiple(
            candidate.getInput(), numSamples, 0.8 // temperature 0.8 to get varied outputs
        );

        // Fewer than two outputs cannot be compared; fall back to a neutral score
        if (outputs.size() < 2) return 0.5;

        // Embed every output, then compare all pairs
        List<float[]> embeddings = outputs.stream()
            .map(o -> embeddingModel.embed(o))
            .collect(Collectors.toList());

        double totalSimilarity = 0;
        int pairs = 0;

        for (int i = 0; i < embeddings.size(); i++) {
            for (int j = i + 1; j < embeddings.size(); j++) {
                totalSimilarity += cosineSimilarity(embeddings.get(i), embeddings.get(j));
                pairs++;
            }
        }

        double avgSimilarity = pairs > 0 ? totalSimilarity / pairs : 1.0;

        // Uncertainty = 1 - consistency (average pairwise similarity)
        return 1.0 - avgSimilarity;
    }

    /**
     * Diversity sampling.
     *
     * Greedily maximizes the minimum distance (a simplified version of the Core-Set method)
     * so that the selected samples cover different regions of the candidate space.
     */
    private List<CandidateSample> selectByDiversity(List<CandidateSample> candidates, int k) {
        log.info("Computing diversity sampling...");

        // Embed every candidate; if an ID appears twice, keep the first embedding
        // (without a merge function, Collectors.toMap throws on duplicate keys)
        Map<String, float[]> embeddings = candidates.stream()
            .collect(Collectors.toMap(
                CandidateSample::getId,
                c -> embeddingModel.embed(c.getInput()),
                (first, duplicate) -> first
            ));

        // Embeddings of already-annotated data (we want samples that differ from these)
        List<float[]> annotatedEmbeddings = annotatedRepository.findRecent(1000).stream()
            .map(a -> embeddingModel.embed(a.getInput()))
            .collect(Collectors.toList());

        List<CandidateSample> selected = new ArrayList<>();
        Set<String> selectedIds = new HashSet<>();

        for (int i = 0; i < k; i++) {
            // For each candidate, compute its minimum distance to the selected and annotated samples,
            // then take the candidate whose minimum distance is largest
            CandidateSample bestCandidate = null;
            double maxMinDist = Double.NEGATIVE_INFINITY;

            for (CandidateSample candidate : candidates) {
                if (selectedIds.contains(candidate.getId())) continue;

                float[] embedding = embeddings.get(candidate.getId());
                double minDist = computeMinDistance(embedding, selected, embeddings, annotatedEmbeddings);

                if (minDist > maxMinDist) {
                    maxMinDist = minDist;
                    bestCandidate = candidate;
                }
            }

            if (bestCandidate != null) {
                selected.add(bestCandidate);
                selectedIds.add(bestCandidate.getId());
            }
        }

        return selected;
    }

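    /**
     * Hybrid strategy: uncertainty + diversity.
     *
     * First keep the 3 * k most uncertain candidates, then run diversity sampling on them.
     * This balances "the model is uncertain" with "covers varied scenarios".
     */
    private List<CandidateSample> selectByHybrid(List<CandidateSample> candidates, int k) {
        // Step 1: uncertainty filter (keep the 3 * k most uncertain candidates, capped at the pool size)
        int uncertaintyFilterSize = Math.min(candidates.size(), k * 3);
        List<CandidateSample> uncertaintyCandidates = selectByUncertainty(candidates, uncertaintyFilterSize);

        // Step 2: diversity sampling among the uncertain candidates
        return selectByDiversity(uncertaintyCandidates, k);
    }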

    /**
     * Error prediction sampling.
     *
     * Uses the embeddings of historical error samples to find the most similar unlabeled samples,
     * which are likely to trigger similar errors.
     */
    private List<CandidateSample> selectByErrorPrediction(List<CandidateSample> candidates, int k) {
        // Fetch historical samples with low scores
        List<AnnotatedSample> errorSamples = annotatedRepository.findLowScoreSamples(0.5, 200);

        if (errorSamples.isEmpty()) {
            log.warn("No historical error samples, falling back to uncertainty sampling");
            return selectByUncertainty(candidates, k);
        }

        // Average embedding of the historical error samples (the error "centroid")
        float[] errorCentroid = computeCentroid(
            errorSamples.stream()
                .map(s -> embeddingModel.embed(s.getInput()))
                .collect(Collectors.toList())
        );

        // Pick the candidates closest to the error centroid
        return candidates.stream()
            .map(c -> {
                float[] emb = embeddingModel.embed(c.getInput());
                double similarity = cosineSimilarity(emb, errorCentroid);
                return new ScoredSample(c, similarity);
            })
            .sorted(Comparator.comparingDouble(ScoredSample::getScore).reversed())
            .limit(k)
            .map(ScoredSample::getSample)
            .collect(Collectors.toList());
    }

    private double computeMinDistance(float[] embedding,
                                       List<CandidateSample> selected,
                                       Map<String, float[]> allEmbeddings,
                                       List<float[]> annotatedEmbeddings) {
        double minDist = Double.MAX_VALUE;

        // Distance to the samples selected so far
        for (CandidateSample sel : selected) {
            float[] selEmb = allEmbeddings.get(sel.getId());
            if (selEmb != null) {
                double dist = 1.0 - cosineSimilarity(embedding, selEmb); // distance = 1 - similarity
                minDist = Math.min(minDist, dist);
            }
        }

        // Distance to already-annotated data (avoid regions that are already well covered)
        for (float[] annEmb : annotatedEmbeddings) {
            double dist = 1.0 - cosineSimilarity(embedding, annEmb);
            minDist = Math.min(minDist, dist);
        }

        // Nothing to compare against yet: treat every candidate as equally far away
        return minDist == Double.MAX_VALUE ? 1.0 : minDist;
    }

    private float[] computeCentroid(List<float[]> embeddings) {
        if (embeddings.isEmpty()) return new float[0];
        int dim = embeddings.get(0).length;
        float[] centroid = new float[dim];
        for (float[] emb : embeddings) {
            for (int i = 0; i < dim; i++) centroid[i] += emb[i];
        }
        for (int i = 0; i < dim; i++) centroid[i] /= embeddings.size();
        return centroid;
    }

    private double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            dot += a[i] * b[i]; normA += a[i] * a[i]; normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-8);
    }

    private List<CandidateSample> randomSample(List<CandidateSample> candidates, int k) {
        List<CandidateSample> shuffled = new ArrayList<>(candidates);
        Collections.shuffle(shuffled, new Random(42));
        return shuffled.subList(0, Math.min(k, shuffled.size()));
    }
}
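The diversity step above recomputes, for every candidate in every round, its distance to all selected and annotated samples. A common way to cut that work is to cache each candidate's current minimum distance and update it only against the newly chosen sample. The following is a minimal, self-contained sketch of that variant on plain `float[]` vectors; the class `GreedyKCenter` and its method names are hypothetical and not part of the service above, and it assumes all vectors have the same length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: greedy max-min (k-center) selection with cached nearest distances. */
final class GreedyKCenter {

    /** Cosine distance = 1 - cosine similarity; the epsilon guards against zero vectors. */
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-8);
    }

    /**
     * Returns up to k indices into pool. Each pick is the point farthest from
     * everything covered so far; coverage starts with the annotated vectors.
     */
    static List<Integer> select(List<float[]> pool, List<float[]> annotated, int k) {
        int n = pool.size();
        double[] minDist = new double[n];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);

        // Seed the cache with distances to already-annotated data
        for (float[] ann : annotated) {
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], cosineDistance(pool.get(i), ann));
            }
        }

        boolean[] taken = new boolean[n];
        List<Integer> chosen = new ArrayList<>();
        while (chosen.size() < Math.min(k, n)) {
            // Pick the untaken point with the largest cached minimum distance
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (!taken[i] && (best < 0 || minDist[i] > minDist[best])) best = i;
            }
            taken[best] = true;
            chosen.add(best);

            // Only the new pick can shrink a cached distance
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], cosineDistance(pool.get(i), pool.get(best)));
            }
        }
        return chosen;
    }
}
```

With n candidates, m annotated vectors and k picks this needs roughly n × (m + k) distance computations, compared with a number that grows with k² in the nested-loop version.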

Evaluating the effect of active learning

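import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

/**
 * Active learning effectiveness evaluation.
 *
 * Compares the efficiency of active learning against random sampling.
 */
@Service
@RequiredArgsConstructor
public class ActiveLearningEvaluator {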

    /**
     * Learning curve analysis.
     *
     * Compares model quality under active learning vs. random sampling at several annotation budgets.
     */
    public LearningCurveReport analyzeLearningCurve(
            List<CandidateSample> activeSamples,  // selected by active learning
            List<CandidateSample> randomSamples,   // selected at random (control group)
            int[] annotationBudgets) {             // e.g. [100, 200, 500, 1000]

        List<LearningCurvePoint> activeCurve = new ArrayList<>();
        List<LearningCurvePoint> randomCurve = new ArrayList<>();

        for (int budget : annotationBudgets) {
            // Evaluate both strategies under the same budget
            double activeScore = evaluateWithBudget(activeSamples, budget);
            double randomScore = evaluateWithBudget(randomSamples, budget);
            
            activeCurve.add(new LearningCurvePoint(budget, activeScore));
            randomCurve.add(new LearningCurvePoint(budget, randomScore));
        }
        
        return LearningCurveReport.builder()
            .activeLearningCurve(activeCurve)
            .randomSamplingCurve(randomCurve)
            .efficiencyGain(computeEfficiencyGain(activeCurve, randomCurve))
            .build();
    }
    
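    /**
     * Compute the efficiency gain.
     *
     * If active learning needs 200 samples to match what random sampling achieves with 500,
     * the efficiency gain is 2.5x.
     */
    private double computeEfficiencyGain(List<LearningCurvePoint> active,
                                          List<LearningCurvePoint> random) {
        if (active.isEmpty() || random.isEmpty()) return 1.0;

        // Target: 95% of the score active learning reaches at its last (largest) budget
        double threshold = active.get(active.size() - 1).getScore() * 0.95;

        OptionalInt activeBudgetToTarget = active.stream()
            .filter(p -> p.getScore() >= threshold)
            .mapToInt(LearningCurvePoint::getBudget).min();

        OptionalInt randomBudgetToTarget = random.stream()
            .filter(p -> p.getScore() >= threshold)
            .mapToInt(LearningCurvePoint::getBudget).min();

        if (!activeBudgetToTarget.isPresent() || activeBudgetToTarget.getAsInt() <= 0) return 1.0;

        // Random sampling never reached the target within the measured budgets: the gain is
        // unbounded on this range, so report infinity rather than a ratio built on Integer.MAX_VALUE
        if (!randomBudgetToTarget.isPresent()) return Double.POSITIVE_INFINITY;

        return (double) randomBudgetToTarget.getAsInt() / activeBudgetToTarget.getAsInt();
    }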

    private double evaluateWithBudget(List<CandidateSample> samples, int budget) {
        // Stands in for "train and evaluate a model on the first `budget` samples";
        // simplified here to averaging evaluation scores that already exist (missing scores count as 0)
        return samples.subList(0, Math.min(budget, samples.size())).stream()
            .mapToDouble(s -> s.getAnnotationScore() != null ? s.getAnnotationScore() : 0)
            .average().orElse(0);
    }
}
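To make the efficiency-gain definition concrete, here is a small standalone sketch that applies it to two learning curves stored as plain arrays. The scores are made-up illustration values, not measurements from the project, and `EfficiencyGainExample` is a hypothetical class name; unlike the evaluator above it takes the target score directly rather than deriving it from the last point of the active curve.

```java
import java.util.OptionalDouble;
import java.util.OptionalInt;

/** Sketch: efficiency gain = budget random sampling needs / budget active learning needs. */
final class EfficiencyGainExample {

    /** Smallest budget whose score reaches the target, if any. Budgets must be ascending. */
    static OptionalInt budgetToReach(int[] budgets, double[] scores, double target) {
        for (int i = 0; i < budgets.length; i++) {
            if (scores[i] >= target) return OptionalInt.of(budgets[i]);
        }
        return OptionalInt.empty();
    }

    /** Ratio of the two budgets at the same target; empty if either curve never gets there. */
    static OptionalDouble gain(int[] budgets, double[] active, double[] random, double target) {
        OptionalInt a = budgetToReach(budgets, active, target);
        OptionalInt r = budgetToReach(budgets, random, target);
        if (!a.isPresent() || !r.isPresent()) return OptionalDouble.empty();
        return OptionalDouble.of((double) r.getAsInt() / a.getAsInt());
    }

    public static void main(String[] args) {
        int[] budgets = {100, 200, 500, 1000};
        double[] active = {0.70, 0.80, 0.84, 0.86}; // made-up scores
        double[] random = {0.62, 0.71, 0.80, 0.84}; // made-up scores

        // Active learning reaches 0.80 at 200 samples, random sampling at 500: 500 / 200
        System.out.println(gain(budgets, active, random, 0.80).getAsDouble()); // prints 2.5
    }
}
```

Because the curves are only measured at a few budgets, the ratio is coarse; adding more budget points between 200 and 500 would tighten it.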

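Efficiency gains in practice

Based on actual data from our project:

Fine-tuning on 300 samples chosen by active learning performed on par with fine-tuning on 750 randomly sampled ones. That is a 2.5x efficiency gain: annotation cost dropped by 60% (1 - 300/750) with no loss in quality.

The scenario that worked best was error prediction sampling. Once we had a large amount of historical annotated data, using error prediction to find the new samples the model was most likely to get wrong gave a far higher ROI than random labeling.

Note: active learning adds engineering complexity (the LLM has to be run for uncertainty estimation) and therefore extra cost. As a rule of thumb, its benefits only start to cover that extra cost at annotation budgets of roughly 1,000 samples and above.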
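One way to check whether a given budget clears that bar is a back-of-the-envelope cost model. The sketch below is an assumption-heavy illustration: the class `ActiveLearningBreakEven`, the idea of a fixed engineering cost, and every price in the usage note are hypothetical. Only the call count (up to 5 × budget candidates scored, 5 samples each) comes from the uncertainty sampling code earlier in the article.

```java
/** Sketch: net benefit of active learning over random sampling for one selection round. */
final class ActiveLearningBreakEven {

    /**
     * @param budget               number of samples that will actually be labeled
     * @param efficiencyGain       e.g. 2.5 if 300 selected samples match 750 random ones
     * @param costPerLabel         assumed human labeling cost per sample
     * @param costPerLlmCall       assumed cost of one LLM generation
     * @param fixedEngineeringCost assumed one-off cost of building and running the pipeline
     * @return savings minus extra cost, in the same currency unit as the inputs
     */
    static double netBenefit(int budget, double efficiencyGain, double costPerLabel,
                             double costPerLlmCall, double fixedEngineeringCost) {
        // Labels random sampling would have needed for the same quality, minus the ones we label
        double labelsSaved = budget * (efficiencyGain - 1.0);

        // Uncertainty scoring: up to 5 * budget candidates, 5 generations per candidate
        // (assumes the candidate pool holds at least 5 * budget samples)
        double llmCalls = budget * 5.0 * 5.0;

        return labelsSaved * costPerLabel - llmCalls * costPerLlmCall - fixedEngineeringCost;
    }

    public static void main(String[] args) {
        // Made-up prices: 2.0 per label, 0.01 per LLM call, 2000 fixed cost
        System.out.println(netBenefit(300, 2.5, 2.0, 0.01, 2000.0));
        System.out.println(netBenefit(1000, 2.5, 2.0, 0.01, 2000.0));
    }
}
```

With these made-up prices the 300-sample budget comes out negative and the 1,000-sample budget positive; different prices or a different efficiency gain will move the break-even point, so the numbers should be replaced with your own before drawing conclusions.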