Part 2215: Multimodal Data Annotation Engineering - Annotation Workflow and Quality Control for Image-Text Pair Data

Lao Zhang · 2026/4/30 · about 14 min read


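Intended audience: AI engineers, data engineers, Java developers building multimodal applications | Reading time: about 18 minutes | Key takeaway: build an enterprise-grade image-text annotation pipeline from scratch and master the core quality-control methods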

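Last year our team took on an e-commerce visual understanding project. The goal was a model that could understand product images and generate accurate descriptions. The product manager said, "Isn't it just getting people to add a few labels? One week, done."

Three weeks later the first batch of annotated data came back. I opened it and nearly passed out from frustration: for the same photo of a dress, three annotators had written "red floral dress", "rose-print dress", and "summer polka-dot dress". The color judgments disagreed, the pattern descriptions pointed in opposite directions, and some annotations even had the product ID written into the description field.

The model fine-tuned on this batch performed worse than the base model.

That was when I realized that multimodal data annotation is not "getting people to add labels". It is an engineering system that has to be designed with care.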

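Why multimodal annotation is much harder than text-only annotation

Text-only annotation has relatively objective standards. Whether a passage carries "positive" or "negative" sentiment, annotators usually do not disagree much. Image-text pair annotation, by contrast, is inherently subjective:

Source of ambiguity 1: differences in visual perception. For the same image, color perception varies from person to person. What is "orange" to some people may be "orange-red" to others.

Source of ambiguity 2: inconsistent description granularity. Faced with a city street scene, one annotator writes "pedestrians on a busy street" and another writes "Sanlitun, Beijing at night, crowds surging under neon lights". The amount of information differs by roughly a factor of three.

Source of ambiguity 3: the domain-knowledge barrier. Medical image annotation requires a professional background, and the annotator's level of knowledge directly affects annotation quality.

Source of ambiguity 4: multimodal alignment. The correspondence between image content and text is itself a subjective judgment; "which object in the image does this sentence describe" has no single right answer.

Together these factors push inter-annotator agreement (IAA) for image-text pair data well below that of text-only tasks. Reaching a Cohen's Kappa of 0.6 is already quite good.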
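For readers who have not met the metric before, Cohen's Kappa adjusts the observed agreement P_o for the agreement P_e that two annotators would reach by chance alone. The numbers in the worked line are illustrative only, not measurements from our project:

```latex
\kappa = \frac{P_o - P_e}{1 - P_e},
\qquad
P_o = 0.80,\ P_e = 0.50
\;\Rightarrow\;
\kappa = \frac{0.80 - 0.50}{1 - 0.50} = 0.60
```

So two annotators who agree on 80% of items only reach a Kappa of 0.6 if half of that agreement could have happened by chance, which is why 0.6 is a harder bar than it sounds.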

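The annotation pipeline at a glance

Before digging into each stage, start with the overall pipeline.

The pipeline rests on a few key design principles:

Dual review: every data item gets at least two independent annotation results
Arbitration: disagreements go to an expert arbiter rather than a simple majority vote
Golden-standard alignment: annotators are checked regularly against known correct answers
Data flywheel: samples the model handles poorly are prioritized for additional annotation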
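Dual review and arbitration imply that a task moves through a small lifecycle. The sketch below is one possible way to write that lifecycle down as a transition table. The status names mirror the ones used by the scheduler code later in this article; `PENDING`, the `TaskStateMachine` class, and the exact set of allowed transitions are assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical lifecycle states; PENDING is an assumed initial state. */
enum TaskStatus {
    PENDING, IN_PROGRESS, SUBMITTED, ARBITRATION_REQUIRED, COMPLETED
}

/** Minimal transition table: which status may follow which. */
final class TaskStateMachine {

    private static final Map<TaskStatus, Set<TaskStatus>> ALLOWED = new EnumMap<>(TaskStatus.class);

    static {
        ALLOWED.put(TaskStatus.PENDING, EnumSet.of(TaskStatus.IN_PROGRESS));
        // A task is either submitted or handed back to the pool.
        ALLOWED.put(TaskStatus.IN_PROGRESS, EnumSet.of(TaskStatus.SUBMITTED, TaskStatus.PENDING));
        // After the consistency check: merge directly or escalate to an arbiter.
        ALLOWED.put(TaskStatus.SUBMITTED, EnumSet.of(TaskStatus.COMPLETED, TaskStatus.ARBITRATION_REQUIRED));
        ALLOWED.put(TaskStatus.ARBITRATION_REQUIRED, EnumSet.of(TaskStatus.COMPLETED));
        ALLOWED.put(TaskStatus.COMPLETED, EnumSet.noneOf(TaskStatus.class));
    }

    static boolean canTransition(TaskStatus from, TaskStatus to) {
        return ALLOWED.get(from).contains(to);
    }

    /** Returns the new status, or throws if the move is not in the table. */
    static TaskStatus transition(TaskStatus from, TaskStatus to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException("Illegal transition: " + from + " -> " + to);
        }
        return to;
    }
}
```

Keeping the table in one place makes it harder for a service method to push a task into a state that skips the consistency check.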

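Annotation task design: the Annotation Schema

Good annotation starts with a good Schema. Image-text pair annotation tasks usually fall into the following types:

Type 1: Caption Generation. Annotators describe the image content in natural language. The hard part is controlling the granularity and style of the description.

Type 2: Visual Question Answering (VQA). Given an image and a question, annotate the correct answer. The hard part is question design and making sure the answer is unique.

Type 3: Relevance Scoring. Judge how relevant a text description is to the image, usually on a 1-5 scale.

Type 4: Region-Level Annotation. Draw a box around a specific region of the image and label it, for object detection and visual grounding tasks.

Below is a Schema definition for the e-commerce scenario: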

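import com.fasterxml.jackson.annotation.JsonInclude;
import lombok.Builder;
import lombok.Data;

import java.util.List;

/**
 * Annotation Schema for e-commerce image-text pairs.
 * Standardizes the structure of image-text annotation data.
 */
@Data
@Builder
@JsonInclude(JsonInclude.Include.NON_NULL)
public class AnnotationSchema {

    /** Unique ID of the annotation task */
    private String taskId;

    /** Image metadata */
    private ImageMetadata imageMetadata;

    /** Main caption annotation */
    private CaptionAnnotation captionAnnotation;

    /** List of attribute annotations */
    private List<AttributeAnnotation> attributes;

    /** Image-text relevance score */
    private RelevanceScore relevanceScore;

    /** Annotation meta information */
    private AnnotationMeta annotationMeta;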

    @Data
    @Builder
    public static class ImageMetadata {
        private String imageId;
        private String imageUrl;
        private String imageMd5;      // used for deduplication
        private Integer width;
        private Integer height;
        private String format;        // jpg/png/webp
        private Long fileSizeBytes;
    }

    @Data
    @Builder
    public static class CaptionAnnotation {
        /** Short caption, 20-50 characters */
        private String shortCaption;
        /** Detailed caption, 100-200 characters */
        private String detailedCaption;
        /** Caption language */
        private String language;
        /** Caption style: neutral/marketing/technical */
        private String style;
    }

    @Data
    @Builder
    public static class AttributeAnnotation {
        /** Attribute name: color/material/pattern/category */
        private String attributeName;
        /** Annotated value */
        private String value;
        /** Confidence, 0-1 */
        private Double confidence;
        /** Whether the attribute is visible (can be seen directly in the image) */
        private Boolean isVisible;
    }

    @Data
    @Builder
    public static class RelevanceScore {
        /** Image-text relevance score, 1-5 */
        private Integer score;
        /** Reason for irrelevance (required when score <= 2) */
        private String irrelevanceReason;
    }

    @Data
    @Builder
    public static class AnnotationMeta {
        private String annotatorId;
        private Long annotationStartTime;
        private Long annotationEndTime;
        private Integer revisionCount;   // number of revisions; a high value suggests a hard task
        private String annotatorComment; // annotator's notes
    }
}
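The comments in the Schema state rules that Lombok will not enforce on its own: the score is 1-5, `irrelevanceReason` is required when the score is 2 or lower, and confidence lies in [0, 1]. A minimal sketch of how those rules might be checked is shown below. It takes plain values rather than the Lombok classes so that it runs standalone; `SchemaRuleChecker` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker for the rules written in the Schema comments. */
final class SchemaRuleChecker {

    /** Score must be 1-5; a reason is required when the score is 2 or lower. */
    static List<String> checkRelevance(Integer score, String irrelevanceReason) {
        List<String> errors = new ArrayList<>();
        if (score == null || score < 1 || score > 5) {
            errors.add("score must be between 1 and 5");
            return errors; // the reason rule is meaningless without a valid score
        }
        if (score <= 2 && (irrelevanceReason == null || irrelevanceReason.trim().isEmpty())) {
            errors.add("irrelevanceReason is required when score <= 2");
        }
        return errors;
    }

    /** Confidence is optional, but when present it must lie in [0, 1]. */
    static List<String> checkConfidence(Double confidence) {
        List<String> errors = new ArrayList<>();
        if (confidence != null
                && (confidence.isNaN() || confidence < 0.0 || confidence > 1.0)) {
            errors.add("confidence must be within [0, 1]");
        }
        return errors;
    }
}
```

Checks like these fit naturally next to the format validation at submit time, so that a low score without a reason is rejected before it reaches the consistency check.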

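Engineering the annotation platform

The annotation platform is the core infrastructure of the whole pipeline. Existing platforms such as Scale AI and Labelbox (commercial) or Label Studio (open source) are feature-rich, but for teams with heavy customization needs a self-built platform is more flexible.

Below is the implementation of the core task assignment and status management: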

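import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Annotation task scheduling service.
 * Handles task assignment, status transitions, and load balancing.
 */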
@Service
@Slf4j
public class AnnotationTaskScheduler {

    @Autowired
    private AnnotationTaskRepository taskRepository;

    @Autowired
    private AnnotatorRepository annotatorRepository;

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

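    // Number of independent annotations required per task (double/triple annotation)
    private static final int REQUIRED_ANNOTATIONS_PER_TASK = 2;

    // Maximum number of concurrent tasks per annotator
    private static final int MAX_CONCURRENT_TASKS_PER_ANNOTATOR = 10;

    /**
     * Assigns the next batch of tasks to an annotator.
     * Candidates are filtered by domain expertise and difficulty, then ordered by
     * priority > waiting time > difficulty match.
     */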
    public List<AnnotationTask> assignTasksToAnnotator(String annotatorId, int batchSize) {
        Annotator annotator = annotatorRepository.findById(annotatorId)
                .orElseThrow(() -> new AnnotatorNotFoundException(annotatorId));

        // Check the annotator's current number of in-progress tasks
        int currentLoad = getCurrentActiveTaskCount(annotatorId);
        int availableSlots = MAX_CONCURRENT_TASKS_PER_ANNOTATOR - currentLoad;
        if (availableSlots <= 0) {
            log.warn("Annotator {} has no free task slots, current load: {}", annotatorId, currentLoad);
            return Collections.emptyList();
        }

        int actualBatchSize = Math.min(batchSize, availableSlots);

        // Fetch assignable tasks (excluding ones this annotator has already annotated)
        List<AnnotationTask> candidateTasks = taskRepository.findAvailableTasksForAnnotator(
                annotatorId,
                annotator.getDomainExpertise(),   // filter by domain expertise
                annotator.getDifficultyLevel(),    // match difficulty
                REQUIRED_ANNOTATIONS_PER_TASK,     // annotations required per task
                actualBatchSize * 3                // 3x candidate pool, sorted and trimmed below
        );

        // Sort by priority: urgent tasks > longest-waiting tasks > best difficulty match
        List<AnnotationTask> selectedTasks = candidateTasks.stream()
                .sorted(Comparator
                        .comparing(AnnotationTask::getPriority).reversed()
                        .thenComparing(AnnotationTask::getCreatedAt)
                        .thenComparing(Comparator.comparingDouble(
                                (AnnotationTask t) -> calculateDifficultyMatch(t, annotator)).reversed()))
                .limit(actualBatchSize)
                .collect(Collectors.toList());

        // Atomic assignment: prevent concurrent duplicate assignment
        List<AnnotationTask> assignedTasks = new ArrayList<>();
        for (AnnotationTask task : selectedTasks) {
            boolean acquired = tryAcquireTask(task.getTaskId(), annotatorId);
            if (acquired) {
                task.setStatus(TaskStatus.IN_PROGRESS);
                task.setAssignedAnnotatorId(annotatorId);
                task.setAssignedAt(System.currentTimeMillis());
                taskRepository.save(task);
                assignedTasks.add(task);
            }
        }

        log.info("Assigned {} tasks to annotator {}", assignedTasks.size(), annotatorId);
        return assignedTasks;
    }

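    /**
     * Uses a Redis lock (set-if-absent with a 30-minute TTL) to prevent duplicate assignment.
     * Note: the key is per task, so while it is held no second annotator can claim the
     * same task, even though REQUIRED_ANNOTATIONS_PER_TASK is 2.
     */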
    private boolean tryAcquireTask(String taskId, String annotatorId) {
        String lockKey = "task:lock:" + taskId;
        Boolean acquired = redisTemplate.opsForValue()
                .setIfAbsent(lockKey, annotatorId, Duration.ofMinutes(30));
        return Boolean.TRUE.equals(acquired);
    }

    /**
     * Submits an annotation result and triggers the quality review flow.
     */
    @Transactional
    public AnnotationSubmitResult submitAnnotation(String taskId, String annotatorId,
                                                    AnnotationSchema annotation) {
        AnnotationTask task = taskRepository.findById(taskId)
                .orElseThrow(() -> new TaskNotFoundException(taskId));

        // Verify task ownership
        if (!annotatorId.equals(task.getAssignedAnnotatorId())) {
            throw new UnauthorizedAnnotationException("Task is not assigned to this annotator");
        }

        // Basic format validation
        ValidationResult validationResult = validateAnnotationFormat(annotation);
        if (!validationResult.isValid()) {
            return AnnotationSubmitResult.rejected(validationResult.getErrors());
        }

        // Save the annotation result
        annotation.getAnnotationMeta().setAnnotationEndTime(System.currentTimeMillis());
        task.addAnnotationResult(annotation);
        task.setStatus(TaskStatus.SUBMITTED);
        taskRepository.save(task);

        // Check whether enough annotations have been received
        if (task.getAnnotationResults().size() >= REQUIRED_ANNOTATIONS_PER_TASK) {
            // Trigger the consistency check
            triggerConsistencyCheck(task);
        }

        return AnnotationSubmitResult.success(taskId);
    }

    private double calculateDifficultyMatch(AnnotationTask task, Annotator annotator) {
        // Difficulty match: how closely the annotator's experience level fits the task difficulty
        int diff = Math.abs(task.getDifficultyLevel() - annotator.getDifficultyLevel());
        return 1.0 / (1 + diff);
    }

    private int getCurrentActiveTaskCount(String annotatorId) {
        return taskRepository.countByAssignedAnnotatorIdAndStatus(
                annotatorId, TaskStatus.IN_PROGRESS);
    }

    private void triggerConsistencyCheck(AnnotationTask task) {
        // Publish an event; QualityControlService handles it asynchronously
        applicationEventPublisher.publishEvent(new ConsistencyCheckEvent(task));
    }

    @Autowired
    private ApplicationEventPublisher applicationEventPublisher;

    private ValidationResult validateAnnotationFormat(AnnotationSchema annotation) {
        List<String> errors = new ArrayList<>();

        if (annotation.getCaptionAnnotation() == null) {
            errors.add("Missing main caption annotation");
        } else {
            String shortCaption = annotation.getCaptionAnnotation().getShortCaption();
            // Hard limits (10-100); note these are looser than the 20-50 guideline in the Schema comment
            if (shortCaption == null || shortCaption.length() < 10) {
                errors.add("Short caption must be at least 10 characters");
            }
            if (shortCaption != null && shortCaption.length() > 100) {
                errors.add("Short caption must not exceed 100 characters");
            }
        }

        if (annotation.getAttributes() == null || annotation.getAttributes().isEmpty()) {
            errors.add("At least one attribute annotation is required");
        }

        return errors.isEmpty() ? ValidationResult.valid() : ValidationResult.invalid(errors);
    }
}
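One thing to watch in `tryAcquireTask`: the lock key is per task, so once one annotator holds it, a second annotator cannot pick up the same task until the key expires, which sits awkwardly with double annotation. One possible alternative is to claim per-task slots instead of a single lock. The sketch below shows the idea with an in-memory `ConcurrentHashMap` standing in for the shared store; `TaskSlotRegistry` is hypothetical, and a Redis-backed version would need its own atomic check-and-add, which is not shown here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a shared store: tracks which annotators hold a slot on each task. */
final class TaskSlotRegistry {

    private final int requiredAnnotations;
    private final ConcurrentHashMap<String, Set<String>> holdersByTask = new ConcurrentHashMap<>();

    TaskSlotRegistry(int requiredAnnotations) {
        this.requiredAnnotations = requiredAnnotations;
    }

    /** Returns true if the annotator obtained one of the task's slots. */
    boolean tryClaim(String taskId, String annotatorId) {
        boolean[] claimed = {false};
        // compute() runs atomically per key, so the size check and the add cannot interleave.
        holdersByTask.compute(taskId, (id, holders) -> {
            Set<String> current = holders;
            if (current == null) {
                current = ConcurrentHashMap.newKeySet();
            }
            // add() returns false when this annotator already holds a slot.
            if (current.size() < requiredAnnotations && current.add(annotatorId)) {
                claimed[0] = true;
            }
            return current;
        });
        return claimed[0];
    }

    int claimedCount(String taskId) {
        Set<String> holders = holdersByTask.get(taskId);
        return holders == null ? 0 : holders.size();
    }
}
```

With `requiredAnnotations = 2`, two different annotators can claim the same task, the same annotator cannot claim it twice, and a third annotator is turned away.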

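Quality control: IAA calculation and monitoring

The core metric of annotation quality is inter-annotator agreement (IAA). Different types of annotation task call for different ways of computing it: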

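import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.event.EventListener;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Annotation quality control service.
 * Handles IAA calculation, golden-standard alignment, and annotator performance monitoring.
 */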
@Service
@Slf4j
public class QualityControlService {

    @Autowired
    private AnnotationTaskRepository taskRepository;

    @Autowired
    private AnnotatorPerformanceRepository performanceRepository;

    @Autowired
    private GoldenDatasetRepository goldenDatasetRepository;

    /**
     * Computes Cohen's Kappa agreement coefficient between two annotators.
     * Suitable for categorical annotation tasks.
     *
     * @param annotations1 annotations from annotator 1
     * @param annotations2 annotations from annotator 2 (same items, same order)
     * @param attributeName name of the attribute to evaluate
     * @return Kappa in the range [-1, 1]; > 0.6 is fairly good, > 0.8 is excellent
     */
    public double calculateCohensKappa(List<AnnotationSchema> annotations1,
                                        List<AnnotationSchema> annotations2,
                                        String attributeName) {
        if (annotations1.size() != annotations2.size()) {
            throw new IllegalArgumentException("The two annotation lists differ in size");
        }
        if (annotations1.isEmpty()) {
            throw new IllegalArgumentException("Cannot compute Kappa on empty annotation lists");
        }

        // Collect all category labels
        Set<String> allLabels = new HashSet<>();
        for (int i = 0; i < annotations1.size(); i++) {
            allLabels.add(getAttributeValue(annotations1.get(i), attributeName));
            allLabels.add(getAttributeValue(annotations2.get(i), attributeName));
        }

        List<String> labels = new ArrayList<>(allLabels);
        int n = annotations1.size();
        int k = labels.size();

        // Build the confusion matrix
        int[][] confusionMatrix = new int[k][k];
        for (int i = 0; i < n; i++) {
            String label1 = getAttributeValue(annotations1.get(i), attributeName);
            String label2 = getAttributeValue(annotations2.get(i), attributeName);
            int idx1 = labels.indexOf(label1);
            int idx2 = labels.indexOf(label2);
            if (idx1 >= 0 && idx2 >= 0) {
                confusionMatrix[idx1][idx2]++;
            }
        }

        // Observed agreement Po
        double po = 0;
        for (int i = 0; i < k; i++) {
            po += confusionMatrix[i][i];
        }
        po /= n;

        // Expected (chance) agreement Pe
        double pe = 0;
        for (int i = 0; i < k; i++) {
            int rowSum = 0, colSum = 0;
            for (int j = 0; j < k; j++) {
                rowSum += confusionMatrix[i][j];
                colSum += confusionMatrix[j][i];
            }
            // Divide in double arithmetic so n * n cannot overflow int
            pe += (double) rowSum * colSum / ((double) n * n);
        }

        if (pe == 1.0) {
            return 1.0; // both annotators used one identical label throughout: full agreement
        }

        return (po - pe) / (1 - pe);
    }

    /**
     * Runs the consistency check on a submitted annotation task.
     * Tasks with insufficient agreement are routed to arbitration.
     */
    @EventListener
    @Async
    public void handleConsistencyCheck(ConsistencyCheckEvent event) {
        AnnotationTask task = event.getTask();
        List<AnnotationSchema> results = task.getAnnotationResults();

        if (results.size() < 2) {
            return;
        }

        // Compute agreement along each dimension
        ConsistencyReport report = new ConsistencyReport();
        report.setTaskId(task.getTaskId());

        // 1. Agreement on categorical attributes (color, category, pattern).
        // With a single item per annotator, Kappa degenerates to exact match:
        // 1.0 when the two labels agree, 0.0 when they differ.
        for (String attribute : Arrays.asList("color", "category", "pattern")) {
            double kappa = calculateCohensKappa(
                    Collections.singletonList(results.get(0)),
                    Collections.singletonList(results.get(1)),
                    attribute
            );
            report.addAttributeKappa(attribute, kappa);
        }

        // 2. Text similarity of the captions (for Caption tasks)
        double captionSimilarity = calculateCaptionSimilarity(
                results.get(0).getCaptionAnnotation().getShortCaption(),
                results.get(1).getCaptionAnnotation().getShortCaption()
        );
        report.setCaptionSimilarity(captionSimilarity);

        // 3. Difference between relevance scores
        int scoreDiff = Math.abs(
                results.get(0).getRelevanceScore().getScore() -
                results.get(1).getRelevanceScore().getScore()
        );
        report.setRelevanceScoreDiff(scoreDiff);

        // Decide whether arbitration is needed
        boolean needsArbitration = report.getAverageKappa() < 0.6
                || captionSimilarity < 0.5
                || scoreDiff > 1;

        if (needsArbitration) {
            log.warn("Task {} has insufficient agreement, Kappa={}, CaptionSim={}, ScoreDiff={}; routing to arbitration",
                    task.getTaskId(), report.getAverageKappa(), captionSimilarity, scoreDiff);
            routeToArbitration(task, report);
        } else {
            // Merge the annotation results (average or majority vote)
            mergeAnnotationResults(task, results);
            task.setStatus(TaskStatus.COMPLETED);
            task.setQualityReport(report);
            taskRepository.save(task);

            // Update annotator performance
            updateAnnotatorPerformance(results, report);
        }
    }

    /**
     * Golden-standard test: checks an annotator against tasks with known correct answers
     * that are periodically mixed into the queue.
     */
    public GoldenTestResult evaluateAnnotatorWithGoldenSet(String annotatorId) {
        List<GoldenDataItem> goldenItems = goldenDatasetRepository.findRandomSample(20);

        int correctCount = 0;
        List<GoldenTestDetail> details = new ArrayList<>();

        for (GoldenDataItem item : goldenItems) {
            // Fetch this annotator's result for the golden task
            AnnotationSchema annotatorResult = taskRepository
                    .findAnnotationByTaskAndAnnotator(item.getTaskId(), annotatorId);

            if (annotatorResult == null) {
                continue; // the annotator has not annotated this task yet
            }

            // Compare with the golden standard
            boolean isCorrect = compareWithGoldenStandard(annotatorResult, item.getGoldenAnnotation());
            if (isCorrect) correctCount++;

            details.add(GoldenTestDetail.builder()
                    .taskId(item.getTaskId())
                    .isCorrect(isCorrect)
                    .annotatorAnswer(annotatorResult)
                    .goldenAnswer(item.getGoldenAnnotation())
                    .build());
        }

        // Only tasks the annotator actually annotated count toward accuracy;
        // dividing by the full sample would penalize tasks that were never seen.
        int testedCount = details.size();
        double accuracy = testedCount == 0 ? 0 : (double) correctCount / testedCount;

        GoldenTestResult result = GoldenTestResult.builder()
                .annotatorId(annotatorId)
                .accuracy(accuracy)
                .testedCount(testedCount)
                .correctCount(correctCount)
                .details(details)
                .testTime(System.currentTimeMillis())
                .build();

        // Accuracy below the threshold: suspend task assignment (skipped when nothing was tested)
        if (testedCount > 0 && accuracy < 0.75) {
            log.error("Annotator {} scored {} on the golden-standard test, below the threshold; suspending task assignment",
                    annotatorId, accuracy);
            suspendAnnotator(annotatorId, "Golden-standard accuracy too low: " + accuracy);
        }

        return result;
    }

    /**
     * Caption similarity based on character N-grams (Jaccard overlap of bigram sets).
     * Lightweight, with no model dependency.
     */
    private double calculateCaptionSimilarity(String caption1, String caption2) {
        if (caption1 == null || caption2 == null) return 0.0;

        Set<String> ngrams1 = extractNgrams(caption1, 2);
        Set<String> ngrams2 = extractNgrams(caption2, 2);

        Set<String> intersection = new HashSet<>(ngrams1);
        intersection.retainAll(ngrams2);

        Set<String> union = new HashSet<>(ngrams1);
        union.addAll(ngrams2);

        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    private Set<String> extractNgrams(String text, int n) {
        Set<String> ngrams = new HashSet<>();
        // Strip whitespace, punctuation, and symbols, then take character-level N-grams.
        // \p{P}, \p{S}, and \p{Z} are Unicode categories, so full-width Chinese punctuation
        // and spaces are removed too (\p{Punct} alone only covers ASCII).
        String normalized = text.replaceAll("[\\s\\p{Z}\\p{P}\\p{S}]", "").toLowerCase();
        for (int i = 0; i <= normalized.length() - n; i++) {
            ngrams.add(normalized.substring(i, i + n));
        }
        return ngrams;
    }

    private String getAttributeValue(AnnotationSchema annotation, String attributeName) {
        return annotation.getAttributes().stream()
                .filter(a -> attributeName.equals(a.getAttributeName()))
                .map(AnnotationSchema.AttributeAnnotation::getValue)
                .findFirst()
                .orElse("UNKNOWN");
    }

    private void routeToArbitration(AnnotationTask task, ConsistencyReport report) {
        task.setStatus(TaskStatus.ARBITRATION_REQUIRED);
        task.setConsistencyReport(report);
        taskRepository.save(task);
        // Notify the arbiter
    }

    private void updateAnnotatorPerformance(List<AnnotationSchema> results, ConsistencyReport report) {
        for (AnnotationSchema result : results) {
            String annotatorId = result.getAnnotationMeta().getAnnotatorId();
            AnnotatorPerformance perf = performanceRepository.findByAnnotatorId(annotatorId)
                    .orElseGet(() -> AnnotatorPerformance.newInstance(annotatorId));
            perf.addKappaRecord(report.getAverageKappa());
            perf.incrementCompletedCount();
            performanceRepository.save(perf);
        }
    }

    private void suspendAnnotator(String annotatorId, String reason) {
        // Suspend the annotator's eligibility to take tasks
    }

    private void mergeAnnotationResults(AnnotationTask task, List<AnnotationSchema> results) {
        // Merge logic: majority vote for categories, average for scores
    }

    private boolean compareWithGoldenStandard(AnnotationSchema annotatorResult,
                                               AnnotationSchema goldenAnnotation) {
        // Multi-dimensional comparison with the golden standard (placeholder: always returns true)
        return true;
    }
}
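Because `handleConsistencyCheck` above passes single-item lists, the per-task value it records is effectively an exact-match flag. Kappa becomes informative once it is computed over a batch of items labeled by the same two annotators. The standalone sketch below works on plain label lists so it can be run without the Spring classes; `KappaExample` and the sample labels are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KappaExample {

    /** Cohen's Kappa for two annotators labeling the same items in the same order. */
    static double cohensKappa(List<String> labelsA, List<String> labelsB) {
        if (labelsA.size() != labelsB.size() || labelsA.isEmpty()) {
            throw new IllegalArgumentException("label lists must be non-empty and equal in size");
        }
        int n = labelsA.size();
        Map<String, Integer> countA = new HashMap<>();
        Map<String, Integer> countB = new HashMap<>();
        int agreements = 0;
        for (int i = 0; i < n; i++) {
            countA.merge(labelsA.get(i), 1, Integer::sum);
            countB.merge(labelsB.get(i), 1, Integer::sum);
            if (labelsA.get(i).equals(labelsB.get(i))) {
                agreements++;
            }
        }
        // Observed agreement: share of items with identical labels.
        double po = (double) agreements / n;
        // Chance agreement: sum over labels of the product of both annotators' label frequencies.
        double pe = 0;
        for (Map.Entry<String, Integer> e : countA.entrySet()) {
            int inB = countB.getOrDefault(e.getKey(), 0);
            pe += ((double) e.getValue() / n) * ((double) inB / n);
        }
        if (pe == 1.0) {
            return 1.0; // both used one identical label throughout
        }
        return (po - pe) / (1 - pe);
    }
}
```

As a worked case, take ten items where annotator A labels five "red" then five "blue", and annotator B labels "red, red, red, red, blue, blue, blue, blue, blue, red". They agree on 8 of 10 items (Po = 0.8), each uses both labels half the time (Pe = 0.5), and Kappa comes out at about 0.6. With a single item per list, the same function returns 1.0 for matching labels and 0.0 for differing ones.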

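Annotator performance dashboard

Quality control is not only after-the-fact review; annotator status also needs to be monitored in real time: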

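import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.time.LocalDate;

/**
 * Annotator performance statistics and alerts.
 */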
@RestController
@RequestMapping("/api/annotation/quality")
public class AnnotationQualityDashboardController {

    @Autowired
    private QualityControlService qualityControlService;

    @Autowired
    private AnnotatorPerformanceRepository performanceRepository;

    /**
     * Returns the performance report for one annotator.
     */
    @GetMapping("/annotator/{annotatorId}/performance")
    public AnnotatorPerformanceReport getAnnotatorPerformance(
            @PathVariable String annotatorId,
            @RequestParam(defaultValue = "7") int days) {

        AnnotatorPerformance perf = performanceRepository.findByAnnotatorId(annotatorId)
                .orElseThrow(() -> new AnnotatorNotFoundException(annotatorId));

        LocalDate endDate = LocalDate.now();
        LocalDate startDate = endDate.minusDays(days);

        return AnnotatorPerformanceReport.builder()
                .annotatorId(annotatorId)
                .period(startDate + " ~ " + endDate)
                .totalCompleted(perf.getCompletedCountInPeriod(startDate, endDate))
                .averageKappa(perf.getAverageKappaInPeriod(startDate, endDate))
                .averageAnnotationTimeSeconds(perf.getAverageAnnotationTime(startDate, endDate))
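                // Note: this runs a fresh golden test on every request, which may suspend the annotator as a side effect
                .goldenAccuracy(qualityControlService.evaluateAnnotatorWithGoldenSet(annotatorId).getAccuracy())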
                .arbitrationRate(perf.getArbitrationRateInPeriod(startDate, endDate))
                .build();
    }

    /**
     * Returns an overview of overall data quality for a project.
     */
    @GetMapping("/overview")
    public QualityOverview getQualityOverview(@RequestParam String projectId) {
        return QualityOverview.builder()
                .projectId(projectId)
                .totalTasks(taskRepository.countByProjectId(projectId))
                .completedTasks(taskRepository.countByProjectIdAndStatus(projectId, TaskStatus.COMPLETED))
                .arbitrationPendingTasks(taskRepository.countByProjectIdAndStatus(projectId, TaskStatus.ARBITRATION_REQUIRED))
                .averageIAA(taskRepository.getAverageKappaByProject(projectId))
                .annotatorCount(annotatorRepository.countActiveByProject(projectId))
                .build();
    }

    @Autowired
    private AnnotationTaskRepository taskRepository;

    @Autowired
    private AnnotatorRepository annotatorRepository;
}

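Hard sample mining: making annotation count

Not every image deserves the same annotation effort. With Active Learning, the samples that improve the model most are annotated first: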

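import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

/**
 * Active learning sample selection strategies.
 * Prioritizes samples on which the model is most uncertain.
 */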
@Service
public class ActiveLearningSelector {

    @Autowired
    private ModelInferenceClient modelClient;

    /**
     * Uncertainty sampling: select the samples with the lowest prediction confidence.
     * These are usually boundary samples and carry the most annotation value.
     */
    public List<String> selectByUncertaintySampling(List<String> candidateImageIds, int topK) {
        Map<String, Double> uncertaintyScores = new HashMap<>();

        for (String imageId : candidateImageIds) {
            // Get the model's predicted probability distribution for this image
            ModelPrediction prediction = modelClient.predict(imageId);

            // Use entropy as the uncertainty measure
            double entropy = calculateEntropy(prediction.getProbabilities());
            uncertaintyScores.put(imageId, entropy);
        }

        // Return the TopK samples with the highest uncertainty
        return uncertaintyScores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(topK)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

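    /**
     * Diversity sampling: select samples spread evenly across the feature space
     * to avoid over-annotating similar samples.
     */
    public List<String> selectByDiversitySampling(List<float[]> featureVectors,
                                                    List<String> imageIds, int topK) {
        // Greedy Core-Set (farthest-point) selection
        List<String> selected = new ArrayList<>();
        List<float[]> selectedFeatures = new ArrayList<>();
        if (imageIds.isEmpty() || topK <= 0) {
            return selected; // nothing to choose from; also avoids Random.nextInt(0)
        }
        boolean[] picked = new boolean[imageIds.size()];
        int target = Math.min(topK, imageIds.size()); // cannot select more than we have

        // Pick the first sample at random
        int firstIdx = new Random().nextInt(imageIds.size());
        picked[firstIdx] = true;
        selected.add(imageIds.get(firstIdx));
        selectedFeatures.add(featureVectors.get(firstIdx));

        while (selected.size() < target) {
            double maxMinDist = -1;
            int bestIdx = -1;

            for (int i = 0; i < imageIds.size(); i++) {
                if (picked[i]) continue;

                // Distance to the nearest already-selected point.
                // Copied to a local because a lambda cannot capture the loop variable i.
                float[] candidate = featureVectors.get(i);
                double minDist = selectedFeatures.stream()
                        .mapToDouble(sf -> euclideanDistance(candidate, sf))
                        .min()
                        .orElse(Double.MAX_VALUE);

                if (minDist > maxMinDist) {
                    maxMinDist = minDist;
                    bestIdx = i;
                }
            }

            if (bestIdx < 0) {
                break; // no candidates left; prevents an endless loop
            }
            picked[bestIdx] = true;
            selected.add(imageIds.get(bestIdx));
            selectedFeatures.add(featureVectors.get(bestIdx));
        }

        return selected;
    }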

    private double calculateEntropy(double[] probabilities) {
        double entropy = 0;
        for (double p : probabilities) {
            if (p > 0) {
                entropy -= p * Math.log(p) / Math.log(2);
            }
        }
        return entropy;
    }

    private double euclideanDistance(float[] v1, float[] v2) {
        double sum = 0;
        for (int i = 0; i < v1.length; i++) {
            sum += Math.pow(v1[i] - v2[i], 2);
        }
        return Math.sqrt(sum);
    }
}
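To get a feel for the entropy score used in `selectByUncertaintySampling`, here is a small standalone sketch that ranks a few made-up predictions. `EntropyRanking`, the image IDs, and the probabilities are illustrative assumptions; the real service would obtain the distributions from the model client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class EntropyRanking {

    /** Shannon entropy in bits; zero-probability classes contribute nothing. */
    static double entropyBits(double[] probabilities) {
        double entropy = 0;
        for (double p : probabilities) {
            if (p > 0) {
                entropy -= p * Math.log(p) / Math.log(2);
            }
        }
        return entropy;
    }

    /** Returns up to topK image IDs, most uncertain (highest entropy) first. */
    static List<String> topKByEntropy(Map<String, double[]> predictions, int topK) {
        List<Map.Entry<String, double[]>> entries = new ArrayList<>(predictions.entrySet());
        entries.sort((x, y) -> Double.compare(entropyBits(y.getValue()), entropyBits(x.getValue())));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

A two-class prediction of [0.5, 0.5] carries 1 bit of entropy, the maximum for two classes, while [1.0, 0.0] carries none. Given predictions [0.9, 0.1], [0.5, 0.5], and [0.7, 0.3], the even split is ranked first and the confident 0.9 prediction last.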

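A few pitfalls from practice

Pitfall 1: the annotation guide is too vague. Our first guide said "describe the main content of the image", and every annotator read it differently. We later changed it to "describe the single most prominent object in the image, covering color + material + shape + use, in no more than 30 characters", and agreement immediately rose from 0.52 to 0.74.

Pitfall 2: the golden-standard dataset is not refreshed in time. As business scenarios change, the original golden standard may no longer apply, yet evaluation still runs on the old dataset, so the data produced by high-scoring annotators has in fact already slipped in quality. We recommend refreshing 20% of the golden-standard data every month.

Pitfall 3: the trade-off between annotation speed and quality is not quantified. At first we measured only quantity, not quality, so annotators skimmed through images to earn more and IAA was only 0.4. We then introduced "quality-weighted output" for performance: credited volume = annotated count × Kappa coefficient. Annotator behavior changed immediately.

Pitfall 4: arbitration becomes the bottleneck. Early on, arbitration was handled entirely by experts by hand, and once annotation volume grew the arbitration queue backed up badly. The fix was "model-assisted arbitration": for cases with minor disagreement, the model proposes a resolution and the arbiter only needs to confirm it, which made arbitration 5 times more efficient.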
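Two of the fixes above reduce to small calculations, sketched below under stated assumptions. `AnnotationOpsSketch` and both method names are hypothetical. Clamping a negative Kappa to zero and replacing the oldest golden items first are my own choices for the sketch, not something the process above prescribes.

```java
import java.util.ArrayList;
import java.util.List;

final class AnnotationOpsSketch {

    /**
     * Pitfall 3: credited volume = annotated count x Kappa.
     * Assumption: a negative Kappa (worse than chance) earns no credit.
     */
    static double creditedVolume(int annotatedCount, double kappa) {
        return annotatedCount * Math.max(0.0, kappa);
    }

    /**
     * Pitfall 2: replace a percentage of the golden set with fresh items.
     * Assumption: the list is ordered oldest first and the oldest items are retired.
     * The replace count is rounded up and capped by what is available.
     */
    static List<String> rotateGoldenSet(List<String> oldestFirst, List<String> freshItems, int percent) {
        int replaceCount = (oldestFirst.size() * percent + 99) / 100; // integer ceiling
        replaceCount = Math.max(0, replaceCount);
        replaceCount = Math.min(replaceCount, Math.min(oldestFirst.size(), freshItems.size()));

        // Keep everything after the retired prefix, then append the fresh items.
        List<String> next = new ArrayList<>(oldestFirst.subList(replaceCount, oldestFirst.size()));
        next.addAll(freshItems.subList(0, replaceCount));
        return next;
    }
}
```

With these definitions, an annotator who rushes through 500 items at a Kappa of 0.4 is credited 200, while one who completes 300 items at 0.8 is credited 240, so careful work pays better. For the golden set, rotating 20% of a five-item set retires the oldest item and appends one fresh item.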

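Summary

The core idea of multimodal data annotation engineering is that "data quality is an engineering problem, not a management problem."

From task design (Schema clarity) and platform implementation (state machine + distributed lock) to quality control (IAA calculation + golden standard) and active learning (hard samples first), every stage offers engineering levers.

The ceiling of data quality sets the ceiling of model capability. Every bit of engineering rigor invested in data annotation eventually shows up in model performance.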