第1745篇：向量数据的生命周期管理——索引更新、增量Embedding与过期淘汰

老张大约 12 分钟

第1745篇：向量数据的生命周期管理——索引更新、增量Embedding与过期淘汰

搭建一个能跑起来的向量数据库系统不难，难的是跑起来之后怎么维护。

我见过不少团队在早期把所有文档全部 Embedding 进了 Milvus 或者 Qdrant，系统运行了三个月，发现有几个让人头疼的问题：业务文档每天都在更新，但向量库里的还是三个月前的老版本；一些早就不用的文档还占着大量存储空间；随着数据量增长，查询开始变慢，索引什么时候重建、怎么重建，谁也说不清楚。

这些问题合在一起，就是向量数据的生命周期管理问题。这篇文章把这套体系讲清楚。

一、向量数据生命周期的几个阶段

一条向量数据从诞生到被淘汰，大致要经历这几个阶段：

每个阶段都有工程上的挑战，我们逐一来看。

二、增量 Embedding：内容更新时的向量同步

2.1 问题的本质

企业知识库里的文档每天都在更新：产品手册改版了、FAQ 条目修正了、合同模板更新了。如果向量库和原始内容不同步，RAG 系统检索到的是旧内容，生成的答案就会出错，而且这种错误很隐蔽——系统看上去在正常运行，但给出的信息是错的。

2.2 变更检测机制

要做增量同步，首先要知道哪些内容发生了变化。不能全量重做——一个有几十万篇文档的知识库，每次全量 Embedding 的成本和耗时都无法接受。

变更检测的核心是内容哈希：对每份文档计算 MD5/SHA256，存起来。下次处理时对比哈希，不同的才重新 Embedding。

@Service
public class DocumentSyncService {

    @Autowired
    private DocumentRepository documentRepository;

    @Autowired
    private VectorStoreService vectorStoreService;

    @Autowired
    private EmbeddingService embeddingService;

    /**
     * 增量同步：只处理有变化的文档
     */
    @Scheduled(fixedDelay = 300000)  // 每 5 分钟检查一次
    public void incrementalSync() {
        // 找出所有需要同步的文档（新增或内容已变更）
        List<DocumentRecord> changedDocs = documentRepository
            .findByStatusIn(Arrays.asList(
                DocumentStatus.NEW,
                DocumentStatus.CONTENT_CHANGED,
                DocumentStatus.PENDING_SYNC));

        if (changedDocs.isEmpty()) {
            return;
        }

        log.info("发现 {} 个文档需要同步", changedDocs.size());

        for (DocumentRecord doc : changedDocs) {
            try {
                syncDocument(doc);
            } catch (Exception e) {
                log.error("文档同步失败: docId={}", doc.getId(), e);
                doc.setStatus(DocumentStatus.SYNC_FAILED);
                doc.setSyncError(e.getMessage());
                documentRepository.save(doc);
            }
        }
    }

    private void syncDocument(DocumentRecord doc) {
        // 1. 计算当前内容哈希
        String currentHash = DigestUtils.sha256Hex(doc.getContent());

        // 2. 如果哈希没变，跳过（防重复处理）
        if (currentHash.equals(doc.getLastSyncedHash())) {
            doc.setStatus(DocumentStatus.SYNCED);
            documentRepository.save(doc);
            return;
        }

        // 3. 删除旧向量（如果存在）
        if (doc.getLastSyncedHash() != null) {
            vectorStoreService.deleteByDocumentId(doc.getId());
            log.info("删除旧向量: docId={}", doc.getId());
        }

        // 4. 文档分块
        List<TextChunk> chunks = splitDocument(doc.getContent(), doc.getId());

        // 5. 批量生成 Embedding
        List<float[]> embeddings = embeddingService.batchEmbed(
            chunks.stream().map(TextChunk::getText).collect(Collectors.toList()));

        // 6. 写入向量库
        List<VectorRecord> vectors = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            VectorRecord vr = new VectorRecord();
            vr.setId(doc.getId() + "_chunk_" + i);
            vr.setDocumentId(doc.getId());
            vr.setChunkIndex(i);
            vr.setVector(embeddings.get(i));
            vr.setMetadata(buildMetadata(doc, chunks.get(i)));
            vr.setCreatedAt(System.currentTimeMillis());
            vectors.add(vr);
        }
        vectorStoreService.batchInsert(vectors);

        // 7. 更新同步状态
        doc.setLastSyncedHash(currentHash);
        doc.setLastSyncedAt(LocalDateTime.now());
        doc.setChunkCount(chunks.size());
        doc.setStatus(DocumentStatus.SYNCED);
        documentRepository.save(doc);

        log.info("文档同步完成: docId={}, chunks={}", doc.getId(), chunks.size());
    }

    /**
     * 文档分块策略：按段落切分，保留上下文重叠
     */
    private List<TextChunk> splitDocument(String content, Long docId) {
        List<TextChunk> chunks = new ArrayList<>();
        int chunkSize = 512;       // 每块最大字符数
        int overlapSize = 64;      // 相邻块重叠字符数

        // 先按段落分割
        String[] paragraphs = content.split("\n\n+");

        StringBuilder current = new StringBuilder();
        int chunkIndex = 0;

        for (String paragraph : paragraphs) {
            if (current.length() + paragraph.length() > chunkSize
                    && current.length() > 0) {
                // 当前块满了，保存并开始新块
                String chunkText = current.toString().trim();
                chunks.add(new TextChunk(docId + "_" + chunkIndex, chunkText));
                chunkIndex++;

                // 保留末尾的重叠内容
                String overlap = chunkText.length() > overlapSize
                    ? chunkText.substring(chunkText.length() - overlapSize)
                    : chunkText;
                current = new StringBuilder(overlap).append("\n\n");
            }
            current.append(paragraph).append("\n\n");
        }

        if (current.length() > 0) {
            chunks.add(new TextChunk(docId + "_" + chunkIndex,
                current.toString().trim()));
        }

        return chunks;
    }
}

2.3 监控内容变更的方式

对于存在 MySQL 里的文档，最优雅的方式是监听 binlog（使用 Canal/Debezium），把内容变更事件推到 Kafka，再由消费者触发增量同步，比轮询数据库更高效：

@Component
public class DocumentChangeEventListener {

    @Autowired
    private DocumentSyncService syncService;

    @KafkaListener(topics = "document-change-events",
                   groupId = "vector-sync-group")
    public void onDocumentChange(DocumentChangeEvent event) {
        if (event.getOperationType() == OperationType.DELETE) {
            // 文档删除：立即清理向量
            vectorStoreService.deleteByDocumentId(event.getDocumentId());
            log.info("文档已删除，清理向量: docId={}", event.getDocumentId());
            return;
        }

        // 插入或更新：标记为待同步，由定时任务批量处理
        documentRepository.markPendingSync(event.getDocumentId());
    }
}

三、索引更新策略

3.1 向量索引为什么需要更新

向量数据库（Milvus、Qdrant、Weaviate）在数据量增长到一定程度后，需要重建索引来维持查询性能。原因是：

HNSW 索引在频繁删除后会产生"孤立节点"，降低召回率
数据分布发生大幅变化（新增了大量不同领域的内容），原来的索引参数可能不再最优
删除操作在大多数向量数据库里是"软删除"，不立即释放存储，需要通过 Compact 来回收空间

3.2 索引健康度监控

在 Milvus 里，可以通过以下方式监控索引状态：

@Service
public class VectorIndexHealthService {

    @Autowired
    private MilvusClient milvusClient;

    /**
     * 检查集合的索引健康度
     */
    public IndexHealthReport checkIndexHealth(String collectionName) {
        // 获取集合统计信息
        R<GetCollectionStatisticsResponse> statsResp =
            milvusClient.getCollectionStatistics(
                GetCollectionStatisticsParam.newBuilder()
                    .withCollectionName(collectionName)
                    .build());

        long rowCount = Long.parseLong(statsResp.getData()
            .getStats(0).getValue());

        // 获取删除比例（需要自己统计，Milvus 不直接暴露）
        long deletedCount = getDeletedCount(collectionName);
        double deleteRatio = rowCount > 0 ?
            (double) deletedCount / (rowCount + deletedCount) : 0;

        IndexHealthReport report = new IndexHealthReport();
        report.setCollectionName(collectionName);
        report.setTotalCount(rowCount);
        report.setDeletedCount(deletedCount);
        report.setDeleteRatio(deleteRatio);

        // 判断是否需要 Compact
        report.setNeedsCompact(deleteRatio > 0.2);  // 删除超 20% 触发 Compact

        // 评估查询性能（抽样测试）
        long avgQueryLatencyMs = benchmarkQueryLatency(collectionName);
        report.setAvgQueryLatencyMs(avgQueryLatencyMs);
        report.setNeedsReindex(avgQueryLatencyMs > 200);  // 超 200ms 考虑重建索引

        return report;
    }

    /**
     * 执行 Compact（异步，不阻塞主流程）
     */
    public void triggerCompact(String collectionName) {
        R<ManualCompactionResponse> resp = milvusClient.manualCompact(
            ManualCompactParam.newBuilder()
                .withCollectionName(collectionName)
                .build());

        long compactionId = resp.getData().getCompactionID();
        log.info("触发 Compact: collectionName={}, compactionId={}",
            collectionName, compactionId);

        // 异步等待 Compact 完成
        CompletableFuture.runAsync(() -> waitForCompact(collectionName, compactionId));
    }

    private void waitForCompact(String collectionName, long compactionId) {
        for (int i = 0; i < 60; i++) {
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            R<GetCompactionStateResponse> stateResp = milvusClient.getCompactionState(
                GetCompactionStateParam.newBuilder()
                    .withCompactionID(compactionId)
                    .build());

            if (stateResp.getData().getState() == CompactionState.Completed) {
                log.info("Compact 完成: collectionName={}", collectionName);
                return;
            }
        }
        log.warn("Compact 超时: collectionName={}, compactionId={}",
            collectionName, compactionId);
    }

    private long benchmarkQueryLatency(String collectionName) {
        // 用随机向量做几次查询，取平均耗时
        float[] randomVector = generateRandomVector(1536);
        int trials = 5;
        long totalMs = 0;

        for (int i = 0; i < trials; i++) {
            long start = System.currentTimeMillis();
            milvusClient.search(SearchParam.newBuilder()
                .withCollectionName(collectionName)
                .withVectors(Collections.singletonList(randomVector))
                .withTopK(10)
                .build());
            totalMs += System.currentTimeMillis() - start;
        }

        return totalMs / trials;
    }
}

3.3 热索引与冷索引切换（不停服重建）

重建索引是高危操作——重建期间查询性能下降，甚至无法查询。解决方案是蓝绿索引策略：

@Service
public class VectorIndexMigrationService {

    @Autowired
    private MilvusClient milvusClient;

    @Autowired
    private TrafficSplitter trafficSplitter;

    /**
     * 蓝绿切换：不停服重建索引
     */
    public void migrateIndex(String sourceCollection, String targetCollection) {
        log.info("开始索引迁移: {} -> {}", sourceCollection, targetCollection);

        // 1. 从源集合全量导出数据，重新 Embedding 并写入目标集合
        // 实际实现中，这是一个耗时的批量操作，需要异步执行
        exportAndRebuild(sourceCollection, targetCollection);

        // 2. 增量同步：迁移期间新产生的数据同步到两个集合
        trafficSplitter.enableDualWrite(sourceCollection, targetCollection);

        // 3. 性能测试：对比两个集合的查询延迟和召回率
        MigrationQualityReport report = compareMigrationQuality(
            sourceCollection, targetCollection);

        if (!report.isPassed()) {
            log.error("迁移质量检查未通过: {}", report);
            trafficSplitter.disableDualWrite();
            return;
        }

        // 4. 灰度切流：10% -> 50% -> 100%
        int[] trafficRatios = {10, 50, 100};
        for (int ratio : trafficRatios) {
            trafficSplitter.setTrafficRatio(targetCollection, ratio);
            log.info("切流到 {}%: 目标集合={}", ratio, targetCollection);

            // 观察 5 分钟，检查错误率
            try {
                Thread.sleep(300000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }

            if (trafficSplitter.getErrorRate(targetCollection) > 0.01) {
                log.error("迁移过程中错误率过高，回滚");
                trafficSplitter.rollback(sourceCollection);
                return;
            }
        }

        // 5. 迁移完成，停用源集合
        trafficSplitter.disableDualWrite();
        log.info("索引迁移完成，下线源集合: {}", sourceCollection);
        // 保留源集合一周再删除，以防需要回滚
    }
}

四、向量数据的过期淘汰

4.1 哪些向量数据应该被淘汰

并不是所有"老"的向量数据都应该淘汰，关键看业务语义：

时效性内容：新闻资讯、活动公告、价格信息——有明确的过期时间
被替代的内容：旧版本的文档、已更新的 FAQ——原始内容被新版本替代后，旧向量应删除
低访问频率内容：长时间没有被查询命中的向量——可以考虑"冷归档"
业务下线内容：产品下线、部门解散——对应的知识库向量需要主动清理

4.2 TTL 机制的实现

@Entity
@Table(name = "vector_metadata")
public class VectorMetadata {

    @Id
    private String vectorId;      // 对应向量库里的 ID

    private Long documentId;       // 关联的原始文档
    private String collectionName; // 向量库集合名

    private LocalDateTime createdAt;
    private LocalDateTime updatedAt;
    private LocalDateTime lastHitAt;     // 最后一次被查询命中的时间
    private Long hitCount;               // 累计命中次数

    private LocalDateTime expiresAt;     // TTL 过期时间（可为 null 表示永不过期）
    private Boolean tombstone;           // 逻辑删除标记

    // 内容类型，影响淘汰策略
    private String contentType;   // NEWS, FAQ, DOCUMENT, POLICY, etc.
}

@Service
public class VectorExpirationService {

    @Autowired
    private VectorMetadataRepository metadataRepository;

    @Autowired
    private VectorStoreService vectorStoreService;

    /**
     * 过期淘汰定时任务：每小时运行一次
     */
    @Scheduled(cron = "0 0 * * * ?")
    public void expireVectors() {
        LocalDateTime now = LocalDateTime.now();

        // 1. 处理 TTL 过期的向量
        List<VectorMetadata> ttlExpired = metadataRepository
            .findByExpiresAtBeforeAndTombstoneFalse(now);

        log.info("TTL 过期向量: {} 个", ttlExpired.size());

        for (VectorMetadata meta : ttlExpired) {
            try {
                vectorStoreService.delete(meta.getCollectionName(), meta.getVectorId());
                meta.setTombstone(true);
                metadataRepository.save(meta);
            } catch (Exception e) {
                log.error("删除过期向量失败: {}", meta.getVectorId(), e);
            }
        }

        // 2. 处理长期不活跃的向量（LRU 淘汰）
        LocalDateTime inactiveThreshold = now.minusDays(90);  // 90 天未命中
        List<VectorMetadata> inactiveVectors = metadataRepository
            .findByLastHitAtBeforeAndContentTypeAndTombstoneFalse(
                inactiveThreshold, "NEWS");  // 新闻类内容用更激进的淘汰策略

        log.info("长期不活跃向量: {} 个", inactiveVectors.size());

        // 不立即删除，先降权（降低召回排名）或归档
        for (VectorMetadata meta : inactiveVectors) {
            archiveInactiveVector(meta);
        }
    }

    private void archiveInactiveVector(VectorMetadata meta) {
        // 把向量数据从热存储转移到冷存储
        // 热存储：Milvus（内存+SSD，低延迟）
        // 冷存储：文件系统（只在需要时才加载）
        String vectorData = vectorStoreService.exportVector(
            meta.getCollectionName(), meta.getVectorId());

        coldStorageService.archive(meta.getVectorId(), vectorData);

        // 从热存储删除
        vectorStoreService.delete(meta.getCollectionName(), meta.getVectorId());

        meta.setTombstone(true);
        meta.setArchived(true);
        metadataRepository.save(meta);

        log.info("向量归档: vectorId={}", meta.getVectorId());
    }

    /**
     * 查询命中回调：更新命中记录
     */
    public void recordHit(String vectorId) {
        metadataRepository.updateLastHit(vectorId, LocalDateTime.now());
    }
}

4.3 内容类型与淘汰策略的对应关系

不同类型的内容需要不同的淘汰策略：

public enum ContentExpirationPolicy {

    NEWS("新闻资讯", 7, 30),          // TTL 7 天，30 天不活跃则归档
    ACTIVITY("活动公告", 30, 60),      // TTL 30 天
    FAQ("常见问题", -1, 180),          // 无 TTL，180 天不活跃则归档
    POLICY("政策制度", -1, 365),       // 无 TTL，365 天不活跃则归档
    PRODUCT_DOC("产品文档", -1, 180);  // 无 TTL，180 天不活跃则归档

    private final String name;
    private final int ttlDays;          // -1 表示不过期
    private final int inactiveDays;     // 不活跃多少天后触发归档

    ContentExpirationPolicy(String name, int ttlDays, int inactiveDays) {
        this.name = name;
        this.ttlDays = ttlDays;
        this.inactiveDays = inactiveDays;
    }
}

@Service
public class VectorLifecycleManager {

    public void onDocumentIndexed(VectorMetadata meta, String contentType) {
        ContentExpirationPolicy policy = ContentExpirationPolicy.valueOf(contentType);

        if (policy.getTtlDays() > 0) {
            meta.setExpiresAt(LocalDateTime.now().plusDays(policy.getTtlDays()));
        }

        metadataRepository.save(meta);
    }
}

五、查询命中率的监控与反馈

5.1 向量查询的可观测性

向量库的查询命中情况是评估数据质量的重要信号：哪些向量经常被命中、哪些从来不被命中、平均召回排名如何。

@Service
public class VectorQueryMetricsService {

    @Autowired
    private MeterRegistry meterRegistry;

    /**
     * 记录一次查询的指标
     */
    public void recordQuery(VectorQueryResult result) {
        // 命中数量
        meterRegistry.counter("vector.query.hits",
            "collection", result.getCollectionName())
            .increment(result.getHits().size());

        // 查询延迟
        meterRegistry.timer("vector.query.latency",
            "collection", result.getCollectionName())
            .record(result.getLatencyMs(), TimeUnit.MILLISECONDS);

        // 最高相似度分（衡量召回质量）
        if (!result.getHits().isEmpty()) {
            double topScore = result.getHits().get(0).getScore();
            meterRegistry.gauge("vector.query.top_score",
                Tags.of("collection", result.getCollectionName()),
                topScore);

            // 相似度过低说明查询内容和索引内容差异较大（数据漂移信号）
            if (topScore < 0.6) {
                log.warn("向量查询相似度偏低: score={}, query={}",
                    topScore, result.getQueryText());
            }
        }

        // 更新每个命中向量的命中记录
        result.getHits().forEach(hit ->
            vectorExpirationService.recordHit(hit.getVectorId()));
    }

    /**
     * 每周生成向量使用报告
     */
    @Scheduled(cron = "0 0 8 * * MON")
    public void generateWeeklyReport() {
        List<String> collections = vectorStoreService.listCollections();

        for (String collection : collections) {
            // 统计过去 7 天命中率为 0 的向量
            long totalVectors = vectorMetadataRepository
                .countByCollectionName(collection);
            long neverHitVectors = vectorMetadataRepository
                .countByCollectionNameAndHitCountEquals(collection, 0L);

            double neverHitRatio = totalVectors > 0 ?
                (double) neverHitVectors / totalVectors : 0;

            log.info("向量使用报告 - 集合: {}, 总量: {}, 从未命中: {} ({:.1f}%)",
                collection, totalVectors, neverHitVectors,
                neverHitRatio * 100);

            if (neverHitRatio > 0.5) {
                alertService.sendAlert(AlertLevel.WARNING,
                    String.format("集合 %s 有超过 50%% 的向量从未被命中，" +
                        "建议检查数据质量和分块策略", collection));
            }
        }
    }
}

六、实战中踩的坑

坑一：软删除积累导致查询结果"污染"

Milvus 的删除操作是软删除，被删除的向量在 Compact 之前仍然参与查询，只是在返回结果时被过滤掉。如果删除量很大，查询实际上做了大量无用的计算，Compact 不及时会导致性能持续恶化。

解决：建立删除比例监控，超过 20% 自动触发 Compact。

坑二：Embedding 模型升级后忘记重建索引

把 Embedding 模型从 text-embedding-ada-002 升级到 text-embedding-3-small，向量空间发生了变化。如果只更新了新增文档的 Embedding，旧文档还是旧模型的向量，两者在同一个索引里共存，查询结果会严重劣化。

解决：模型版本和向量数据绑定，Embedding 模型升级必须触发全量重建。

// 向量元数据里记录 Embedding 模型版本
public class VectorMetadata {
    private String embeddingModelVersion;  // 如 "text-embedding-3-small-v1"
}

// 查询时只使用与当前模型版本匹配的向量
public List<SearchResult> search(String query) {
    String currentModelVersion = embeddingService.getCurrentModelVersion();
    String queryFilter = "embeddingModelVersion == \"" + currentModelVersion + "\"";

    return milvusClient.search(SearchParam.newBuilder()
        .withExpr(queryFilter)  // 过滤不同版本的向量
        ...
        .build());
}

坑三：分块策略变化导致语义断裂

分块太短（比如 128 字符），一个完整的业务流程被切成了十几个碎片，每个碎片单独查询时语义不完整，命中率很差。分块太长（超过 1000 字符），不同话题的内容混在一起，查询精度下降。

解决：根据内容类型使用不同的分块策略。结构化文档（政策、合同）用语义段落切分；FAQ 按问答对切分；新闻按段落切分，保持上下文重叠。

七、小结

向量数据的生命周期管理是一个容易被忽视但影响深远的工程问题。搭好了增量同步、索引维护、过期淘汰这三套机制，知识库才算真正进入了良性运转的状态。

特别是过期淘汰，别等到存储爆了再去想办法。从一开始就给每类内容定好生命周期策略，早规划早受益。