Post 1667. Engineering Document Preprocessing: Chunking Strategies, Cleaning Rules, and Metadata Design

Lao Zhang · 2026/4/30 · about 13 min read

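Many teams that run into trouble with a RAG system reach first for a better model, different retrieval parameters, or a reworked prompt. In my experience, though, a good share of "poor retrieval quality" problems start further upstream: the document preprocessing was done badly.

Garbage in, garbage out.

We once took over a client project whose RAG system had a retrieval hit rate of only 40%. After two days of analysis we found that the main cause was not the retrieval algorithm but the chunking strategy, which produced a large number of "semantic fragments": a complete code example had been cut into three chunks, none of which made sense on its own, so after embedding they naturally could not be found.

After we changed the chunking strategy, and without swapping any other component, the hit rate went from 40% to 68%.

Today's post covers document preprocessing systematically.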

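I. The core problem of chunking strategies

1.1 Why is chunking hard?

Ideally, every chunk should be:

Semantically complete: it contains one self-contained unit of knowledge, and both its beginning and its end make sense
Appropriately sized: not too long (LLM context is limited and retrieval precision drops) and not too short (too little context, weak embedding semantics)
Non-overlapping but connected: related content should be retrievable together

Real documents vary endlessly, though, and no single chunking strategy fits every case.

1.2 Comparison of common chunking strategies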

II. Implementing chunking strategies

2.1 Baseline: fixed-size chunking (do not use it as your only approach)

@Component
public class FixedSizeChunker implements DocumentChunker {
    
    private final int chunkSize;
    private final int overlap;
    
    public FixedSizeChunker(int chunkSize, int overlap) {
        this.chunkSize = chunkSize;
        this.overlap = overlap;
    }
    
    @Override
    public List<Chunk> chunk(String content, Map<String, Object> metadata) {
        List<Chunk> chunks = new ArrayList<>();
        int start = 0;
        int chunkIndex = 0;
        
        while (start < content.length()) {
            int end = Math.min(start + chunkSize, content.length());
            
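            // Prefer cutting at a sentence boundary so a sentence is not split mid-way
            if (end < content.length()) {
                // Never let the boundary search move the cut to or before the chunk start
                end = Math.max(findSentenceBoundary(content, end, 50), start + 1);
            }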
            
            String chunkContent = content.substring(start, end).trim();
            
            if (!chunkContent.isBlank()) {
                Map<String, Object> chunkMeta = new HashMap<>(metadata);
                chunkMeta.put("chunk_index", chunkIndex++);
                chunkMeta.put("chunk_start", start);
                chunkMeta.put("chunk_end", end);
                
                chunks.add(Chunk.builder()
                    .content(chunkContent)
                    .metadata(chunkMeta)
                    .build());
            }
            
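            // Stop at the end of the text; otherwise the tail would be re-emitted forever
            if (end >= content.length()) break;
            
            // Start of the next chunk (with overlap); always advance by at least one character
            start = Math.max(end - overlap, start + 1);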
        }
        
        return chunks;
    }
    
    /**
     * Look for a sentence boundary near the target position
     */
    private int findSentenceBoundary(String text, int targetPos, int searchRange) {
        String punctuations = "。！？.!?\n";
        
        // Scan backward from the target position for sentence-ending punctuation
        for (int i = targetPos; i >= Math.max(0, targetPos - searchRange); i--) {
            if (punctuations.indexOf(text.charAt(i)) >= 0) {
                return i + 1;
            }
        }
        
        return targetPos;  // No boundary found; fall back to the original position
    }
}
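
To see how chunk size and overlap interact, here is a minimal, self-contained sketch of just the windowing arithmetic. The class `OverlapWindows` and its method `windows` are hypothetical names introduced for illustration; they are not part of the chunker above, and sentence-boundary snapping is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: computes [start, end) windows
// over a text of the given length using a fixed size and overlap.
class OverlapWindows {

    static List<int[]> windows(int length, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<int[]> result = new ArrayList<>();
        int start = 0;
        while (start < length) {
            int end = Math.min(start + chunkSize, length);
            result.add(new int[] {start, end});
            if (end == length) break;   // last window reached, stop here
            start = end - overlap;      // step back by the overlap
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints 0..400, 350..750 and 700..1000 on separate lines
        for (int[] w : windows(1000, 400, 50)) {
            System.out.println(w[0] + ".." + w[1]);
        }
    }
}
```

The explicit `end == length` check is what guarantees termination: without it, stepping back by the overlap after the final window would keep producing the same tail.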

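2.2 Recursive semantic chunking (the recommended general-purpose approach)

This is the approach we use most. It splits along the document's natural structural hierarchy: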

@Component
public class RecursiveSemanticChunker implements DocumentChunker {
    
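    // Separators tried in order when splitting (from coarse to fine)
    private static final List<String> SPLIT_SEPARATORS = Arrays.asList(
        "\n\n\n",   // Multiple blank lines (between sections)
        "\n\n",     // Blank line (between paragraphs)
        "\n",       // Single line break
        "。",       // Chinese full stop
        "！",       // Chinese exclamation mark
        "？",       // Chinese question mark
        ". ",       // English full stop (followed by a space)
        " "         // Space (last resort)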
    );
    
    private final int maxChunkSize;
    private final int minChunkSize;
    private final int overlap;
    
    public RecursiveSemanticChunker(int maxChunkSize, int minChunkSize, int overlap) {
        this.maxChunkSize = maxChunkSize;
        this.minChunkSize = minChunkSize;
        this.overlap = overlap;
    }
    
    @Override
    public List<Chunk> chunk(String content, Map<String, Object> metadata) {
        List<String> rawChunks = splitRecursively(content, 0);
        
        // Merge chunks that are too small (oversized pieces were already split by splitRecursively)
        List<String> normalizedChunks = normalizeChunks(rawChunks);
        
        List<Chunk> result = new ArrayList<>();
        for (int i = 0; i < normalizedChunks.size(); i++) {
            Map<String, Object> chunkMeta = new HashMap<>(metadata);
            chunkMeta.put("chunk_index", i);
            chunkMeta.put("total_chunks", normalizedChunks.size());
            
            result.add(Chunk.builder()
                .content(normalizedChunks.get(i))
                .metadata(chunkMeta)
                .build());
        }
        
        return result;
    }
    
    /**
     * Recursive split.
     * Split with the coarsest separator first; if a piece is still too large, split it again with a finer one.
     */
    private List<String> splitRecursively(String text, int separatorIndex) {
        if (text.length() <= maxChunkSize) {
            return Collections.singletonList(text);
        }
        
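        if (separatorIndex >= SPLIT_SEPARATORS.size()) {
            // Every separator has been tried; force a split by size
            return forceChunk(text);
        }
        
        String separator = SPLIT_SEPARATORS.get(separatorIndex);
        // Split after each separator (lookbehind) so punctuation stays attached to its sentence
        String[] parts = text.split("(?<=" + Pattern.quote(separator) + ")", -1);
        
        if (parts.length <= 1) {
            // The current separator does not occur; try the next one
            return splitRecursively(text, separatorIndex + 1);
        }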
        
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) continue;
            
            if (trimmed.length() > maxChunkSize) {
                // Still too large; keep splitting with a finer separator
                result.addAll(splitRecursively(trimmed, separatorIndex + 1));
            } else {
                result.add(trimmed);
            }
        }
        
        return result;
    }
    
    /**
     * Normalize: merge chunks that are too small; large chunks are left untouched
     */
    private List<String> normalizeChunks(List<String> chunks) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        
        for (String chunk : chunks) {
            if (current.length() + chunk.length() <= maxChunkSize) {
                if (current.length() > 0) current.append("\n\n");
                current.append(chunk);
            } else {
                if (current.length() >= minChunkSize) {
                    result.add(current.toString());
                    current = new StringBuilder(chunk);
                } else {
                    // Current chunk is too small: force-merge it with the next one (may exceed maxChunkSize)
                    if (current.length() > 0) current.append("\n\n");
                    current.append(chunk);
                    if (current.length() >= minChunkSize) {
                        result.add(current.toString());
                        current = new StringBuilder();
                    }
                }
            }
        }
        
        if (current.length() > 0) {
            result.add(current.toString());
        }
        
        return result;
    }
    
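    private List<String> forceChunk(String text) {
        List<String> result = new ArrayList<>();
        // Guard against overlap >= maxChunkSize, which would stall the loop
        int step = Math.max(1, maxChunkSize - overlap);
        for (int i = 0; i < text.length(); i += step) {
            int end = Math.min(i + maxChunkSize, text.length());
            result.add(text.substring(i, end));
            if (end == text.length()) break;  // Tail reached; skip a redundant trailing chunk
        }
        return result;
    }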
}
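
One detail worth checking in any recursive splitter is what happens to the separator itself. A plain `String.split` discards it, so sentence-ending punctuation disappears from the resulting pieces. The sketch below uses a hypothetical helper, `splitKeeping`, to show a lookbehind-based split that keeps the separator attached to the preceding piece. The example uses English text, but the same call works with "。" as the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class KeepSeparatorSplit {

    // Splits after every occurrence of the separator, keeping the separator
    // at the end of the preceding piece; blank pieces are dropped.
    static List<String> splitKeeping(String text, String separator) {
        // Zero-width lookbehind: match the position right after the separator
        String regex = "(?<=" + Pattern.quote(separator) + ")";
        List<String> parts = new ArrayList<>();
        for (String part : text.split(regex)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [First sentence., Second sentence., Third]
        System.out.println(splitKeeping("First sentence. Second sentence. Third", ". "));
    }
}
```

Because the separator is part of the lookbehind rather than the match, nothing is consumed by the split; trimming afterwards removes only the whitespace.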

2.3 Code-aware chunking (Code-Aware Chunker)

Code documentation is a special case: functions and classes are the natural chunking units.

@Component
public class CodeAwareChunker implements DocumentChunker {
    
    // Pattern that identifies fenced code blocks
    private static final Pattern CODE_BLOCK_PATTERN = 
        Pattern.compile("```[\\s\\S]*?```", Pattern.DOTALL);
    
    private static final Pattern JAVA_METHOD_PATTERN = 
        Pattern.compile("((?:public|private|protected|static|final|abstract)\\s+)" +
                       "(?:[\\w<>\\[\\]]+\\s+)" +
                       "(\\w+)\\s*\\([^)]*\\)\\s*\\{", Pattern.MULTILINE);
    
    @Override
    public List<Chunk> chunk(String content, Map<String, Object> metadata) {
        // Decide whether this is a code document
        String docType = (String) metadata.getOrDefault("doc_type", "");
        
        if ("code".equals(docType) || content.contains("```")) {
            return chunkCodeDocument(content, metadata);
        }
        
        // Not a code document: use the general-purpose chunker
        return new RecursiveSemanticChunker(800, 100, 100)
            .chunk(content, metadata);
    }
    
    private List<Chunk> chunkCodeDocument(String content, Map<String, Object> metadata) {
        List<Chunk> chunks = new ArrayList<>();
        
        // Find all code blocks
        Matcher codeMatcher = CODE_BLOCK_PATTERN.matcher(content);
        int lastEnd = 0;
        int chunkIndex = 0;
        
        while (codeMatcher.find()) {
            // Text before the code block
            String textBefore = content.substring(lastEnd, codeMatcher.start()).trim();
            if (!textBefore.isBlank()) {
                Map<String, Object> textMeta = new HashMap<>(metadata);
                textMeta.put("chunk_type", "text");
                textMeta.put("chunk_index", chunkIndex++);
                
                chunks.add(Chunk.builder()
                    .content(textBefore)
                    .metadata(textMeta)
                    .build());
            }
            
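            // The code block itself (kept intact)
            String codeBlock = codeMatcher.group();
            
            // If the code block is too long, try to split it by function/method
            if (codeBlock.length() > 2000) {
                List<Chunk> parts = splitLargeCodeBlock(codeBlock, metadata, chunkIndex);
                chunks.addAll(parts);
                chunkIndex += parts.size();  // Advance by the number of chunks actually produced
            } else {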
                Map<String, Object> codeMeta = new HashMap<>(metadata);
                codeMeta.put("chunk_type", "code");
                codeMeta.put("chunk_index", chunkIndex++);
                
                chunks.add(Chunk.builder()
                    .content(codeBlock)
                    .metadata(codeMeta)
                    .build());
            }
            
            lastEnd = codeMatcher.end();
        }
        
        // Any text remaining after the last code block
        String remaining = content.substring(lastEnd).trim();
        if (!remaining.isBlank()) {
            Map<String, Object> textMeta = new HashMap<>(metadata);
            textMeta.put("chunk_type", "text");
            textMeta.put("chunk_index", chunkIndex);
            
            chunks.add(Chunk.builder()
                .content(remaining)
                .metadata(textMeta)
                .build());
        }
        
        return chunks;
    }
    
    private List<Chunk> splitLargeCodeBlock(String codeBlock, 
                                             Map<String, Object> metadata, 
                                             int startIndex) {
        // Split by function/method (simplified implementation)
        String[] lines = codeBlock.split("\n");
        List<Chunk> result = new ArrayList<>();
        StringBuilder currentFunc = new StringBuilder();
        int braceDepth = 0;
        int chunkIndex = startIndex;
        
        for (String line : lines) {
            currentFunc.append(line).append("\n");
            braceDepth += countChar(line, '{') - countChar(line, '}');
            
            // A top-level method or class has ended
            if (braceDepth == 0 && currentFunc.length() > 100) {
                Map<String, Object> codeMeta = new HashMap<>(metadata);
                codeMeta.put("chunk_type", "code_function");
                codeMeta.put("chunk_index", chunkIndex++);
                
                result.add(Chunk.builder()
                    .content(currentFunc.toString().trim())
                    .metadata(codeMeta)
                    .build());
                
                currentFunc = new StringBuilder();
            }
        }
        
        if (currentFunc.length() > 0) {
            Map<String, Object> codeMeta = new HashMap<>(metadata);
            codeMeta.put("chunk_type", "code_function");
            codeMeta.put("chunk_index", chunkIndex);
            
            result.add(Chunk.builder()
                .content(currentFunc.toString().trim())
                .metadata(codeMeta)
                .build());
        }
        
        return result;
    }
    
    private int countChar(String s, char c) {
        int count = 0;
        for (char ch : s.toCharArray()) if (ch == c) count++;
        return count;
    }
}
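
The brace-depth idea behind splitting a large code block can be tried in isolation. The following is a small sketch under stated assumptions: `BraceDepthSplitter` and `topLevelUnits` are hypothetical names, and it flushes a unit once at least one brace has been opened and the depth returns to zero, which is a variation on the length check used above. Like any character-counting approach, it miscounts braces that appear inside string literals or comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class BraceDepthSplitter {

    // Groups source lines into top-level units by tracking curly-brace depth.
    static List<String> topLevelUnits(String source) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean sawBrace = false;

        for (String line : source.split("\n")) {
            current.append(line).append("\n");
            for (char ch : line.toCharArray()) {
                if (ch == '{') {
                    depth++;
                    sawBrace = true;
                } else if (ch == '}') {
                    depth--;
                }
            }
            // A unit ends when an opened block has been fully closed again
            if (sawBrace && depth == 0) {
                units.add(current.toString().trim());
                current = new StringBuilder();
                sawBrace = false;
            }
        }

        // Keep any trailing lines that never closed a block
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            units.add(rest);
        }
        return units;
    }

    public static void main(String[] args) {
        String src = "void a() {\n  x();\n}\n\nvoid b() {\n  if (y) {\n    z();\n  }\n}\n";
        // Prints 2
        System.out.println(topLevelUnits(src).size());
    }
}
```

For production use, a real parser for the target language is the safer choice; the sketch is only meant to make the depth-tracking logic concrete.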

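III. Document cleaning rules

Before chunking, the raw documents need to be cleaned. This step is often overlooked, yet it has a large effect on embedding quality.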

@Service
public class DocumentCleaningPipeline {
    
    private final List<CleaningRule> rules;
    
    public DocumentCleaningPipeline() {
        this.rules = Arrays.asList(
            new RemoveHeaderFooterRule(),
            new NormalizeWhitespaceRule(),
            new RemoveSpecialCharsRule(),
            new FixEncodingRule(),
            new RemoveDuplicateLinesRule(),
            new NormalizeNumbersRule()
        );
    }
    
    public CleanedDocument clean(RawDocument rawDoc) {
        String content = rawDoc.getContent();
        List<String> appliedRules = new ArrayList<>();
        
        for (CleaningRule rule : rules) {
            String before = content;
            content = rule.apply(content, rawDoc.getMetadata());
            
            if (!content.equals(before)) {
                appliedRules.add(rule.getName());
            }
        }
        
        return CleanedDocument.builder()
            .content(content)
            .originalLength(rawDoc.getContent().length())
            .cleanedLength(content.length())
            .appliedRules(appliedRules)
            .metadata(rawDoc.getMetadata())
            .build();
    }
}

/**
 * Rule 1: remove headers and footers (common noise after PDF conversion)
 */
class RemoveHeaderFooterRule implements CleaningRule {
    
    // Common header/footer patterns
    private static final List<Pattern> HEADER_FOOTER_PATTERNS = Arrays.asList(
        Pattern.compile("^第\\s*\\d+\\s*页.*$", Pattern.MULTILINE),
        Pattern.compile("^Page\\s+\\d+.*$", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE),
        Pattern.compile("^.*版权所有.*$", Pattern.MULTILINE),
        Pattern.compile("^.*All Rights Reserved.*$", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE),
        Pattern.compile("^.{1,50}\\s*\\|\\s*.{1,50}\\s*\\|\\s*\\d{4}.*$", Pattern.MULTILINE)
    );
    
    @Override
    public String apply(String content, Map<String, Object> metadata) {
        for (Pattern pattern : HEADER_FOOTER_PATTERNS) {
            content = pattern.matcher(content).replaceAll("");
        }
        return content;
    }
    
    @Override
    public String getName() { return "RemoveHeaderFooter"; }
}

/**
 * Rule 2: normalize whitespace
 */
class NormalizeWhitespaceRule implements CleaningRule {
    
    @Override
    public String apply(String content, Map<String, Object> metadata) {
        return content
            .replaceAll("\\r\\n", "\n")             // Windows line endings -> Unix (must run before the newline collapse)
            .replaceAll("[ \\t]+", " ")           // Runs of spaces/tabs -> a single space
            .replaceAll("\\n{3,}", "\n\n")          // Three or more newlines -> two
            .trim();
    }
    
    @Override
    public String getName() { return "NormalizeWhitespace"; }
}

/**
 * Rule 3: remove meaningless special characters
 */
class RemoveSpecialCharsRule implements CleaningRule {
    
    @Override
    public String apply(String content, Map<String, Object> metadata) {
        return content
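            // Remove control characters (newline, carriage return, and tab are kept)
            .replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "")
            // Remove zero-width characters
            .replaceAll("[\\u200B-\\u200D\\uFEFF]", "")
            // Normalize curly quotes and CJK corner brackets to straight quotes
            .replaceAll("[\\u201C\\u201D「」]", "\"")
            .replaceAll("[\\u2018\\u2019『』]", "'");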
    }
    
    @Override
    public String getName() { return "RemoveSpecialChars"; }
}

/**
 * Rule 4: remove duplicate lines (some format conversions produce them)
 */
class RemoveDuplicateLinesRule implements CleaningRule {
    
    @Override
    public String apply(String content, Map<String, Object> metadata) {
        String[] lines = content.split("\n");
        List<String> dedupedLines = new ArrayList<>();
        String prevLine = null;
        
        for (String line : lines) {
            // Drop consecutive duplicate lines, but allow repeated blank lines (paragraph separators)
            if (!line.trim().equals(prevLine != null ? prevLine.trim() : null) ||
                line.trim().isEmpty()) {
                dedupedLines.add(line);
            }
            prevLine = line;
        }
        
        return String.join("\n", dedupedLines);
    }
    
    @Override
    public String getName() { return "RemoveDuplicateLines"; }
}
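
The order of the replacements inside a whitespace rule matters more than it looks. If Windows line endings are still present when blank lines are collapsed, the `\n{3,}` pattern never matches, because every `\n` is separated from the next by a `\r`. The sketch below compares both orders; `WhitespaceOrderDemo` and its two method names are hypothetical and exist only for this comparison.

```java
// Hypothetical demo class, for illustration only.
class WhitespaceOrderDemo {

    // Normalizes CRLF first, then collapses runs of three or more newlines.
    static String crlfFirst(String s) {
        return s.replaceAll("\\r\\n", "\n")
                .replaceAll("\\n{3,}", "\n\n");
    }

    // Collapses first, then normalizes CRLF: the collapse sees "\r\n\r\n..."
    // and finds no run of consecutive "\n" characters to shorten.
    static String collapseFirst(String s) {
        return s.replaceAll("\\n{3,}", "\n\n")
                .replaceAll("\\r\\n", "\n");
    }

    public static void main(String[] args) {
        String input = "a\r\n\r\n\r\n\r\nb";
        // Prints 4: "a", two newlines, "b"
        System.out.println(crlfFirst(input).length());
        // Prints 6: "a", four newlines, "b"
        System.out.println(collapseFirst(input).length());
    }
}
```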

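IV. Metadata design: a multiplier for retrieval quality

Metadata is a weapon that many people do not fully use in a RAG system. Good metadata design can:

Support filtered retrieval (search only within a given department's documents)
Support freshness control (prefer the newest documents)
Support source-authority weighting (official documents outweigh wiki pages)
Support explainable results (tell users where an answer came from)

4.1 Metadata schema design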

@Data
@Builder
public class DocumentMetadata {
    
    // ===== Source =====
    private String docId;           // Unique document ID
    private String sourceUrl;       // Original URL or file path
    private String sourceType;      // pdf/word/markdown/confluence/jira/...
    private String sourceDomain;    // Owning system/platform
    
    // ===== Content attributes =====
    private String title;           // Document title
    private String docType;         // technical doc / policy / meeting minutes / FAQ / ...
    private String language;        // zh/en
    private List<String> tags;      // Business tags
    private String category;        // Top-level category
    private String subCategory;     // Second-level category
    
    // ===== Time =====
    private LocalDateTime createTime;       // Creation time
    private LocalDateTime updateTime;       // Last update time
    private LocalDateTime indexTime;        // Indexing time
    private LocalDate effectiveFrom;        // Effective date (for policy documents)
    private LocalDate expiresAt;            // Expiry date
    
    // ===== Authority =====
    private String authorDept;      // Author's department
    private String authorName;      // Author
    private int authorityLevel;     // Authority level: 1 = official, 2 = department, 3 = personal
    
    // ===== Chunking =====
    private int chunkIndex;         // Position of this chunk within the document
    private int totalChunks;        // Total number of chunks in the document
    private String prevChunkId;     // ID of the previous chunk (for stitching context)
    private String nextChunkId;     // ID of the next chunk
    private String chunkType;       // text/code/table/image_caption
    
    // ===== Quality =====
    private double qualityScore;    // Document quality score (0-1)
    private boolean isVerified;     // Whether it has been manually verified
    private int viewCount;          // View count (popularity signal)
    private double userRating;      // User rating
    
    // ===== Access control =====
    private List<String> allowedDepts;    // Departments allowed access (null means visible to everyone)
    private String securityLevel;          // public / internal / confidential
}
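
Most vector stores filter on flat key/value pairs rather than on typed objects, so a schema like this usually has to be flattened before indexing. The sketch below shows one way that could look. `MetadataFlattener` is a hypothetical helper, the snake_case keys mirror the ones used in the filter code in 4.2, and storing dates as epoch days is an assumption chosen so that range comparisons work on plain numbers. Whether a missing key is treated as null by your store is something to verify.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class MetadataFlattener {

    // Builds a flat map of scalar values suitable for metadata filtering.
    static Map<String, Object> flatten(String docType, int authorityLevel,
                                       LocalDate updateDate, LocalDate expiresAt) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("doc_type", docType);
        flat.put("authority_level", authorityLevel);
        // Dates become epoch days so that gte/gt filters compare numbers
        flat.put("update_time", updateDate.toEpochDay());
        // Leave the key out entirely when the document never expires
        if (expiresAt != null) {
            flat.put("expires_at", expiresAt.toEpochDay());
        }
        return flat;
    }

    public static void main(String[] args) {
        // Prints {doc_type=faq, authority_level=1, update_time=10}
        System.out.println(flatten("faq", 1, LocalDate.of(1970, 1, 11), null));
    }
}
```

Whatever representation is chosen for dates, the indexing side and the query side need to agree on it.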

4.2 Metadata-based filtered retrieval

@Service
public class MetadataFilteredSearchService {
    
    @Autowired
    private VectorStore vectorStore;
    
    /**
     * Retrieval with metadata filtering
     */
    public List<Document> searchWithFilters(String query, SearchFilters filters) {
        // Build the filter conditions
        FilterExpression filter = buildFilterExpression(filters);
        
        SearchRequest request = SearchRequest.query(query)
            .withTopK(filters.getTopK())
            .withSimilarityThreshold(filters.getMinSimilarity())
            .withFilterExpression(filter);
        
        List<Document> results = vectorStore.similaritySearch(request);
        
        // Apply time decay: newer documents rank higher
        if (filters.isTimeDecayEnabled()) {
            results = applyTimeDecay(results);
        }
        
        // Apply authority weighting
        if (filters.isAuthorityWeightEnabled()) {
            results = applyAuthorityWeight(results);
        }
        
        return results;
    }
    
    private FilterExpression buildFilterExpression(SearchFilters filters) {
        List<FilterExpression> conditions = new ArrayList<>();
        
        // Department filter
        if (filters.getDeptFilter() != null) {
            conditions.add(FilterExpression.eq("category", filters.getDeptFilter()));
        }
        
        // Document type filter
        if (filters.getDocTypeFilter() != null) {
            conditions.add(FilterExpression.in("doc_type", filters.getDocTypeFilter()));
        }
        
        // Time range filter
        if (filters.getUpdateAfter() != null) {
            conditions.add(FilterExpression.gte("update_time", 
                filters.getUpdateAfter().toEpochDay()));
        }
        
        // Exclude expired documents
        conditions.add(FilterExpression.or(
            FilterExpression.isNull("expires_at"),
            FilterExpression.gt("expires_at", LocalDate.now().toEpochDay())
        ));
        
        // Access control filter
        if (filters.getUserDept() != null) {
            conditions.add(FilterExpression.or(
                FilterExpression.isNull("allowed_depts"),
                FilterExpression.contains("allowed_depts", filters.getUserDept())
            ));
        }
        
        if (conditions.isEmpty()) return null;
        return conditions.stream().reduce(FilterExpression::and).orElse(null);
    }
    
    /**
     * Time decay: newer documents score higher
     */
    private List<Document> applyTimeDecay(List<Document> docs) {
        LocalDateTime now = LocalDateTime.now();
        
        return docs.stream()
            .map(doc -> {
                LocalDateTime updateTime = parseTime(doc.getMetadata().get("update_time"));
                if (updateTime == null) return doc;
                
                // Clamp at zero so a timestamp in the future cannot boost the score
                long daysSinceUpdate = Math.max(0L, ChronoUnit.DAYS.between(updateTime, now));
                double timeDecayFactor = Math.exp(-0.005 * daysSinceUpdate);  // Exponential decay
                
                double adjustedScore = doc.getScore() * timeDecayFactor;
                return doc.withScore(adjustedScore);
            })
            .sorted(Comparator.comparingDouble(Document::getScore).reversed())
            .collect(Collectors.toList());
    }
    
    /**
     * Authority weighting: official documents get a score bonus
     */
    private List<Document> applyAuthorityWeight(List<Document> docs) {
        return docs.stream()
            .map(doc -> {
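                // Metadata values may come back as Integer, Long, or Double depending on the store
                Object raw = doc.getMetadata().getOrDefault("authority_level", 2);
                int authorityLevel = raw instanceof Number ? ((Number) raw).intValue() : 2;
                double authorityBonus = switch (authorityLevel) {
                    case 1 -> 0.15;   // Official documents: +15%
                    case 2 -> 0.0;    // Department documents: unchanged
                    case 3 -> -0.1;   // Personal documents: -10%
                    default -> 0.0;
                };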
                return doc.withScore(doc.getScore() * (1 + authorityBonus));
            })
            .sorted(Comparator.comparingDouble(Document::getScore).reversed())
            .collect(Collectors.toList());
    }
    
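    private LocalDateTime parseTime(Object timeObj) {
        if (timeObj instanceof LocalDateTime) return (LocalDateTime) timeObj;
        // The range filter above compares update_time as an epoch day, so accept that form too
        if (timeObj instanceof Number) {
            return LocalDate.ofEpochDay(((Number) timeObj).longValue()).atStartOfDay();
        }
        if (timeObj instanceof String) {
            try {
                return LocalDateTime.parse((String) timeObj);
            } catch (Exception e) { return null; }
        }
        return null;
    }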
}
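
It helps to put numbers on the decay constant before shipping it. With an exponential decay of the form exp(-λ·days), the score halves every ln 2 / λ days, which for λ = 0.005 is roughly 139 days, and a document untouched for a year keeps only about 16% of its similarity score. The sketch below, with the hypothetical class `TimeDecayDemo`, makes it easy to try other values of λ.

```java
// Hypothetical demo class, for illustration only.
class TimeDecayDemo {

    // Multiplicative decay factor for a document of the given age.
    static double decayFactor(long daysSinceUpdate, double lambda) {
        // Negative ages (timestamps in the future) are treated as age zero
        return Math.exp(-lambda * Math.max(0L, daysSinceUpdate));
    }

    // Number of days after which the factor drops to 0.5.
    static double halfLifeDays(double lambda) {
        return Math.log(2) / lambda;
    }

    public static void main(String[] args) {
        double lambda = 0.005;
        System.out.printf("half-life: %.1f days%n", halfLifeDays(lambda));
        for (long days : new long[] {0, 30, 139, 365}) {
            System.out.printf("%d days -> factor %.3f%n", days, decayFactor(days, lambda));
        }
    }
}
```

If a year-old policy document should still be competitive in your corpus, a smaller λ or a floor on the factor is probably needed; that is a tuning decision, not something the formula settles.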

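V. The parent-child chunking strategy (Parent-Child Chunking)

This is one of the best chunking schemes I know of at the moment. It resolves the tension between "small chunks retrieve well" and "large chunks generate well":

Child chunk: fine-grained (200-400 characters), used for vector retrieval to keep retrieval precise
Parent chunk: coarse-grained (1000-2000 characters), used as generation context to keep the semantics complete

Retrieve with child chunks; at generation time, swap in the corresponding parent chunks.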

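@Slf4j  // Lombok logger, required by the log.debug call in this class
@Service
public class ParentChildChunkingService {
    
    // Assumes two RecursiveSemanticChunker beans, named parentChunker and childChunker,
    // are defined elsewhere with different size settings
    @Autowired
    private RecursiveSemanticChunker parentChunker;
    
    @Autowired
    private RecursiveSemanticChunker childChunker;
    
    @Autowired
    private VectorStore vectorStore;
    
    @Autowired
    private DocumentStore documentStore;  // Holds the full parent chunks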
    
    /**
     * Build the parent-child chunk index
     */
    public void indexWithParentChild(Document rawDoc) {
        // Create the parent chunks (coarse-grained)
        List<Chunk> parentChunks = parentChunker.chunk(
            rawDoc.getContent(), rawDoc.getMetadata()
        );
        
        for (Chunk parentChunk : parentChunks) {
            String parentId = generateId(rawDoc.getId(), parentChunk.getIndex());
            
            // Store the parent chunk (not in the vector store, only in the document store)
            documentStore.save(parentId, parentChunk);
            
            // Create child chunks inside the parent chunk
            Map<String, Object> childMeta = new HashMap<>(parentChunk.getMetadata());
            childMeta.put("parent_chunk_id", parentId);
            
            List<Chunk> childChunks = childChunker.chunk(
                parentChunk.getContent(), childMeta
            );
            
            // Child chunks go into the vector store (used for retrieval)
            for (Chunk childChunk : childChunks) {
                vectorStore.add(Document.builder()
                    .content(childChunk.getContent())
                    .metadata(childChunk.getMetadata())
                    .build());
            }
            
            log.debug("Parent chunk {} produced {} child chunks", parentId, childChunks.size());
        }
    }
    
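    /**
     * At query time: search with child chunks, return parent chunks
     */
    public List<Document> searchWithParentRetrieval(String query, int topK) {
        // 1. Run vector retrieval over the child chunks
        List<Document> childResults = vectorStore.similaritySearch(query, topK * 2);
        
        // 2. Deduplicate by parent ID, keeping one child per parent (the highest-scoring one)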
        Map<String, Document> parentIdToChildMap = new LinkedHashMap<>();
        
        for (Document childDoc : childResults) {
            String parentId = (String) childDoc.getMetadata().get("parent_chunk_id");
            if (parentId == null) {
                // No parent chunk: use the child chunk directly
                parentIdToChildMap.put(childDoc.getId(), childDoc);
            } else {
                parentIdToChildMap.putIfAbsent(parentId, childDoc);
            }
        }
        
        // 3. Take the top-K parents and load their content
        return parentIdToChildMap.entrySet().stream()
            .limit(topK)
            .map(entry -> {
                String parentId = entry.getKey();
                Document childDoc = entry.getValue();
                
                // Load the parent chunk from the document store
                Optional<Chunk> parentChunk = documentStore.load(parentId);
                
                if (parentChunk.isPresent()) {
                    // Return the parent's content but keep the child's similarity score
                    return Document.builder()
                        .id(parentId)
                        .content(parentChunk.get().getContent())
                        .metadata(parentChunk.get().getMetadata())
                        .score(childDoc.getScore())
                        .build();
                } else {
                    return childDoc;  // Parent not found: fall back to the child chunk
                }
            })
            .collect(Collectors.toList());
    }
    
    private String generateId(String docId, int chunkIndex) {
        return docId + "_chunk_" + chunkIndex;
    }
}
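
The deduplication step is the part of parent retrieval that is easiest to get subtly wrong, so here is a stripped-down sketch of just that logic using in-memory data. `ParentDedupDemo` and `topParents` are hypothetical names, and the input is assumed to be already sorted by descending similarity, which is what lets "first seen" stand in for "highest scoring".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, for illustration only.
class ParentDedupDemo {

    // rankedHits: each entry is {childId, parentId}, best match first.
    // Returns up to topK distinct parent IDs in ranking order.
    static List<String> topParents(List<String[]> rankedHits, int topK) {
        // LinkedHashMap preserves the order in which parents are first seen
        Map<String, String> bestChildByParent = new LinkedHashMap<>();
        for (String[] hit : rankedHits) {
            bestChildByParent.putIfAbsent(hit[1], hit[0]);
        }
        List<String> parents = new ArrayList<>(bestChildByParent.keySet());
        return new ArrayList<>(parents.subList(0, Math.min(topK, parents.size())));
    }

    public static void main(String[] args) {
        List<String[]> hits = new ArrayList<>();
        hits.add(new String[] {"c1", "P1"});
        hits.add(new String[] {"c2", "P1"});
        hits.add(new String[] {"c3", "P2"});
        hits.add(new String[] {"c4", "P3"});
        hits.add(new String[] {"c5", "P2"});
        // Prints [P1, P2]
        System.out.println(topParents(hits, 2));
    }
}
```

Over-fetching children (for example twice the requested number) matters here: several top hits often share one parent, so fetching exactly `topK` children can leave fewer than `topK` parents after deduplication.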

VI. Monitoring preprocessing quality

Document preprocessing is not a one-off job; its quality needs continuous monitoring:

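@Slf4j  // Lombok logger, required by the log.warn calls in this class
@Component
public class ChunkingQualityMonitor {
    
    @Autowired
    private MetricsService metrics;
    
    /**
     * Collect chunking quality metrics
     */
    public void recordChunkingMetrics(String docId, List<Chunk> chunks, 
                                       String originalContent) {
        // Nothing to measure; this also avoids dividing by zero below
        if (chunks.isEmpty() || originalContent.isEmpty()) return;
        
        // Chunk size distribution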
        DoubleSummaryStatistics sizeStats = chunks.stream()
            .mapToDouble(c -> c.getContent().length())
            .summaryStatistics();
        
        metrics.gauge("chunk_size_avg", sizeStats.getAverage());
        metrics.gauge("chunk_size_min", sizeStats.getMin());
        metrics.gauge("chunk_size_max", sizeStats.getMax());
        
        // Ratio of total chunk length to original length (above 1 when chunks overlap)
        double compressionRatio = (double) chunks.stream()
            .mapToInt(c -> c.getContent().length()).sum() / originalContent.length();
        metrics.gauge("chunking_ratio", compressionRatio);
        
        // Detect suspiciously small chunks (possibly a cleaning or chunking problem)
        long tinyChunks = chunks.stream()
            .filter(c -> c.getContent().length() < 50)
            .count();
        
        if (tinyChunks > chunks.size() * 0.1) {
            log.warn("Document {} has many tiny chunks: {}/{}; check the chunking strategy",
                docId, tinyChunks, chunks.size());
            metrics.increment("suspicious_tiny_chunks");
        }
        
        // Detect oversized chunks (the chunker may not be working properly)
        long hugeChunks = chunks.stream()
            .filter(c -> c.getContent().length() > 3000)
            .count();
        
        if (hugeChunks > 0) {
            log.warn("Document {} has oversized chunks: {} exceed 3000 characters", docId, hugeChunks);
            metrics.increment("suspicious_huge_chunks");
        }
    }
}
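
The tiny-chunk check can also run as a plain function, for example in a unit test over a sample corpus before anything is indexed. The sketch below is one way to do that; `ChunkSizeCheck` and `tooManyTinyChunks` are hypothetical names, and the 50-character and 10% thresholds in the usage simply mirror the values used in the monitor above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ChunkSizeCheck {

    // True when more than maxRatio of the chunks are shorter than tinyLimit.
    static boolean tooManyTinyChunks(List<String> chunks, int tinyLimit, double maxRatio) {
        if (chunks.isEmpty()) {
            return false;  // nothing to judge
        }
        long tiny = chunks.stream().filter(c -> c.length() < tinyLimit).count();
        return tiny > chunks.size() * maxRatio;
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            chunks.add("x".repeat(300));   // normal-sized chunks
        }
        chunks.add("x".repeat(10));        // two fragments
        chunks.add("x".repeat(20));
        // Prints true: 2 of 10 chunks are tiny, above the 10% threshold
        System.out.println(tooManyTinyChunks(chunks, 50, 0.1));
    }
}
```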

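VII. Lessons from practice

After this many projects, here are my core takeaways on document preprocessing:

There is no silver bullet for chunk size. A common rule of thumb: small chunks (300-500 characters) for question answering, large chunks (800-1500 characters) for summarization, and a two-level parent-child structure for technical documentation. In the end you still have to tune it against your own document types and query patterns.

Make cleaning rules configurable. Documents from different sources carry different noise patterns (headers and footers in PDFs, hidden characters in Word, leftover HTML tags from Confluence), so cleaning rules should be configurable per source type rather than one hardcoded rule set for everything.

Metadata is a badly underrated engineering asset. Many teams store only the content when indexing and leave out time, authority, department, and similar metadata; when they later want filtering, the whole index has to be rebuilt. Working out the metadata schema at design time is the cheapest option.

Parent-child chunking is worth the investment. It is somewhat more complex to implement than plain chunking, but the gain in retrieval quality is significant, especially for long documents. Our measured numbers: the same queries scored 67% accuracy on plain chunks and 75% with parent-child chunking, an 8-percentage-point gain from engineering work alone.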
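
To make the point about per-source cleaning rules concrete, here is a minimal sketch of a registry keyed by source type. Everything in it is an assumption for illustration: the class name `SourceAwareCleaner`, the use of `UnaryOperator<String>` as the rule type, and the two example patterns. In a real system the registrations would more likely be loaded from configuration than written in code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch, for illustration only.
class SourceAwareCleaner {

    private final Map<String, List<UnaryOperator<String>>> rulesBySource = new HashMap<>();

    // Adds a rule to the ordered rule list of one source type.
    void register(String sourceType, UnaryOperator<String> rule) {
        rulesBySource.computeIfAbsent(sourceType, k -> new ArrayList<>()).add(rule);
    }

    // Applies only the rules registered for this source type, in order.
    String clean(String sourceType, String content) {
        List<UnaryOperator<String>> rules =
            rulesBySource.getOrDefault(sourceType, Collections.emptyList());
        for (UnaryOperator<String> rule : rules) {
            content = rule.apply(content);
        }
        return content;
    }

    public static void main(String[] args) {
        SourceAwareCleaner cleaner = new SourceAwareCleaner();
        // Example rules; the patterns are illustrative, not exhaustive
        cleaner.register("pdf", s -> s.replaceAll("(?m)^Page\\s+\\d+.*$", ""));
        cleaner.register("confluence", s -> s.replaceAll("<[^>]+>", ""));

        // Prints Hello
        System.out.println(cleaner.clean("confluence", "<p>Hello</p>"));
        // Prints <b>x</b> because no rules are registered for "word"
        System.out.println(cleaner.clean("word", "<b>x</b>"));
    }
}
```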

The next post compares vector database options: where the mainstream databases actually differ, and which metrics to look at when choosing one.