Article 1809: Multimodal RAG: Index Design and Retrieval Fusion for Mixed Image-Text Retrieval

Lao Zhang · about 13 min read

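Why Text-Only RAG Is Not Enough

Anyone who has built a RAG system knows the most basic version: chunk the text -> embed the chunks -> store them in a vector database -> retrieve by cosine similarity -> concatenate the context -> ask the LLM.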
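The retrieval step of this baseline can be sketched in a few lines. This is a brute-force illustration only, assuming the embeddings are plain `float[]` arrays held in memory; the class and method names (`CosineTopK`, `cosine`, `topK`) are made up for this example, and a real vector database would use an approximate index instead of a full scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CosineTopK {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indices of the k stored vectors most similar to the query, best match first. */
    static List<Integer> topK(float[] query, List<float[]> stored, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            indices.add(i);
        }
        // Negate the similarity so that ascending sort puts the best match first
        indices.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, stored.get(i))));
        return indices.subList(0, Math.min(k, indices.size()));
    }
}
```

Nothing in this step looks at pixels: a chunk whose text only says "as shown in the figure" is embedded as exactly that sentence.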

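Once this pipeline is running, many people assume RAG is basically done.

Then they run into problems like this: "Our product manual is full of flowcharts, and when users ask about a process, RAG cannot find the answer at all."

Or: "In the reports users upload, all the data lives in charts. The text only says 'as shown in the figure', so RAG has no idea what the figure contains."

This is the problem multimodal RAG sets out to solve: when the knowledge base contains figures, tables, screenshots, and other visual content, how do you build the index, how do you retrieve, and how do you put the retrieved content to use?

Core Challenges of Multimodal RAG

Compared with text-only RAG, multimodal RAG brings several additional technical challenges, and each of them needs dedicated engineering design.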

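Index Design: Three Strategies

Strategy 1: Text Substitution (simplest, limited effectiveness)

Convert the image content into a text description and put it into the text index together with the ordinary text: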

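import java.util.Map;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ImageToTextIndexer {
    
    @Autowired
    private VisionModelClient visionClient;
    
    /**
     * Converts an image into a text description that can go into the text index.
     * Suitable when the image content can be fully expressed in words.
     * Limitation: details may be lost, especially exact numbers and complex charts.
     */
    public TextChunk imageToIndexableText(ImageContent image, String surroundingContext) {
        String prompt = String.format("""
            Describe the content of this image in detail. The description will be used to build a search index.
            
            Requirements:
            1. Be thorough: include all visible text, numbers, and key content
            2. For a chart, describe the chart type, axis labels, and key data points
            3. For a flowchart or architecture diagram, describe the nodes and how they connect
            4. For a screenshot, describe what is displayed
            5. Use declarative language; avoid lead-ins such as "the image shows"
            
            Context (text near the image): %s
            
            Output only the description, with no prefix.
            """, surroundingContext);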
        
        String description = visionClient.analyze(image.getImageData(), prompt);
        
        return TextChunk.builder()
            .text(description)
            .sourceType("IMAGE_DESCRIPTION")
            .sourceId(image.getImageId())
            .metadata(Map.of(
                "original_image_id", image.getImageId(),
                "image_page", String.valueOf(image.getPageNumber()),
                "content_type", image.getContentType().name()
            ))
            .build();
    }
}

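Strategy 2: Dual-Channel Indexing (recommended)

Text and images are indexed separately, and both channels are searched in parallel at query time: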

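import java.util.Map;
import java.util.UUID;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class DualChannelIndexer {
    
    @Autowired
    private EmbeddingService textEmbeddingService;
    
    @Autowired
    private ClipEmbeddingService clipService; // CLIP model with cross-modal image-text embeddings
    
    @Autowired
    private VisionModelClient visionClient; // used by generateImageDescription
    
    // Two beans of the same type: Spring matches them by field name unless @Qualifier is used
    @Autowired
    private VectorDatabase textVectorDB;
    
    @Autowired
    private VectorDatabase imageVectorDB;
    
    /**
     * Builds the dual-channel index.
     * Channel 1: text content -> text vectors
     * Channel 2: image content -> CLIP vectors (so images can be found with text queries)
     */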
    public IndexingResult indexDocument(ProcessedDocument document) {
        int textChunksIndexed = 0;
        int imageChunksIndexed = 0;
        
        // Index the text chunks
        for (TextChunk chunk : document.getTextChunks()) {
            float[] textVector = textEmbeddingService.embed(chunk.getText());
            
            textVectorDB.insert(VectorEntry.builder()
                .id(UUID.randomUUID().toString())
                .vector(textVector)
                .metadata(Map.of(
                    "doc_id", document.getDocumentId(),
                    "chunk_id", chunk.getChunkId(),
                    "content_type", "TEXT",
                    "text", chunk.getText(),
                    "page", String.valueOf(chunk.getPageNumber())
                ))
                .build());
            
            textChunksIndexed++;
        }
        
        // Index the images (two entries per image: a CLIP vector and a description text vector)
        for (ImageContent image : document.getImages()) {
            // CLIP vector: lets a text query find the image directly
            float[] clipVector = clipService.encodeImage(image.getImageData());
            
            // Also generate a description and embed it as a supplementary text-side index
            String imageDescription = generateImageDescription(image, document);
            float[] descriptionVector = textEmbeddingService.embed(imageDescription);
            
            // Store in the CLIP index
            imageVectorDB.insert(VectorEntry.builder()
                .id(UUID.randomUUID().toString())
                .vector(clipVector)
                .metadata(Map.of(
                    "doc_id", document.getDocumentId(),
                    "image_id", image.getImageId(),
                    "content_type", image.getContentType().name(),
                    "description", imageDescription,
                    "page", String.valueOf(image.getPageNumber()),
                    "storage_path", image.getStoragePath()
                ))
                .build());
            
            // Also put the image description into the text index (strengthens text retrieval)
            textVectorDB.insert(VectorEntry.builder()
                .id(UUID.randomUUID().toString())
                .vector(descriptionVector)
                .metadata(Map.of(
                    "doc_id", document.getDocumentId(),
                    "image_id", image.getImageId(),
                    "content_type", "IMAGE_DESCRIPTION",
                    "text", imageDescription,
                    "page", String.valueOf(image.getPageNumber()),
                    "storage_path", image.getStoragePath()
                ))
                .build());
            
            imageChunksIndexed++;
        }
        
        return IndexingResult.builder()
            .textChunksIndexed(textChunksIndexed)
            .imageChunksIndexed(imageChunksIndexed)
            .build();
    }
    
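    /**
     * Generates an image description suited for indexing.
     * Note: the description is optimized for retrieval, not for human readers.
     */
    private String generateImageDescription(ImageContent image, ProcessedDocument doc) {
        // Fetch the text near the image as context
        String nearbyText = doc.getNearbyText(image.getPageNumber(), 300);
        
        String prompt = String.format("""
            Generate a search-index description for the following image.
            
            Nearby text context:
            %s
            
            Requirements:
            1. Include the keywords someone might use to search for this image
            2. Cover chart type, data range, and main trends
            3. Include the text labels, axis names, and legend in the image
            4. For a flowchart, include the step names
            5. Keep the language concise and keyword-dense
            6. Under 200 words
            """, nearbyText);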
        
        return visionClient.analyze(image.getImageData(), prompt);
    }
}

Strategy 3: Context-Bound Indexing (high quality, high cost)

Index each image together with its surrounding text as a single unit:

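import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ContextBoundIndexer {
    
    @Autowired
    private VisionModelClient visionClient; // assumed dependency, needed by analyzeImage
    
    /**
     * Context-bound indexing: packs an image and its surrounding text into one semantic unit.
     * Benefit: retrieval returns the image-text combination directly, with full context.
     * Cost: building the index is more expensive.
     */
    public List<MultimodalChunk> buildContextBoundChunks(ProcessedDocument document) {
        List<MultimodalChunk> chunks = new ArrayList<>();
        
        for (ContentBlock block : document.getBlocks()) {
            if (block.getType() == BlockType.IMAGE) {
                // Fetch the text before and after this image
                String preContext = document.getTextBefore(block.getBlockId(), 200);
                String postContext = document.getTextAfter(block.getBlockId(), 200);
                
                // Analyze the image content
                ImageContent image = block.getImageContent();
                String imageAnalysis = analyzeImage(image);
                
                // Assemble the multimodal chunk
                String combinedText = String.format("""
                    [Text before figure]: %s
                    [Figure content]: %s
                    [Text after figure]: %s
                    """, preContext, imageAnalysis, postContext);
                
                chunks.add(MultimodalChunk.builder()
                    .chunkId(UUID.randomUUID().toString())
                    .textContent(combinedText)
                    .imageContent(image)
                    .hasImage(true)
                    .pageNumber(block.getPageNumber())
                    .chunkType("IMAGE_WITH_CONTEXT")
                    .build());
            }
        }
        
        return chunks;
    }
    
    // Assumed helper: delegates the image analysis to the vision model
    private String analyzeImage(ImageContent image) {
        return visionClient.analyze(image.getImageData(),
            "Describe this image for a search index: type, labels, key values, and steps.");
    }
}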
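The listing above relies on `getTextBefore` and `getTextAfter`, whose implementations are not shown. Here is a minimal sketch of what such helpers could do, assuming the document is an ordered list of blocks. The `Block` record and the `ContextWindow` class are hypothetical stand-ins, not the author's classes, and skipping over neighbouring images is a choice made for this example.

```java
import java.util.List;

final class ContextWindow {

    /** Hypothetical minimal block: image blocks carry a null text. */
    record Block(String id, String text) {
        boolean isImage() {
            return text == null;
        }
    }

    /** Up to maxChars characters of text immediately preceding the given block. */
    static String textBefore(List<Block> blocks, String blockId, int maxChars) {
        StringBuilder sb = new StringBuilder();
        int idx = indexOf(blocks, blockId);
        // Walk backwards, prepending text blocks until the window is full
        for (int i = idx - 1; i >= 0 && sb.length() < maxChars; i--) {
            Block b = blocks.get(i);
            if (!b.isImage()) {
                sb.insert(0, b.text());
            }
        }
        // Keep the characters closest to the image
        return sb.substring(Math.max(0, sb.length() - maxChars));
    }

    /** Up to maxChars characters of text immediately following the given block. */
    static String textAfter(List<Block> blocks, String blockId, int maxChars) {
        StringBuilder sb = new StringBuilder();
        int idx = indexOf(blocks, blockId);
        for (int i = idx + 1; i < blocks.size() && sb.length() < maxChars; i++) {
            Block b = blocks.get(i);
            if (!b.isImage()) {
                sb.append(b.text());
            }
        }
        return sb.substring(0, Math.min(maxChars, sb.length()));
    }

    private static int indexOf(List<Block> blocks, String blockId) {
        for (int i = 0; i < blocks.size(); i++) {
            if (blocks.get(i).id().equals(blockId)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown block id: " + blockId);
    }
}
```

The window is measured in characters, matching the `200` passed in the listing. Cutting at sentence boundaries instead would likely give cleaner chunks, at the price of a little more code.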

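Retrieval Layer: Multi-Channel Retrieval Fusion

With several indexes in place, retrieval has to query multiple channels in parallel and then fuse the results: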

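import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class MultimodalRetriever {
    
    @Autowired
    private EmbeddingService textEmbeddingService;
    
    @Autowired
    private ClipEmbeddingService clipService;
    
    @Autowired
    private VectorDatabase textVectorDB;
    
    @Autowired
    private VectorDatabase imageVectorDB;
    
    @Autowired
    private ImageStorage imageStorage; // assumed storage abstraction that loads image bytes by path
    
    /**
     * Multi-channel retrieval: searches the text and image channels in parallel and fuses the results.
     */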
    public List<RetrievedChunk> retrieve(String query, RetrievalConfig config) {
        
        // Run the retrieval channels in parallel
        CompletableFuture<List<ScoredEntry>> textFuture = 
            CompletableFuture.supplyAsync(() -> retrieveText(query, config));
        
        CompletableFuture<List<ScoredEntry>> imageByTextFuture = 
            CompletableFuture.supplyAsync(() -> retrieveImageByText(query, config));
        
        // If the query itself contains an image, image-to-image search is also possible
        // (finding similar images across documents)
        // CompletableFuture<List<ScoredEntry>> imageByImageFuture = ...
        
        // Wait for all channels to finish
        CompletableFuture.allOf(textFuture, imageByTextFuture).join();
        
        List<ScoredEntry> textResults = textFuture.join();
        List<ScoredEntry> imageResults = imageByTextFuture.join();
        
        // Fuse with RRF (Reciprocal Rank Fusion)
        List<ScoredEntry> fusedResults = reciprocalRankFusion(
            Arrays.asList(textResults, imageResults), config.getTopK());
        
        // Convert to RetrievedChunk
        return convertToChunks(fusedResults);
    }
    
    private List<ScoredEntry> retrieveText(String query, RetrievalConfig config) {
        float[] queryVector = textEmbeddingService.embed(query);
        return textVectorDB.search(queryVector, config.getTopK() * 2,
            config.getMinTextScore());
    }
    
    private List<ScoredEntry> retrieveImageByText(String query, RetrievalConfig config) {
        // Embed the text query with CLIP, then search the image CLIP vector index
        float[] clipQueryVector = clipService.encodeText(query);
        return imageVectorDB.search(clipQueryVector, config.getTopK() * 2,
            config.getMinImageScore());
    }
    
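    /**
     * RRF fusion: merges several ranked result lists into one unified ranking.
     * 
     * RRF(d) = sum_i 1 / (k + rank_i(d)), where rank_i(d) is the 1-based rank of d in list i
     * k is a smoothing constant, commonly set to 60
     * 
     * Advantage: no need to align the score scales of different channels
     * (the biggest problem with simple weighted fusion)
     */
    private List<ScoredEntry> reciprocalRankFusion(
            List<List<ScoredEntry>> rankings, int topK) {
        
        Map<String, Double> rrfScores = new HashMap<>();
        Map<String, ScoredEntry> entryMap = new HashMap<>();
        
        int k = 60; // RRF smoothing constant
        
        for (List<ScoredEntry> ranking : rankings) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                ScoredEntry entry = ranking.get(rank);
                // Index entries get random UUIDs, so the same image has a different id in each
                // channel. Fuse on image_id when present so its per-channel scores add up.
                String fusionKey = entry.getMetadata().getOrDefault("image_id", entry.getId());
                
                double rrfScore = 1.0 / (k + rank + 1); // rank is 0-based here
                rrfScores.merge(fusionKey, rrfScore, Double::sum);
                entryMap.put(fusionKey, entry);
            }
        }
        
        // Sort by RRF score, highest first
        return rrfScores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(e -> {
                ScoredEntry entry = entryMap.get(e.getKey());
                entry.setFinalScore(e.getValue());
                return entry;
            })
            .collect(Collectors.toList());
    }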
    
    /**
     * A detail of hybrid retrieval: one image can surface through several index entries
     * (its CLIP entry and its description entry). Keep it once; do not repeat it.
     */
    private List<RetrievedChunk> convertToChunks(List<ScoredEntry> entries) {
        Set<String> processedImageIds = new HashSet<>();
        List<RetrievedChunk> chunks = new ArrayList<>();
        
        for (ScoredEntry entry : entries) {
            Map<String, String> metadata = entry.getMetadata();
            String imageId = metadata.get("image_id");
            int pageNumber = Integer.parseInt(metadata.getOrDefault("page", "0"));
            
            // Skip an image that was already added through another index entry
            if (imageId != null && !processedImageIds.add(imageId)) {
                continue;
            }
            
            if (imageId != null) {
                // Load the image data
                byte[] imageData = imageStorage.load(metadata.get("storage_path"));
                
                // CLIP entries keep the description under "description";
                // description entries in the text index keep it under "text"
                String description = metadata.getOrDefault("description", metadata.get("text"));
                
                chunks.add(RetrievedChunk.builder()
                    .chunkId(entry.getId())
                    .contentType(RetrievedContentType.IMAGE)
                    .text(description)
                    .imageData(imageData)
                    .score(entry.getFinalScore())
                    .pageNumber(pageNumber)
                    .build());
            } else {
                chunks.add(RetrievedChunk.builder()
                    .chunkId(entry.getId())
                    .contentType(RetrievedContentType.TEXT)
                    .text(metadata.get("text"))
                    .score(entry.getFinalScore())
                    .pageNumber(pageNumber)
                    .build());
            }
        }
        
        return chunks;
    }
}
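To make the RRF arithmetic concrete, here is a small self-contained version that works on plain id lists. It is a sketch for illustration: `RrfExample` and its methods are names invented here, and the alphabetical tie-break is an addition so that the output order is deterministic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class RrfExample {

    /** RRF score per id; list position i (0-based) counts as rank i + 1. */
    static Map<String, Double> rrfScores(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores;
    }

    /** Top ids by fused score, highest first; ties are broken alphabetically by id. */
    static List<String> fuse(List<List<String>> rankings, int k, int topK) {
        return rrfScores(rankings, k).entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey()))
            .limit(topK)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

With a text ranking of `[A, B, C]`, an image ranking of `[C, D]`, and k = 60, C scores 1/63 + 1/61 ≈ 0.0323, A scores 1/61 ≈ 0.0164, and B and D tie at 1/62 ≈ 0.0161. `RrfExample.fuse(List.of(List.of("A", "B", "C"), List.of("C", "D")), 60, 3)` therefore yields `[C, A, B]`: an item that appears in both channels beats the top hit of either channel alone, even though its raw similarity scores were never compared.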

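Context Assembly: Mixed Image-Text Context

Once mixed image-text content has been retrieved, assembling it into the context sent to the LLM is another engineering challenge: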

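import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class MultimodalContextBuilder {
    
    @Autowired
    private VisionModelClient visionClient;
    
    /**
     * Builds the multimodal context.
     * Baseline: concatenate text directly; convert images to text descriptions first.
     * Better: use an LLM that accepts multiple images and pass the images in directly.
     */
    public MultimodalContext buildContext(String query, 
                                           List<RetrievedChunk> chunks,
                                           ContextConfig config) {
        
        // Sort a copy by page number so the context reads more coherently
        // (the caller's list keeps its relevance order)
        List<RetrievedChunk> ordered = new ArrayList<>(chunks);
        ordered.sort(Comparator.comparingInt(RetrievedChunk::getPageNumber));
        
        StringBuilder textContext = new StringBuilder();
        List<byte[]> imageAttachments = new ArrayList<>();
        List<ImageReference> imageReferences = new ArrayList<>();
        
        int imageCounter = 0;
        
        for (RetrievedChunk chunk : ordered) {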
            
            if (chunk.getContentType() == RetrievedContentType.TEXT) {
                textContext.append(String.format("\n[Source: page %d]\n%s\n", 
                    chunk.getPageNumber(), chunk.getText()));
                
            } else if (chunk.getContentType() == RetrievedContentType.IMAGE) {
                imageCounter++;
                
                if (config.isDirectVisionEnabled()) {
                    // Option 1: attach the image to the multimodal request
                    // (recommended; needs a model that accepts multiple images)
                    imageAttachments.add(chunk.getImageData());
                    textContext.append(String.format("\n[Figure %d: chart/image on page %d]\n", 
                        imageCounter, chunk.getPageNumber()));
                } else {
                    // Option 2: convert the image to text (fallback)
                    String imageText = describeImageForContext(chunk, query);
                    textContext.append(String.format("\n[Figure %d (chart analysis, page %d)]\n%s\n", 
                        imageCounter, chunk.getPageNumber(), imageText));
                }
                
                imageReferences.add(ImageReference.builder()
                    .imageNumber(imageCounter)
                    .pageNumber(chunk.getPageNumber())
                    .imageId(chunk.getChunkId())
                    .build());
            }
        }
        
        return MultimodalContext.builder()
            .textContent(textContext.toString())
            .imageAttachments(imageAttachments)
            .imageReferences(imageReferences)
            .query(query)
            .build();
    }
    
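    /**
     * Converts an image to text in order to answer a specific query.
     * Note: this description is tailored to the query; it is not a general-purpose description.
     */
    private String describeImageForContext(RetrievedChunk imageChunk, String query) {
        String prompt = String.format("""
            The user's question is: %s
            
            Based on this image, extract the information relevant to the user's question.
            
            Requirements:
            1. Focus on content related to the question
            2. Extract numbers and key terms exactly
            3. If the image has no relevant information, state what it actually contains
            4. Under 200 words
            """, query);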
        
        return visionClient.analyze(imageChunk.getImageData(), prompt);
    }
    
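    /**
     * Builds the final prompt sent to the LLM.
     */
    public String buildFinalPrompt(MultimodalContext context) {
        StringBuilder prompt = new StringBuilder();
        
        prompt.append("You are a knowledge-base Q&A assistant. Answer the user's question based on the retrieved content below.\n\n");
        
        prompt.append("[Retrieved content]:\n");
        prompt.append(context.getTextContent());
        prompt.append("\n\n");
        
        // Only mention attachments when images are really attached (direct-vision mode)
        if (!context.getImageAttachments().isEmpty()) {
            prompt.append("Note: places marked [Figure X] above correspond to the attached images.\n");
        }
        
        prompt.append("[User question]:\n");
        prompt.append(context.getQuery());
        prompt.append("\n\n");
        
        prompt.append("Give an accurate answer based on the content above. If it includes data charts, cite the specific figures. ");
        prompt.append("If the retrieved content is not enough to answer the question, say so explicitly.");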
        
        return prompt.toString();
    }
}
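The `ContextConfig` used in the complete workflow further down carries `maxTextLength` and `maxImages`, but `buildContext` as listed never applies them. One possible way to enforce such a budget before assembling the context is sketched below. The `Chunk` record is a hypothetical stand-in for `RetrievedChunk`, and skipping oversized chunks rather than truncating them is an assumption of this example, not something the article prescribes.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextBudget {

    /** Hypothetical stand-in for RetrievedChunk. */
    record Chunk(String id, boolean isImage, String text, double score) {}

    /**
     * Walks the chunks in descending score order and keeps each one that still fits.
     * A chunk that would exceed the text or image budget is skipped, not truncated.
     */
    static List<Chunk> applyBudget(List<Chunk> chunks, int maxTextLength, int maxImages) {
        List<Chunk> byScore = new ArrayList<>(chunks);
        byScore.sort((a, b) -> Double.compare(b.score(), a.score()));

        List<Chunk> kept = new ArrayList<>();
        int textUsed = 0;
        int imagesUsed = 0;
        for (Chunk c : byScore) {
            if (c.isImage()) {
                if (imagesUsed >= maxImages) {
                    continue;
                }
                imagesUsed++;
            } else {
                int length = c.text().length();
                if (textUsed + length > maxTextLength) {
                    continue;
                }
                textUsed += length;
            }
            kept.add(c);
        }
        return kept;
    }
}
```

Applying the budget by relevance first and sorting by page number afterwards keeps the most useful chunks while preserving a readable order in the final prompt.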

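Query Routing: Different Query Types, Different Strategies

Not every query needs multimodal retrieval, so route each query based on its content: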

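import java.util.Arrays;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class QueryRouter {
    
    @Autowired
    private LlmClient llmClient;
    
    /**
     * Analyzes the query type and decides the retrieval strategy.
     */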
    public QueryRouteDecision route(String query) {
        
        QueryType queryType = classifyQuery(query);
        
        return switch (queryType) {
            case FACTUAL_TEXT -> QueryRouteDecision.builder()
                .strategy(RetrievalStrategy.TEXT_ONLY)
                .explanation("Purely textual factual query; no image retrieval needed")
                .build();
                
            case VISUAL_DATA -> QueryRouteDecision.builder()
                .strategy(RetrievalStrategy.IMAGE_PRIORITY)
                .explanation("Query involves chart data; search images first")
                .build();
                
            case PROCESS_FLOW -> QueryRouteDecision.builder()
                .strategy(RetrievalStrategy.DIAGRAM_PRIORITY)
                .explanation("Query involves a process; search flowcharts first")
                .build();
                
            case MIXED -> QueryRouteDecision.builder()
                .strategy(RetrievalStrategy.DUAL_CHANNEL)
                .explanation("Composite query; search text and images together")
                .build();
                
            default -> QueryRouteDecision.builder()
                .strategy(RetrievalStrategy.TEXT_ONLY)
                .build();
        };
    }
    
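    private QueryType classifyQuery(String query) {
        String lowerQuery = query.toLowerCase();
        
        // Keywords that clearly signal a visual query. The literals stay in Chinese because
        // they are matched against Chinese user queries. Glosses: chart, line chart, bar chart,
        // pie chart, trend, data plot, "the picture shows", "look at the figure", "in the figure"
        List<String> visualKeywords = Arrays.asList(
            "图表", "折线图", "柱状图", "饼图", "趋势", "数据图",
            "图片显示", "看图", "图中"
        );
        
        for (String keyword : visualKeywords) {
            if (lowerQuery.contains(keyword)) return QueryType.VISUAL_DATA;
        }
        
        // Process-related keywords. Glosses: process, steps, "how to operate", "how to do", flowchart
        List<String> processKeywords = Arrays.asList(
            "流程", "步骤", "怎么操作", "如何做", "流程图"
        );
        
        for (String keyword : processKeywords) {
            if (lowerQuery.contains(keyword)) return QueryType.PROCESS_FLOW;
        }
        
        // Numeric or statistical queries (may need charts): a digit or %, or the words
        // "how many/how much", "ratio", "growth"
        if (query.matches(".*[\\d%].*|.*多少.*|.*比例.*|.*增长.*")) {
            return QueryType.MIXED;
        }
        
        return QueryType.FACTUAL_TEXT;
    }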
}

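Image Citation Tracking: Making Answers Traceable

An important feature of multimodal RAG: any image cited in an answer must be traceable to its original source: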

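import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class AnswerCitationTracker {
    
    @Autowired
    private LlmClient llmClient; // used to add the citation markers
    
    /**
     * Tracks which parts of the LLM's answer come from which image,
     * so the user can click "view original image".
     */
    public AnswerWithCitations trackCitations(String answer, 
                                                MultimodalContext context,
                                                List<RetrievedChunk> retrievedChunks) {
        
        // List the available figures, then ask the LLM to annotate citation sources
        String figureList = context.getImageReferences().stream()
            .map(ref -> String.format("Figure %d: page %d", 
                ref.getImageNumber(), ref.getPageNumber()))
            .collect(Collectors.joining("\n"));
        
        String citationPrompt = String.format("""
            Below are an answer to the user's question and the list of reference images.
            Wherever the answer draws on an image, add a citation marker [Figure X] at the end of that sentence.
            
            Answer:
            %s
            
            Image list:
            %s
            
            Requirements:
            1. Add a marker only where image content is actually cited
            2. If one sentence cites several images, mark all of them
            3. Do not change the wording of the original answer
            
            Return the answer with the citation markers added.
            """, answer, figureList);
        
        String answerWithCitations = llmClient.complete(citationPrompt);
        
        // Map each figure number to its image reference so markers can be resolved
        Map<Integer, ImageReference> citations = new HashMap<>();
        for (ImageReference ref : context.getImageReferences()) {
            citations.put(ref.getImageNumber(), ref);
        }
        
        return AnswerWithCitations.builder()
            .answer(answerWithCitations)
            .citations(citations)
            .build();
    }
}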
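The tracker above returns every retrieved image as a potential citation; it does not inspect which markers the LLM actually wrote. If only the figures that were really cited should become clickable, the markers can be extracted with a regular expression. This is a sketch under the assumption that markers look like `[Figure 2]` (or the Chinese form `[图2]`); `CitationParser` is a name invented for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CitationParser {

    // Matches "[Figure 2]" and also the Chinese marker form (\u56FE is the character for "figure")
    private static final Pattern MARKER =
        Pattern.compile("\\[(?:Figure|\u56FE)\\s*(\\d+)\\]");

    /** Distinct figure numbers cited in the answer, in order of first appearance. */
    static Set<Integer> citedFigures(String answer) {
        Set<Integer> cited = new LinkedHashSet<>();
        Matcher m = MARKER.matcher(answer);
        while (m.find()) {
            cited.add(Integer.parseInt(m.group(1)));
        }
        return cited;
    }
}
```

The result can then be used to filter the `citations` map, for example with `citations.keySet().retainAll(CitationParser.citedFigures(answer))`, and a marker whose number is not in the map is a sign that the LLM invented a reference.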

A Complete Workflow

Wiring all the components together:

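import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class MultimodalRagService {
    
    @Autowired
    private QueryRouter queryRouter;
    
    @Autowired
    private MultimodalRetriever retriever;
    
    @Autowired
    private MultimodalContextBuilder contextBuilder;
    
    @Autowired
    private VisionModelClient visionClient;
    
    @Autowired
    private LlmClient llmClient; // text-only generation path
    
    @Autowired
    private AnswerCitationTracker citationTracker;
    
    /**
     * The complete multimodal RAG query flow.
     */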
    public MultimodalRagAnswer query(String userQuery) {
        
        // Step 1: route the query
        QueryRouteDecision routeDecision = queryRouter.route(userQuery);
        
        // Step 2: retrieve (buildRetrievalConfig, not shown here, maps the route decision to a RetrievalConfig)
        RetrievalConfig config = buildRetrievalConfig(routeDecision);
        List<RetrievedChunk> chunks = retriever.retrieve(userQuery, config);
        
        if (chunks.isEmpty()) {
            return MultimodalRagAnswer.noResults("No relevant content found");
        }
        
        // Step 3: assemble the context
        MultimodalContext context = contextBuilder.buildContext(userQuery, chunks, 
            ContextConfig.builder()
                .directVisionEnabled(true)
                .maxTextLength(3000)
                .maxImages(5)
                .build());
        
        // Step 4: call the LLM to generate the answer
        String textPrompt = contextBuilder.buildFinalPrompt(context);
        String answer;
        
        if (!context.getImageAttachments().isEmpty()) {
            // Images attached: use the multimodal call
            answer = visionClient.analyzeWithText(
                context.getImageAttachments(), textPrompt);
        } else {
            // Text only
            answer = llmClient.complete(textPrompt);
        }
        
        // Step 5: track citations
        AnswerWithCitations citedAnswer = citationTracker.trackCitations(
            answer, context, chunks);
        
        return MultimodalRagAnswer.builder()
            .answer(citedAnswer.getAnswer())
            .citations(citedAnswer.getCitations())
            .sourceChunks(chunks)
            .retrievalStrategy(routeDecision.getStrategy().name())
            .build();
    }
}
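`buildRetrievalConfig` is called in the workflow but not listed. As a rough idea of what it might look like, the sketch below maps each strategy to a small config object. Everything here is an assumption made for illustration: the `Config` record, the `searchImages` flag, and above all the numeric thresholds, which are placeholders rather than tuned or measured values.

```java
final class RetrievalConfigs {

    /** Mirrors the strategy names used by the query router. */
    enum Strategy { TEXT_ONLY, IMAGE_PRIORITY, DIAGRAM_PRIORITY, DUAL_CHANNEL }

    /** Hypothetical config shape; searchImages = false means the image channel is skipped. */
    record Config(int topK, double minTextScore, double minImageScore, boolean searchImages) {}

    /** Placeholder thresholds: replace them with values tuned on your own data. */
    static Config forStrategy(Strategy strategy) {
        return switch (strategy) {
            // Text-only queries never touch the image index, which saves a CLIP call
            case TEXT_ONLY -> new Config(5, 0.70, 1.00, false);
            // Image-first strategies accept weaker image matches and demand stronger text matches
            case IMAGE_PRIORITY, DIAGRAM_PRIORITY -> new Config(5, 0.75, 0.20, true);
            case DUAL_CHANNEL -> new Config(5, 0.70, 0.25, true);
        };
    }
}
```

Note that the retriever as listed always queries both channels, so a routing decision only saves cost if the retriever also checks something like `searchImages` before launching the image search.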

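Pitfalls and Lessons Learned

Pitfall 1: CLIP's Language Bias

CLIP models (including the original OpenAI release) are trained mainly on English data, so Chinese queries perform noticeably worse than English ones. When retrieving Chinese charts, searching the description vectors (text embeddings) is more accurate than searching with CLIP. One option is a fallback: search with CLIP first, and if nothing relevant comes back, switch to searching the text vectors of the image descriptions.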
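The fallback can be expressed independently of any particular vector store. In this sketch the two searches are passed in as functions; `Hit`, `FallbackSearch`, and the `minScore` cut-off for "relevant" are assumptions of the example, and a sensible threshold has to be found empirically for the CLIP model in use.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackSearch {

    /** Hypothetical search hit: an id plus a similarity score. */
    record Hit(String id, double score) {}

    /**
     * Runs the CLIP search first. If no hit reaches minScore, falls back to
     * searching the text vectors of the image descriptions.
     */
    static List<Hit> searchWithFallback(String query,
                                        Function<String, List<Hit>> clipSearch,
                                        Function<String, List<Hit>> descriptionSearch,
                                        double minScore) {
        List<Hit> primary = clipSearch.apply(query);
        boolean hasRelevantHit = primary.stream().anyMatch(h -> h.score() >= minScore);
        return hasRelevantHit ? primary : descriptionSearch.apply(query);
    }
}
```

Because the description search only runs when CLIP comes back empty-handed, the common case costs a single lookup.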

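Pitfall 2: Unstable Image Description Quality

The same image can get different descriptions from a VLM on different runs, which hurts index quality. Remedies: generate the description several times and keep the longest, most detailed one; or set the temperature to 0 to make the output more deterministic.

Pitfall 3: Image Indexing Cost for Large Documents

A 500-page report may contain 200 images, and generating a description for each one with a VLM makes the API cost high. Optimizations:

Classify images by type first and skip low-information images (decorative pictures) without indexing them
Generate descriptions in batches (several images per call)
Cache image descriptions so that images with identical content are not described twice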
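The caching idea in the last item can be sketched by keying descriptions on a hash of the image bytes. This is a minimal in-memory version, assuming the expensive VLM call is supplied as a function; `DescriptionCache` is a name made up here, and a production setup would more likely persist the cache in a database or key-value store.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class DescriptionCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<byte[], String> describer; // the expensive VLM call

    DescriptionCache(Function<byte[], String> describer) {
        this.describer = describer;
    }

    /** Returns the description for these bytes, invoking the describer only on a cache miss. */
    String describe(byte[] imageData) {
        return cache.computeIfAbsent(sha256Hex(imageData), key -> describer.apply(imageData));
    }

    /** Lowercase hex SHA-256 digest, used as the cache key. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

Hashing the raw bytes only catches byte-identical files, such as a logo repeated on every page. Catching resized or re-encoded copies would need a perceptual hash, which is outside the scope of this sketch.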

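Pitfall 4: Retrieved Images and Text That Do Not Match Up

A chart is retrieved but its explanatory text is not, or the explanatory text is retrieved but the corresponding chart is not. Solution: context-bound indexing (Strategy 3), which packages each image together with its surrounding text.

Performance Comparison

How the three indexing strategies compare across scenarios (based on our own tests):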

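Scenario	Text substitution	Dual-channel indexing	Context binding
Chart data queries	Medium	High	High
Flowchart queries	Low	Medium	High
Mixed-content queries	Medium	High	High
Indexing cost	Low	Medium	High
Retrieval latency	Low	Medium	Low (direct fetch)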

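Conclusion: if the budget allows, context binding is the best strategy; on a limited budget, dual-channel indexing is the best balance.

Summary

The core engineering points of multimodal RAG:

Index through multiple channels: build text and image indexes separately, each serving its own purpose
RRF fusion beats simple weighted fusion because it sidesteps the mismatch in score scales across channels
Context binding is the most accurate, but it costs more
Query routing saves cost: not every query needs image retrieval
Traceability matters, especially in enterprise settings, where answers must point back to the original source

Multimodal RAG is still evolving quickly. As natively multimodal models grow stronger, much of today's engineering workaround may eventually be absorbed by model capabilities. The basic design ideas behind indexing and retrieval, however, will remain.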