Post 2198: Multimodal RAG System Design: Retrieval Architecture for a Mixed Text-and-Image Knowledge Base

Lao Zhang · about 8 min read

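Audience: Java engineers with RAG development experience who want to extend to mixed text-and-image retrieval | Reading time: about 18 minutes | Key takeaway: a complete multimodal RAG architecture that tackles retrieval over a mixed text-and-image knowledge base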

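We were building a technical knowledge base for a manufacturing company. Their document library held tens of thousands of items: technical manuals, troubleshooting guides, and repair case studies. These documents lean heavily on images (circuit diagrams, fault photos, part comparison pictures), and text alone cannot convey what they show.

We started with plain RAG: extract the text at ingestion time, embed it, store the vectors in a vector database, and retrieve by text similarity at query time. After the system had run for two months, the engineers' feedback was this: when they described a specific fault in words, the answers were fine, but when they photographed a fault and asked "what is the problem here?", the system had no idea what to look for.

This is the fundamental limitation of plain RAG: it has text vectors only, no visual vectors.

The core problem multimodal RAG has to solve is how to store text knowledge and image knowledge in one searchable space while supporting four retrieval modes: text-to-text, image-to-image, text-to-image, and image-to-text.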

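1. Choosing a Multimodal RAG Architecture

There are three mainstream architectures for multimodal RAG, each suited to different scenarios:

Architecture 1: Late Fusion

Text and images are embedded by separate models and stored in separate indexes. At query time each index is searched on its own and the results are merged:

Text query  → text embedding model  → text index search  ─┐
                                                          → result fusion → LLM generates the answer
Image query → image embedding model → image index search ─┘

Pros: simple to implement; text and images each use their best-suited model.
Cons: no true cross-modal retrieval (for example, finding images with a text query).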
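
The "result fusion" step in the diagram is left abstract. One common option, shown here as a minimal sketch and not necessarily what this system used, is Reciprocal Rank Fusion (RRF), which merges ranked lists by rank position instead of raw score. The class name `LateFusionSketch`, the item IDs, and the constant `K = 60` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LateFusionSketch {

    // RRF smoothing constant; 60 is a frequently used default, assumed here.
    static final int K = 60;

    /**
     * Merges several ranked lists of item IDs (best hit first) into one list.
     * Each item scores sum(1 / (K + rank)) over the lists it appears in.
     */
    static List<String> fuse(List<List<String>> rankedLists, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranked.get(i), 1.0 / (K + rank), Double::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        // Sort by fused score, highest first; break ties by ID for determinism.
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(ids.subList(0, Math.min(topK, ids.size())));
    }
}
```

Because RRF looks only at rank positions, it sidesteps the question of whether scores from two different embedding models are comparable. An item that appears in both lists accumulates score from each.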

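Architecture 2: Unified Embedding Space

A multimodal embedding model such as CLIP maps text and images into the same vector space:

Text/image → CLIP Encoder → unified vector space → hybrid index search

Pros: supports cross-modal retrieval; clean architecture.
Cons: CLIP-style models understand text less well than dedicated text embedding models.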
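
Once text and images share one vector space, cross-modal retrieval reduces to a nearest-neighbor search over all items regardless of modality. The sketch below illustrates that idea with cosine similarity over toy 3-dimensional vectors. It does not call a real CLIP encoder, and `UnifiedSpaceSketch` and the item IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class UnifiedSpaceSketch {

    /** Cosine similarity between two vectors of equal dimension. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction; treat as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the IDs of the topK indexed items most similar to the query vector. */
    static List<String> nearest(float[] query, Map<String, float[]> index, int topK) {
        List<String> ids = new ArrayList<>(index.keySet());
        ids.sort(Comparator
            .comparingDouble((String id) -> cosine(query, index.get(id)))
            .reversed());
        return new ArrayList<>(ids.subList(0, Math.min(topK, ids.size())));
    }
}
```

A production system would delegate this search to the vector database's approximate nearest-neighbor index. The brute-force loop here only shows that text items and image items compete in the same ranking.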

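Architecture 3: Text-Anchored (text as primary, images attached)

A VLM generates a text description for each image. Text serves as the primary index, and the image is stored as attached information:

Image → VLM description generation → text vector → text index
At query time → text search → return [text + matching images] → VLM generates an answer that includes images

Pros: good text retrieval quality; relatively simple to implement.
Cons: image information depends on the quality of the VLM description, so details may be lost.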

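Our Choice: A Hybrid Architecture

Based on our actual business needs, we combined Architecture 2 and Architecture 3:

Text content is indexed with a text embedding model such as BGE (high precision)
Each image gets both a CLIP vector (for cross-modal retrieval) and a VLM text description (for text retrieval)
At query time, the query type decides which index to route to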
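
The routing rule can be sketched as a small pure function. The collection and field names (`text_chunks.embedding`, `image_chunks.text_embedding`, `image_chunks.clip_embedding`) are borrowed from the indexing code in the next section. The class `QueryRouterSketch` and the "collection.field" string format are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QueryRouterSketch {

    /** Returns the "collection.vectorField" targets to search for a query. */
    static List<String> route(boolean hasText, boolean hasImage) {
        List<String> targets = new ArrayList<>();
        if (hasText) {
            // Text to text: dedicated text embedding over text chunks
            targets.add("text_chunks.embedding");
            // Text to image: same text embedding matched against VLM descriptions
            targets.add("image_chunks.text_embedding");
        }
        if (hasImage) {
            // Image to image: CLIP embedding of the query image
            targets.add("image_chunks.clip_embedding");
        }
        return targets;
    }
}
```

A query that carries both text and an image hits all three targets, which is why the merged results later need deduplication and careful score handling.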

2. Document Processing Pipeline for the Multimodal Knowledge Base

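import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.springframework.stereotype.Service;

@Service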
public class MultimodalDocumentIndexer {
    
    // Dependencies are assumed to be constructor-injected by Spring; the constructor is omitted here.
    private final DocumentParser documentParser;
    private final VisionService visionService;
    private final EmbeddingClient textEmbeddingClient;
    private final CLIPEmbeddingService clipEmbeddingService;
    private final MilvusService milvusService;
    private final ObjectStorageService ossService;
    
    /**
     * Parses one document and indexes its text chunks and image chunks.
     */
    public IndexingResult indexDocument(byte[] documentBytes, String fileName, 
                                         String documentId) {
        DocumentParseResult parseResult = documentParser.parse(documentBytes, fileName);
        
        int textChunksIndexed = 0;
        int imagesIndexed = 0;
        
        // Index the text chunks
        for (TextChunk chunk : parseResult.getTextChunks()) {
            indexTextChunk(chunk, documentId);
            textChunksIndexed++;
        }
        
        // Index the image chunks
        for (ImageChunk imageChunk : parseResult.getImageChunks()) {
            indexImageChunk(imageChunk, documentId);
            imagesIndexed++;
        }
        
        return new IndexingResult(documentId, textChunksIndexed, imagesIndexed);
    }
    
    private void indexTextChunk(TextChunk chunk, String documentId) {
        // Generate the text embedding
        float[] embedding = textEmbeddingClient.embed(chunk.getText());
        
        // Insert into the Milvus text collection
        milvusService.insert("text_chunks", Map.of(
            "id", UUID.randomUUID().toString(),
            "document_id", documentId,
            "chunk_text", chunk.getText(),
            "page_number", chunk.getPageNumber(),
            "embedding", embedding,
            "content_type", "text"
        ));
    }
    
    private void indexImageChunk(ImageChunk imageChunk, String documentId) {
        byte[] imageBytes = imageChunk.getImageBytes();
        String imageId = UUID.randomUUID().toString();
        
        // 1. Upload the image to object storage
        String imageUrl = ossService.upload(imageId + ".jpg", imageBytes);
        
        // 2. Generate a text description of the image with a VLM
        String imageDescription = generateImageDescription(imageBytes, imageChunk);
        
        // 3. Embed the description (lets text queries retrieve images)
        float[] textEmbedding = textEmbeddingClient.embed(imageDescription);
        
        // 4. Generate the CLIP image embedding (lets image queries retrieve images)
        float[] clipEmbedding = clipEmbeddingService.embedImage(imageBytes);
        
        // 5. Insert into the Milvus image collection
        milvusService.insert("image_chunks", Map.of(
            "id", imageId,
            "document_id", documentId,
            "image_url", imageUrl,
            "description", imageDescription,
            "page_number", imageChunk.getPageNumber(),
            "text_embedding", textEmbedding,
            "clip_embedding", clipEmbedding,
            "content_type", "image"
        ));
    }
    
    private String generateImageDescription(byte[] imageBytes, ImageChunk imageChunk) {
        // Append the surrounding text, when available, as a context hint for the prompt
        String contextHint = imageChunk.getSurroundingText() != null
            ? "\nContext: " + imageChunk.getSurroundingText()
            : "";
        
        VisionRequest request = VisionRequest.builder()
            .images(List.of(ImageInput.fromBytes(imageBytes, "image/jpeg")))
            .prompt("""
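                Write a detailed technical description of this image for knowledge base retrieval.
                The description should:
                1. Describe the main content and technical elements of the image
                2. Include any text, numbers, and labels visible in the image
                3. Describe the technical principle or fault characteristics the image shows
                4. Use professional terminology
                5. Be roughly 100-200 characters long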
                """ + contextHint)
            .build();
        
        return visionService.analyzeImage(request).getContent();
    }
}

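3. The Multimodal Retrieval Engine

Retrieval is the most complex part of multimodal RAG, because it has to handle several kinds of queries: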

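import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.stereotype.Service;

@Service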
public class MultimodalRetriever {
    
    // Dependencies are assumed to be constructor-injected by Spring; the constructor is omitted here.
    private final EmbeddingClient textEmbeddingClient;
    private final CLIPEmbeddingService clipEmbeddingService;
    private final MilvusService milvusService;
    private final VisionService visionService;
    
    /**
     * Unified retrieval entry point: detects the query type and routes accordingly.
     */
    public List<RetrievalResult> retrieve(MultimodalQuery query, int topK) {
        List<RetrievalResult> results = new ArrayList<>();
        
        if (query.hasText()) {
            // Text query: retrieve text chunks, plus images via their descriptions
            results.addAll(retrieveByText(query.getText(), topK));
        }
        
        if (query.hasImage()) {
            // Image query: retrieve similar images by CLIP embedding
            results.addAll(retrieveByImage(query.getImageBytes(), topK));
        }
        
        // Deduplicate, merge, and sort by relevance
        return mergeAndRank(results, topK);
    }
    
    private List<RetrievalResult> retrieveByText(String text, int topK) {
        float[] textEmbedding = textEmbeddingClient.embed(text);
        List<RetrievalResult> results = new ArrayList<>();
        
        // Search the text chunks
        List<Map<String, Object>> textResults = milvusService.search(
            "text_chunks", "embedding", textEmbedding, topK,
            Map.of("content_type", "text"));
        
        textResults.forEach(r -> results.add(RetrievalResult.fromTextChunk(r)));
        
        // Search images through the embeddings of their descriptions;
        // Math.max keeps the limit at 1 or more when topK is 1
        List<Map<String, Object>> imageResults = milvusService.search(
            "image_chunks", "text_embedding", textEmbedding, Math.max(1, topK / 2),
            Map.of("content_type", "image"));
        
        imageResults.forEach(r -> results.add(RetrievalResult.fromImageChunk(r)));
        
        return results;
    }
    
    private List<RetrievalResult> retrieveByImage(byte[] imageBytes, int topK) {
        float[] clipEmbedding = clipEmbeddingService.embedImage(imageBytes);
        List<RetrievalResult> results = new ArrayList<>();
        
        // Image similarity search with CLIP
        List<Map<String, Object>> imageResults = milvusService.search(
            "image_chunks", "clip_embedding", clipEmbedding, topK,
            Map.of("content_type", "image"));
        
        imageResults.forEach(r -> results.add(RetrievalResult.fromImageChunk(r)));
        
        return results;
    }
    
    private List<RetrievalResult> mergeAndRank(List<RetrievalResult> results, int topK) {
        // Deduplicate by document ID + page number, keeping only the highest-scoring result per page
        Map<String, RetrievalResult> deduped = new LinkedHashMap<>();
        for (RetrievalResult result : results) {
            String key = result.getDocumentId() + "_" + result.getPageNumber();
            deduped.merge(key, result, (a, b) -> a.getScore() >= b.getScore() ? a : b);
        }
        
        return deduped.values().stream()
            .sorted(Comparator.comparingDouble(RetrievalResult::getScore).reversed())
            .limit(topK)
            .collect(Collectors.toList());
    }
}

4. Multimodal Answer Generation

Once the text and image results are retrieved, they are sent together to a VLM to generate the answer:

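import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import org.springframework.stereotype.Service;

@Service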
public class MultimodalRAGChain {
    
    // Dependencies are assumed to be constructor-injected by Spring; the constructor is omitted here.
    private final MultimodalRetriever retriever;
    private final VisionService visionService;
    private final ObjectStorageService ossService;
    
    public RAGResponse query(MultimodalQuery userQuery) {
        // 1. Multimodal retrieval
        List<RetrievalResult> retrievals = retriever.retrieve(userQuery, 10);
        
        // 2. Separate the text context from the images
        List<String> textContexts = retrievals.stream()
            .filter(r -> r.getType() == ContentType.TEXT)
            .map(r -> r.getContent())
            .collect(Collectors.toList());
        
        List<byte[]> contextImages = retrievals.stream()
            .filter(r -> r.getType() == ContentType.IMAGE)
            .map(r -> ossService.download(r.getImageUrl()))
            .collect(Collectors.toList());
        
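        // 3. Build the multimodal request
        String systemPrompt = """
            You are a technical knowledge base assistant. Answer the user's technical question based on the provided text context and images.
            When you cite content from an image, state explicitly "as shown in the figure".
            If the retrieved content is not enough to answer the question, say so honestly.
            """;
        
        StringBuilder contextBuilder = new StringBuilder("Reference material:\n\n");
        for (int i = 0; i < textContexts.size(); i++) {
            contextBuilder.append("[Document ").append(i + 1).append("]\n");
            contextBuilder.append(textContexts.get(i)).append("\n\n");
        }
        
        // Image-only queries carry no text, so avoid appending the literal "null"
        String question = userQuery.hasText() ? userQuery.getText() : "";
        String fullPrompt = contextBuilder.toString() + "\nUser question: " + question;
        
        // 4. Build the request with the images attached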
        List<ImageInput> allImages = new ArrayList<>();
        if (userQuery.hasImage()) {
            allImages.add(ImageInput.fromBytes(userQuery.getImageBytes(), "image/jpeg"));
        }
        contextImages.stream()
            .map(img -> ImageInput.fromBytes(img, "image/jpeg"))
            .forEach(allImages::add);
        
        VisionRequest request = VisionRequest.builder()
            .images(allImages)
            .systemPrompt(systemPrompt)
            .prompt(fullPrompt)
            .maxTokens(2000)
            .build();
        
        VisionResponse response = visionService.analyzeImage(request);
        
        return new RAGResponse(
            response.getContent(),
            retrievals,
            response.getPromptTokens(),
            response.getCompletionTokens()
        );
    }
}

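5. Engineering Challenges in Multimodal RAG and How to Handle Them

Challenge 1: Indexing cost

Every image needs a VLM call to generate its description (about USD 0.01 per image), so 10,000 images cost about USD 100. Mitigation: for batch processing, generate descriptions with a cheaper model (such as gemini-flash) and use an expensive model only when necessary.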
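
To make the trade-off concrete, here is a back-of-the-envelope estimator, offered as a rough sketch. It assumes every image first goes through a cheap model and some fraction is re-run on an expensive one. Only the USD 0.01 per-image figure comes from the text above. The cheap-model price and the escalation rate in the usage note are hypothetical inputs, not measured values.

```java
class IndexingCostSketch {

    /**
     * Estimated description-generation cost in USD when every image is first
     * described by a cheap model and a fraction is re-run on an expensive one.
     *
     * @param escalationRate fraction of images (0.0 to 1.0) sent to the expensive model
     */
    static double estimateUsd(int imageCount, double cheapPerImage,
                              double expensivePerImage, double escalationRate) {
        if (escalationRate < 0 || escalationRate > 1) {
            throw new IllegalArgumentException("escalationRate must be within [0, 1]");
        }
        double cheapCost = imageCount * cheapPerImage;
        double escalatedCost = imageCount * escalationRate * expensivePerImage;
        return cheapCost + escalatedCost;
    }
}
```

With 10,000 images, a hypothetical cheap price of USD 0.001, and 10% of images escalated at USD 0.01, `estimateUsd(10_000, 0.001, 0.01, 0.1)` comes to roughly USD 20, compared with about USD 100 when everything runs at USD 0.01.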

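Challenge 2: Image description quality

The quality of the VLM-generated descriptions directly affects retrieval quality. It is worth running a quality check on descriptions: flag any that are too short (under 50 characters) or too generic, and have them improved by hand.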
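
A crude version of that check might look like the following sketch. The 50-character threshold comes from the text above. The list of "generic" phrases is a hypothetical placeholder that a real project would tune against its own descriptions, and `DescriptionQualitySketch` is an illustrative name.

```java
import java.util.List;
import java.util.Locale;

class DescriptionQualitySketch {

    // Minimum length in characters, taken from the threshold mentioned above.
    static final int MIN_LENGTH = 50;

    // Hypothetical phrases that often signal a description with little detail.
    static final List<String> GENERIC_PHRASES =
        List.of("an image of", "a picture of", "this image shows");

    /** Returns true when a description should be queued for manual review. */
    static boolean needsReview(String description) {
        if (description == null) {
            return true;
        }
        String trimmed = description.strip();
        // Count code points so that each CJK character counts as one.
        if (trimmed.codePointCount(0, trimmed.length()) < MIN_LENGTH) {
            return true;
        }
        String lower = trimmed.toLowerCase(Locale.ROOT);
        for (String phrase : GENERIC_PHRASES) {
            if (lower.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

Phrase matching is a blunt instrument and will produce false positives. Its purpose here is only to shrink the pile that a human has to look at.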

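Challenge 3: Semantic alignment of vector spaces

Vectors produced by different models (BGE for text, CLIP for images) live in semantic spaces that are not aligned. In hybrid retrieval, similarity scores from different sources are not directly comparable, so each source has to be normalized separately before the results are merged.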
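
One simple way to do this, shown as a sketch under the assumption that min-max scaling per source is acceptable, is to rescale each result list into [0, 1] before merging. The class name and the sample IDs are hypothetical, and other schemes (z-scores, rank-based fusion) are equally valid choices.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ScoreNormalizationSketch {

    /** Min-max normalizes the scores of a single source into [0, 1]. */
    static Map<String, Double> minMax(Map<String, Double> scores) {
        Map<String, Double> normalized = new LinkedHashMap<>();
        if (scores.isEmpty()) {
            return normalized;
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double score : scores.values()) {
            min = Math.min(min, score);
            max = Math.max(max, score);
        }
        double range = max - min;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            // When all scores are equal there is nothing to rank; use 1.0 for each.
            double value = range == 0 ? 1.0 : (entry.getValue() - min) / range;
            normalized.put(entry.getKey(), value);
        }
        return normalized;
    }

    /** Normalizes each source separately, then merges, keeping the higher score per ID. */
    static Map<String, Double> merge(Map<String, Double> textScores,
                                     Map<String, Double> clipScores) {
        Map<String, Double> merged = new LinkedHashMap<>(minMax(textScores));
        minMax(clipScores).forEach((id, score) -> merged.merge(id, score, Double::max));
        return merged;
    }
}
```

Note the limitation: min-max scaling always promotes the best hit of each source to 1.0, even when that hit is weak in absolute terms, so a minimum-similarity cutoff per source is still worth applying before normalization.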

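Challenge 4: Keeping images up to date

When the knowledge base is updated, image descriptions and vectors have to be regenerated. Record a content hash for each image at indexing time, and on update reprocess only the images whose content has changed.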
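
A minimal sketch of the hash check, using SHA-256 from the JDK's `MessageDigest`. The in-memory map stands in for wherever the hash would really be stored (for example, a metadata field next to the vectors). The class name and the image key format are assumptions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class ImageChangeDetectorSketch {

    // Image key -> content hash recorded at the last indexing run.
    private final Map<String, String> indexedHashes = new HashMap<>();

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] bytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(bytes)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /**
     * Returns true when the image is new or its content changed since the
     * last call, and records the new hash either way.
     */
    boolean needsReindex(String imageKey, byte[] imageBytes) {
        String hash = sha256Hex(imageBytes);
        String previous = indexedHashes.put(imageKey, hash);
        return !hash.equals(previous);
    }
}
```

Only images for which `needsReindex` returns true would go back through VLM description and embedding generation, which ties this directly to the cost concern in Challenge 1.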

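Multimodal RAG is an engineering direction worth digging into. This architecture has run in a manufacturing setting for half a year, and overall Q&A accuracy improved by 35% over text-only RAG. The gain is largest on vision-related questions such as "which part in this picture is broken?", which went from almost never being answered correctly to 85% accuracy.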