Spring AI Document Parsing: A Complete Approach to Handling PDF, Word, and Web Pages

Lao Zhang · 2026/4/30 · about 7 min read

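Who this is for: Java engineers who need to build a RAG knowledge base and process enterprise documents. Reading time: about 16 minutes. What you get: one codebase that handles PDF, Word, web pages, and Excel, ready for production use.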

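First, a true story

Xiao Li is an engineer working on an enterprise knowledge management system. His client came to him with a requirement: "feed" more than a decade of accumulated company documents to an AI, so that employees can ask it directly about company rules and policies.

It sounded simple, but when Xiao Li opened the folder he was stunned: PDFs scanned in the 1990s (nothing but images), old documents in Word 97 format, Excel files with embedded tables, and a pile of intranet HTML pages. The formats were all over the place, and some files did not even have the right Chinese encoding.

He spent two weeks working through the pitfalls of every format. Afterwards he told me: "With document parsing, the devil is in the details."

Today I am writing all of those details down so that you do not have to hit the same pitfalls.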

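The Spring AI document parsing landscape

Document processing in Spring AI 1.0 is built around the DocumentReader interface. Official modules cover the mainstream formats, and other formats plug in through extension points.

Start by adding the dependencies. The snippets omit version numbers, so they assume a BOM such as spring-ai-bom manages them: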

<!-- Spring AI core (OpenAI model starter) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>

<!-- PDF parsing -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

<!-- Tika (general-purpose format parsing, including Word/Excel/PPT) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>

<!-- Vector store (PGVector used as the example; this is the Spring AI 1.0 starter name) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>

PDF parsing: from ordinary PDFs to scanned files

Ordinary PDFs (selectable text)

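import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class PdfParsingService {

    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter;

    public PdfParsingService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        // Arguments: target chunk size (500 tokens), minimum chunk size in characters (50),
        // minimum chunk length to embed (5), maximum number of chunks (10000), keep separators.
        // Note that this constructor has no overlap parameter.
        this.splitter = new TokenTextSplitter(500, 50, 5, 10000, true);
    }

    public void ingestPdf(Resource pdfResource, String sourceId) {
        // Basic configuration
        PagePdfDocumentReader reader = new PagePdfDocumentReader(
            pdfResource,
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageBottomMargin(0)
                .withPageExtractedTextFormatter(
                    ExtractedTextFormatter.builder()
                        .withNumberOfBottomTextLinesToDelete(3) // drop the page footer
                        .withNumberOfTopPagesToSkipBeforeDelete(0)
                        .build()
                )
                .withPagesPerDocument(1) // one Document per page
                .build()
        );

        List<Document> pages = reader.get();
        log.info("PDF parsed: {} pages, source: {}", pages.size(), sourceId);

        // Attach source metadata
        List<Document> enriched = pages.stream()
            .map(doc -> {
                Map<String, Object> meta = new HashMap<>(doc.getMetadata());
                meta.put("source_id", sourceId);
                meta.put("source_type", "pdf");
                meta.put("ingested_at", Instant.now().toString());
                return new Document(doc.getText(), meta);
            })
            .collect(Collectors.toList());

        // Chunk and store
        List<Document> chunks = splitter.apply(enriched);
        vectorStore.add(chunks);
        log.info("Stored {} chunks in the vector store", chunks.size());
    }
}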
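One detail worth spelling out: the five-argument TokenTextSplitter constructor takes the chunk size in tokens, the minimum chunk size in characters, the minimum chunk length to embed, the maximum number of chunks, and a keep-separator flag. None of these is an overlap. If overlapping chunks are wanted, a sliding window is one way to get them. The sketch below is plain Java and not part of Spring AI: `OverlapChunker` is a hypothetical helper, and it treats whitespace-separated words as a stand-in for tokens, which does not suit unsegmented Chinese text (a real pipeline would count model tokens).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Spring AI class): sliding-window chunking with overlap. */
final class OverlapChunker {

    private OverlapChunker() {
    }

    /**
     * Splits text into chunks of at most {@code size} words; each chunk starts with
     * the last {@code overlap} words of the previous chunk.
     */
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks;
        }
        List<String> words = Arrays.asList(text.trim().split("\\s+"));
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + size, words.size());
            chunks.add(String.join(" ", words.subList(start, end)));
            if (end == words.size()) {
                break; // the window has reached the end of the text
            }
        }
        return chunks;
    }
}
```

With a window of 4 and an overlap of 1, `OverlapChunker.chunk("a b c d e f g", 4, 1)` yields the two chunks "a b c d" and "d e f g".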

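Scanned PDFs (OCR required)

A scanned PDF contains nothing but images, so reading it directly with PagePdfDocumentReader returns empty text. It has to go through OCR first. Here are several options:

Option	Cost	Accuracy	Speed
Tesseract (local)	Free	Medium (Chinese needs additional training data)	Slow
Alibaba Cloud / Tencent Cloud OCR	Pay-as-you-go	High	Fast
Azure Computer Vision	Pay-as-you-go	High	Fast
PaddleOCR (local)	Free	High (strong on Chinese)	Medium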

@Service
public class ScannedPdfService {

    private final OcrClient ocrClient;  // wrapper around the OCR call
    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter = new TokenTextSplitter(500, 50, 5, 10000, true);

    public ScannedPdfService(OcrClient ocrClient, VectorStore vectorStore) {
        this.ocrClient = ocrClient;
        this.vectorStore = vectorStore;
    }

    /**
     * Handles a scanned PDF: render each page to an image, run OCR, then chunk and store.
     */
    public void ingestScannedPdf(Resource pdfResource, String sourceId) throws Exception {
        List<Document> documents = new ArrayList<>();

        // Loader.loadPDF is the PDFBox 3.x API; on PDFBox 2.x use PDDocument.load(inputStream).
        // try-with-resources closes the document even if rendering or OCR throws.
        try (PDDocument pdfDoc = Loader.loadPDF(pdfResource.getContentAsByteArray())) {
            PDFRenderer renderer = new PDFRenderer(pdfDoc);
            int totalPages = pdfDoc.getNumberOfPages();

            for (int i = 0; i < totalPages; i++) {
                // 300 DPI gives better OCR results
                BufferedImage pageImage = renderer.renderImageWithDPI(i, 300);

                // Run OCR on the rendered page
                String pageText = ocrClient.recognize(pageImage);
                if (pageText != null && !pageText.isBlank()) {
                    documents.add(new Document(pageText,
                        Map.of(
                            "source_id", sourceId,
                            "source_type", "scanned_pdf",
                            "page_number", i + 1,
                            "total_pages", totalPages
                        )));
                }
            }
        }

        // Chunk and store
        vectorStore.add(splitter.apply(documents));
    }
}
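Because a scanned PDF yields little or no text from the regular reader, one way to decide when to fall back to OCR is to look at how much text each page produced. The sketch below assumes the page texts have already been extracted. `ScanDetector` and the threshold values in the usage note are hypothetical illustration choices, not something defined by Spring AI or by the pipeline above.

```java
import java.util.List;

/** Hypothetical helper: guesses whether a PDF is a scan from its extracted page texts. */
final class ScanDetector {

    private ScanDetector() {
    }

    /** Counts the non-whitespace characters in one page's extracted text. */
    static int visibleChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        return (int) pageText.chars().filter(c -> !Character.isWhitespace(c)).count();
    }

    /**
     * Returns true when at least {@code minEmptyRatio} of the pages contain fewer than
     * {@code minCharsPerPage} visible characters, which suggests the text layer is
     * missing and the file should go through OCR.
     */
    static boolean looksScanned(List<String> pageTexts, int minCharsPerPage, double minEmptyRatio) {
        if (pageTexts.isEmpty()) {
            return true; // nothing was extracted at all
        }
        long nearlyEmpty = pageTexts.stream()
            .filter(text -> visibleChars(text) < minCharsPerPage)
            .count();
        return (double) nearlyEmpty / pageTexts.size() >= minEmptyRatio;
    }
}
```

A caller could map the Documents returned by the PDF reader to their text and call `ScanDetector.looksScanned(pageTexts, 20, 0.5)`; both numbers are starting points to tune against real files.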

Word/Excel/PPT: Tika handles them all

Apache Tika is one of the best options for Office documents, and Spring AI wraps it so that it works out of the box:

@Service
@Slf4j
public class OfficeDocumentService {

    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter;

    public OfficeDocumentService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.splitter = new TokenTextSplitter(600, 60, 5, 10000, true);
    }

    /**
     * Handles Office documents (Word/Excel/PPT as well as LibreOffice formats)
     */
    public void ingestOfficeDoc(Resource resource, String sourceId) {
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        List<Document> docs = reader.get();

        if (docs.isEmpty()) {
            log.warn("Document parsed to an empty result, source: {}", sourceId);
            return;
        }

        // Clean the text and enrich the metadata
        List<Document> cleaned = docs.stream()
            .filter(doc -> doc.getText() != null && doc.getText().length() > 50)
            .map(doc -> {
                String cleanedText = cleanText(doc.getText());
                Map<String, Object> meta = new HashMap<>(doc.getMetadata());
                meta.put("source_id", sourceId);
                meta.put("source_type", detectFileType(resource));
                meta.put("ingested_at", Instant.now().toString());
                return new Document(cleanedText, meta);
            })
            .collect(Collectors.toList());

        List<Document> chunks = splitter.apply(cleaned);
        vectorStore.add(chunks);
        log.info("Office document ingested: {} chunks, source: {}", chunks.size(), sourceId);
    }

    private String cleanText(String text) {
        return text
            .replaceAll("\\s{3,}", "\n\n")  // collapse any run of 3+ whitespace characters into one blank line
            .replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "")  // strip control characters
            .trim();
    }

    private String detectFileType(Resource resource) {
        String filename = resource.getFilename();
        if (filename == null) return "unknown";
        String lower = filename.toLowerCase();  // match extensions case-insensitively, e.g. REPORT.DOCX
        if (lower.endsWith(".docx") || lower.endsWith(".doc")) return "word";
        if (lower.endsWith(".xlsx") || lower.endsWith(".xls")) return "excel";
        if (lower.endsWith(".pptx") || lower.endsWith(".ppt")) return "powerpoint";
        return "office";
    }
}

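Special handling for Excel: Tika reads the tabular data in an Excel file as flattened text, which retrieves poorly. A better approach is to treat each sheet as a separate document, record the sheet name as metadata, and write every row as "header: value" pairs so that the column names travel with the data: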

public List<Document> parseExcelWithContext(Resource excelResource) throws Exception {
    List<Document> documents = new ArrayList<>();

    // try-with-resources closes the stream and the workbook even if parsing fails
    try (InputStream in = excelResource.getInputStream();
         Workbook workbook = WorkbookFactory.create(in)) {

        for (Sheet sheet : workbook) {
            StringBuilder sb = new StringBuilder();
            String sheetName = sheet.getSheetName();

            // Read the header row (first row) by column index, so that empty header
            // cells do not shift the names of the columns that follow them
            Row headerRow = sheet.getRow(0);
            List<String> headers = new ArrayList<>();
            if (headerRow != null) {
                for (int j = 0; j < headerRow.getLastCellNum(); j++) {
                    Cell cell = headerRow.getCell(j);
                    boolean blank = cell == null || cell.toString().isBlank();
                    headers.add(blank ? "column" + (j + 1) : cell.toString());
                }
            }

            // Turn each data row into "field name: value" pairs, which retrieve better
            for (int i = 1; i <= sheet.getLastRowNum(); i++) {
                Row row = sheet.getRow(i);
                if (row == null) continue;

                for (int j = 0; j < headers.size(); j++) {
                    Cell cell = row.getCell(j);
                    if (cell != null && !cell.toString().isBlank()) {
                        sb.append(headers.get(j)).append(": ").append(cell.toString()).append("  ");
                    }
                }
                sb.append("\n");
            }

            documents.add(new Document(sb.toString(),
                Map.of("sheet_name", sheetName, "source_type", "excel")));
        }
    }

    return documents;
}
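To make the "field name: value" layout concrete, here is the row-formatting step on its own, without POI. This is only an illustration: `RowFormatter` is a hypothetical helper, the sample headers and values are made up, and unlike the loop above it joins the pairs without a trailing separator.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper: formats one spreadsheet row as "header: value" pairs. */
final class RowFormatter {

    private RowFormatter() {
    }

    /**
     * Pairs each header with the cell in the same column and skips empty cells,
     * so sparse rows do not produce dangling field names.
     */
    static String format(List<String> headers, List<String> cells) {
        StringJoiner joiner = new StringJoiner("  ");
        for (int i = 0; i < headers.size(); i++) {
            String value = i < cells.size() ? cells.get(i) : null;
            if (value != null && !value.isBlank()) {
                joiner.add(headers.get(i) + ": " + value.trim());
            }
        }
        return joiner.toString();
    }
}
```

For headers `Name`, `Dept`, `Level` and a row whose middle cell is empty, `RowFormatter.format(List.of("Name", "Dept", "Level"), List.of("Li", "", "P6"))` returns `Name: Li  Level: P6`.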

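Parsing web page content

Web page parsing has two problems to solve: dynamic rendering (content generated by JavaScript) and noise filtering (navigation bars, ads).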

@Service
@Slf4j
public class WebPageParsingService {

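    private final RestTemplate restTemplate;
    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter;

    public WebPageParsingService(RestTemplate restTemplate, VectorStore vectorStore) {
        this.restTemplate = restTemplate;
        this.vectorStore = vectorStore;
        // Same splitter settings as the PDF service
        this.splitter = new TokenTextSplitter(500, 50, 5, 10000, true);
    }

    public void ingestWebPage(String url, String sourceId) {
        try {
            // Static HTML: fetch the page, then extract the main text with Jsoup,
            // which gives direct control over noise removal
            String html = fetchHtml(url);
            String text = extractMainContent(html, url);
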
            Document doc = new Document(text,
                Map.of(
                    "source_id", sourceId,
                    "source_type", "webpage",
                    "url", url,
                    "fetched_at", Instant.now().toString()
                ));

            List<Document> chunks = splitter.apply(List.of(doc));
            vectorStore.add(chunks);
            log.info("Web page ingested: {}, {} chunks", url, chunks.size());

        } catch (Exception e) {
            log.error("Failed to parse web page: {}", url, e);
            throw new DocumentParsingException("Failed to parse web page: " + url, e);
        }
    }

    private String fetchHtml(String url) {
        HttpHeaders headers = new HttpHeaders();
        headers.set("User-Agent", "Mozilla/5.0 (compatible; KnowledgeBot/1.0)");
        HttpEntity<String> entity = new HttpEntity<>(headers);

        ResponseEntity<String> response = restTemplate.exchange(
            url, HttpMethod.GET, entity, String.class);
        return response.getBody();
    }

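    private String extractMainContent(String html, String url) {
        // Parse with Jsoup and strip the noise. The Jsoup Document is fully qualified
        // because this class already uses Spring AI's Document.
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html, url);

        // Remove navigation, footers, ads, and similar elements
        jsoupDoc.select("nav, footer, header, .ad, .advertisement, script, style").remove();

        // Prefer the main content area
        Element main = jsoupDoc.selectFirst("main, article, .content, .article-body, #content");
        if (main != null) {
            return main.text();
        }

        // Fallback: take the full body text
        return jsoupDoc.body().text();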
    }
}

A unified document processing entry point

Bring all the formats together behind a single router:

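@Service
@Slf4j
public class DocumentIngestionRouter {

    private final PdfParsingService pdfService;
    private final OfficeDocumentService officeService;
    private final WebPageParsingService webService;
    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter = new TokenTextSplitter(500, 50, 5, 10000, true);

    public DocumentIngestionRouter(PdfParsingService pdfService, OfficeDocumentService officeService,
                                   WebPageParsingService webService, VectorStore vectorStore) {
        this.pdfService = pdfService;
        this.officeService = officeService;
        this.webService = webService;
        this.vectorStore = vectorStore;
    }

    public IngestionResult ingest(Object source, String sourceId) {
        try {
            if (source instanceof String url && (url.startsWith("http://") || url.startsWith("https://"))) {
                webService.ingestWebPage(url, sourceId);
                return IngestionResult.success(sourceId, "webpage");
            }

            if (source instanceof Resource resource) {
                String filename = resource.getFilename();
                if (filename == null) throw new IllegalArgumentException("Cannot determine the file type");

                String lower = filename.toLowerCase();
                if (lower.endsWith(".pdf")) {
                    pdfService.ingestPdf(resource, sourceId);
                    return IngestionResult.success(sourceId, "pdf");
                } else if (lower.matches(".*\\.(doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp)$")) {
                    officeService.ingestOfficeDoc(resource, sourceId);
                    return IngestionResult.success(sourceId, "office");
                } else if (lower.matches(".*\\.(txt|md|csv)$")) {
                    ingestText(resource, sourceId);
                    return IngestionResult.success(sourceId, "text");
                } else {
                    // Unknown format: let Tika try as a fallback
                    officeService.ingestOfficeDoc(resource, sourceId);
                    return IngestionResult.success(sourceId, "unknown");
                }
            }

            throw new IllegalArgumentException("Unsupported document type: " + source.getClass());

        } catch (Exception e) {
            log.error("Document ingestion failed: sourceId={}", sourceId, e);
            return IngestionResult.failure(sourceId, e.getMessage());
        }
    }

    // Plain-text ingestion: a minimal version, assuming Spring AI's TextReader
    // is sufficient for txt/md/csv files
    private void ingestText(Resource resource, String sourceId) {
        TextReader reader = new TextReader(resource);
        reader.getCustomMetadata().put("source_id", sourceId);
        reader.getCustomMetadata().put("source_type", "text");
        vectorStore.add(splitter.apply(reader.get()));
    }
}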
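The router relies on an `IngestionResult` type that is not shown. A minimal sketch, assuming it only needs to carry the source id, the detected type, and an optional error message, could be a record with two factory methods. The `DocType` enum and its `classify` method are likewise hypothetical; they isolate the extension matching along the same lines as the router's branches so that it can be tested without any I/O.

```java
import java.util.Locale;

/** Assumed shape of the router's result type; the real one may carry more fields. */
record IngestionResult(String sourceId, String type, boolean success, String error) {

    static IngestionResult success(String sourceId, String type) {
        return new IngestionResult(sourceId, type, true, null);
    }

    static IngestionResult failure(String sourceId, String error) {
        return new IngestionResult(sourceId, null, false, error);
    }
}

/** Hypothetical enum mirroring the router's branches. */
enum DocType {
    WEBPAGE, PDF, OFFICE, TEXT, UNKNOWN;

    /** Classifies a URL or filename by scheme or extension, ignoring case. */
    static DocType classify(String source) {
        if (source == null || source.isBlank()) {
            return UNKNOWN;
        }
        String lower = source.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return WEBPAGE;
        }
        if (lower.endsWith(".pdf")) {
            return PDF;
        }
        if (lower.matches(".*\\.(doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp)$")) {
            return OFFICE;
        }
        if (lower.matches(".*\\.(txt|md|csv)$")) {
            return TEXT;
        }
        return UNKNOWN; // the router hands these to Tika as a fallback
    }
}
```

Keeping the classification in one pure function makes it straightforward to add a format later: extend the enum, add one branch, and cover it with a unit test.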

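Production pitfalls at a glance

Pitfall	Symptom	Fix
Garbled Chinese in PDFs	Extracted text comes out as boxes	Check the fonts embedded in the PDF; PagePdfDocumentReader already uses PDFBox underneath, so if the text layer is unusable, fall back to OCR for those pages
Missing Word tables	Table content is not parsed	Tika has limited support for complex tables; consider converting to HTML and parsing that
OOM on large files	Heap overflow when processing files over 100 MB	Process in batches, one page or section at a time, using streaming APIs
JavaScript-rendered pages	Dynamically loaded content cannot be read	Bring in a headless browser via Selenium or Playwright
Encoding problems	Old GBK-encoded documents come out garbled	Tika can auto-detect the encoding; when detection gets it wrong, specify `Charset.forName("GBK")` explicitly
Text inside images	Text in images embedded in a PDF is missed	Run OCR separately on the image elements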
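
For the encoding row, one common approach when automatic detection guesses wrong is to try strict UTF-8 decoding first and fall back to GBK only when the bytes are not valid UTF-8. The sketch below uses the JDK's `CharsetDecoder`. `LegacyTextDecoder` is a hypothetical helper, and it assumes the JDK's extended charsets (which include GBK) are available, as they are in standard JDK distributions. It is a heuristic: a GBK byte sequence can occasionally also be valid UTF-8, so important files deserve a spot check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes text as UTF-8 when valid, otherwise as GBK. */
final class LegacyTextDecoder {

    private static final Charset GBK = Charset.forName("GBK");

    private LegacyTextDecoder() {
    }

    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy encoding
            return new String(bytes, GBK);
        }
    }
}
```

The decoded string can then be wrapped in a Document directly, bypassing detection for files known to come from the legacy system.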

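Summary

Document parsing looks simple, but the variety of formats is a real challenge. A stable approach looks like this:

PDF: use PagePdfDocumentReader, and add OCR for scanned files
Office documents: route everything through TikaDocumentReader, with special handling of headers for Excel
Web pages: extract the main content with Jsoup and strip the noise
Unified routing: dispatch by file type and isolate failures
Metadata: every chunk should carry its source information, which is used for filtering at query time

With this foundation in place, the data quality of a RAG knowledge base is on solid ground. Xiao Li ran all of his client's documents through this approach; accuracy was 20% higher than in his first version, and the code became much easier to maintain.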