PDF Parsing in Engineering Practice: More Than Just Extracting Text

老张 (Lao Zhang) | 2026/4/30 | about 7 min read

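Audience: developers working on document processing, knowledge-base construction, and RAG systems | Reading time: about 15 minutes | Key takeaway: how to choose a parsing approach for each kind of PDF and avoid the common pitfalls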

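Of all the projects I've worked on, PDF parsing is one of the technical areas where I've hit the most pitfalls, and the best-hidden ones.

Not because it's hard, but because it looks easy.

On my first enterprise knowledge-base project, I confidently reached for PyPDF2, fed in every PDF to extract the text, and built a vector index. I tested a few dozen questions and the results looked good.

In the first week after launch, a user reported: "I asked about the company's shipments this year, and the AI said it didn't know."

I looked into it and found that in that annual report, the shipment figures sat in a carefully laid-out table. What PyPDF2 extracted looked like this:

Quarter Q1 Q2 Q3 Q4 Total Shipments (10k units) 15.2 18.7 22.1 31.5 87.5 YoY growth 8% 12% 15% 23% 15%

Everything in the table had collapsed into a single line of jumbled text, and the relationship between rows and columns was completely lost.

That is the reality of PDF parsing: on the surface the text was extracted successfully, but in practice the information was thrown away.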

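First, understand what a PDF really is

The core problem with the PDF format is that it was designed for display, not for information extraction.

What a PDF stores is not "paragraphs", "tables", or "headings", but a pile of instructions of the form "render this character at coordinates (x, y), in font X, at size Y".

So PDF parsing is essentially reverse engineering: rebuilding semantic structure from rendering instructions.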
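To make the "reverse engineering" idea concrete, here is a minimal sketch that rebuilds text lines from positioned characters. The `Glyph` tuple and `rebuild_lines` are hypothetical names made up for this illustration, and real parsers work with far richer data (fonts, widths, rotation); this only shows the basic grouping step, under the assumption that characters on the same line share roughly the same y coordinate.

```python
from collections import defaultdict

# Hypothetical glyph record: (x, y, text). This is roughly what the rendering
# instructions boil down to once decoded; it is not any library's real type.
Glyph = tuple[float, float, str]


def rebuild_lines(glyphs: list[Glyph], y_tolerance: float = 2.0) -> list[str]:
    """Group glyphs into lines by y position, then order each line by x."""
    buckets: dict[int, list[Glyph]] = defaultdict(list)
    for glyph in glyphs:
        # Glyphs whose y falls into the same tolerance-sized bucket share a line.
        # This is crude: two glyphs just either side of a bucket edge get split.
        buckets[round(glyph[1] / y_tolerance)].append(glyph)

    # PDF y coordinates grow upward, so a larger y means higher on the page.
    top_to_bottom = sorted(buckets.items(), key=lambda item: -item[0])
    return [
        ''.join(g[2] for g in sorted(row, key=lambda g: g[0]))
        for _, row in top_to_bottom
    ]


glyphs = [
    (30.0, 700.0, 'Q'), (36.0, 700.0, '1'),
    (10.0, 680.0, '1'), (16.0, 680.0, '5'), (22.0, 680.0, '.'), (28.0, 680.0, '2'),
]
print(rebuild_lines(glyphs))  # ['Q1', '15.2']
```

Note what is still missing after this step: nothing here knows that "Q1" is a column header and "15.2" is the value under it. Recovering that relationship is the hard part.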

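This explains why PDF parsing is so hard, and why different types of PDF need different handling:

Type 1: Native PDF (born digital). The file was exported directly from software such as Word, Excel, or PowerPoint and has a complete text layer inside. Parsing is relatively easy, but tables and multi-column layouts are still a challenge.

Type 2: Scanned PDF. A paper document scanned and saved as a PDF. It contains only images and no text layer, so OCR has to recognize the text first.

Type 3: Mixed PDF. Some pages are native text and some are scanned images (common when supplementary attachments are added to old files). You need to determine each page's type first and then handle each kind separately.

Type 4: Encrypted PDF. Password protected, or with text selection disabled. The permission issue has to be dealt with before parsing.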

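Hands-on comparison of the options

On an enterprise knowledge-base project I systematically tested several mainstream parsing options. The test documents included:

Annual reports with tables (multiple columns, merged cells)
Technical documentation (code blocks, multi-level lists)
Scanned contracts (medium clarity)
Product manuals mixing text and images

pdfplumber

The best fit for tables in native PDFs. It is built on pdfminer.six and is very good at recovering table structure.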

import pdfplumber

def extract_with_pdfplumber(pdf_path: str) -> dict:
    """
    Extract a PDF with pdfplumber, with a focus on tables.
    """
    result = {
        'text_pages': [],
        'tables': []
    }

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_data = {
                'page': page_num + 1,
                'text': '',
                'tables': []
            }

            # Detect tables once and reuse the result for both data and bboxes
            found_tables = page.find_tables()
            for table_idx, table in enumerate(found_tables):
                # Drop rows in which every cell is empty
                cleaned_table = [
                    row for row in table.extract()
                    if any(cell and str(cell).strip() for cell in row)
                ]
                if cleaned_table:
                    entry = {
                        'table_index': table_idx,
                        'data': cleaned_table
                    }
                    page_data['tables'].append(entry)
                    # Also keep a document-level list of tables, tagged with the page
                    result['tables'].append({'page': page_num + 1, **entry})

            # Extract text (excluding table regions to avoid duplication)
            if page_data['tables']:
                # On pages with tables, keep only words outside every table bbox.
                # A pdfplumber bbox is (x0, top, x1, bottom).
                table_bboxes = [t.bbox for t in found_tables]
                words = page.extract_words()
                non_table_words = [
                    w for w in words
                    if not any(
                        w['x0'] >= bbox[0] and w['x1'] <= bbox[2] and
                        w['top'] >= bbox[1] and w['bottom'] <= bbox[3]
                        for bbox in table_bboxes
                    )
                ]
                page_data['text'] = ' '.join(w['text'] for w in non_table_words)
            else:
                page_data['text'] = page.extract_text() or ''

            result['text_pages'].append(page_data)

    return result


def table_to_markdown(table_data: list[list]) -> str:
    """Convert table data to Markdown, a format that works well as LLM input."""
    if not table_data:
        return ''

    def clean(cell) -> str:
        # None becomes an empty string; newlines and pipes would break a Markdown row
        text = '' if cell is None else str(cell)
        return text.replace('\n', ' ').replace('|', '\\|').strip()

    # Treat the first row as the header
    header = [clean(cell) for cell in table_data[0]]
    rows = table_data[1:]

    lines = []
    lines.append('| ' + ' | '.join(header) + ' |')
    lines.append('| ' + ' | '.join(['---'] * len(header)) + ' |')

    for row in rows:
        row = [clean(cell) for cell in row]
        # Pad short rows so every row has the same number of columns
        while len(row) < len(header):
            row.append('')
        lines.append('| ' + ' | '.join(row[:len(header)]) + ' |')

    return '\n'.join(lines)

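Where pdfplumber fits:

Native PDFs with a clear table structure
Cases that need precise extraction of table data
Multi-column layouts (newspaper-style typesetting)

Not a fit: scanned PDFs (these need OCR) and PDFs with extremely complex formatting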
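Merged cells, which the test annual reports contained, deserve one extra step before a table goes to an LLM. The sketch below assumes that the continuation rows of a vertically merged cell come back as `None` or an empty string in the extracted rows; check this against your own documents, since it depends on how the table was drawn. `fill_merged_cells` is a hypothetical helper, and the sample figures are made up for illustration.

```python
def fill_merged_cells(table: list[list], fill_columns: tuple[int, ...] = (0,)) -> list[list]:
    """Copy the value from the row above into empty cells of the chosen columns."""
    filled: list[list] = []
    previous = None
    for row in table:
        new_row = list(row)
        for col in fill_columns:
            is_empty = col < len(new_row) and new_row[col] in (None, '')
            if is_empty and previous is not None and col < len(previous):
                # Assumption: an empty cell here continues the merged cell above it
                new_row[col] = previous[col]
        filled.append(new_row)
        previous = new_row
    return filled


# Made-up sample: the 'Region' column is merged across two rows per region
table = [
    ['Region', 'Quarter', 'Shipments'],
    ['East', 'Q1', '15.2'],
    [None, 'Q2', '18.7'],
    ['West', 'Q1', '9.1'],
    [None, 'Q2', '11.4'],
]
for row in fill_merged_cells(table):
    print(row)
```

Only columns listed in `fill_columns` are filled, because an empty cell in a data column usually means "no value" rather than "same as above".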

pypdf (formerly PyPDF2)

Plain text extraction. It is fast and suits simply formatted documents.

from pypdf import PdfReader

def extract_with_pypdf(pdf_path: str) -> dict:
    """
    Quick text extraction with pypdf.
    Suited to simply formatted native PDFs.
    """
    reader = PdfReader(pdf_path)
    pages = []

    for page_num, page in enumerate(reader.pages):
        text = page.extract_text(
            extraction_mode="layout",  # preserve the layout as far as possible
            layout_mode_space_vertically=False
        )
        pages.append({
            'page': page_num + 1,
            'text': text or ''
        })

    # Heuristic check for a scanned document (very little text)
    total_chars = sum(len(p['text']) for p in pages)
    avg_chars = total_chars / len(pages) if pages else 0
    is_scanned = avg_chars < 100  # under 100 characters per page on average: probably a scan

    return {
        'pages': pages,
        'is_likely_scanned': is_scanned,
        'total_pages': len(pages)
    }

OCR (required for scanned PDFs)

Scanned PDFs have to go through OCR. Here are the approaches I tested:

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def ocr_pdf(pdf_path: str, lang: str = 'chi_sim+eng') -> list[dict]:
    """
    Run OCR on a scanned PDF.
    Requires tesseract-ocr and the Chinese language pack to be installed.
    """
    # Render each page as a high-resolution image
    images = convert_from_path(pdf_path, dpi=300)

    results = []
    for page_num, image in enumerate(images):
        # Enhance the image before OCR
        enhanced_image = preprocess_for_ocr(image)

        # Run OCR
        text = pytesseract.image_to_string(
            enhanced_image,
            lang=lang,
            config='--psm 6'  # assume a single uniform block of text
        )

        results.append({
            'page': page_num + 1,
            'text': text,
            'method': 'ocr'
        })

    return results


def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Preprocess an image to improve OCR accuracy."""
    # Convert to a numpy array (force RGB so the grayscale conversion below is valid)
    img_array = np.array(image.convert('RGB'))

    # Convert to grayscale
    gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)

    # Denoise
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Binarize (adaptive threshold works better on unevenly lit scans)
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    return Image.fromarray(binary)
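OCR output usually needs a cleanup pass before it is indexed. A minimal sketch, assuming the raw text contains stray spaces between Chinese characters and long runs of blank lines (both are common in OCR output for Chinese, but verify against your own documents). `clean_ocr_text` is a hypothetical helper, and the character range covers only the basic CJK block.

```python
import re

# Whitespace that sits between two CJK ideographs (basic block U+4E00..U+9FFF only)
_CJK_GAP = re.compile(r'(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])')


def clean_ocr_text(text: str) -> str:
    """Remove spaces between Chinese characters and collapse runs of blank lines."""
    text = _CJK_GAP.sub('', text)
    # Three or more newlines in a row become a single paragraph break
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


print(clean_ocr_text('出 货 量 2024 年 Q4'))  # 出货量 2024 年 Q4
```

Spaces next to digits or Latin letters are left alone on purpose, since those are often real word boundaries.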

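Among OCR engines developed in China, I recommend PaddleOCR. On Chinese text it performed much better than Tesseract:

from functools import lru_cache

from paddleocr import PaddleOCR

# Note: this uses the PaddleOCR 2.x style API (use_angle_cls / ocr(..., cls=True))

@lru_cache(maxsize=1)
def _get_paddle_ocr() -> PaddleOCR:
    # Loading the models is slow, so build the engine once and reuse it
    return PaddleOCR(use_angle_cls=True, lang='ch')

def ocr_with_paddle(image_path: str) -> str:
    """
    Recognize text with PaddleOCR, which handles Chinese better.
    """
    result = _get_paddle_ocr().ocr(image_path, cls=True)

    # result[0] is None when no text was detected in the image
    lines = result[0] if result and result[0] else []

    # Each line is [box, (text, score)]; keep only the recognized text
    return '\n'.join(line[1][0] for line in lines)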
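The recognition result also carries a confidence score per line, which is worth using on medium-quality scans. The sketch below assumes each entry has the shape `[box, (text, score)]`, matching the indexing in the function above; the `0.6` threshold and the name `filter_ocr_lines` are illustrative choices, not recommendations from the library.

```python
def filter_ocr_lines(page_result, min_score: float = 0.6) -> list[str]:
    """Keep recognized lines whose confidence is at least min_score."""
    if not page_result:  # None or empty: nothing was detected on this page
        return []
    kept = []
    for _box, (text, score) in page_result:
        if score >= min_score and text.strip():
            kept.append(text)
    return kept


# Hand-written sample in the assumed [box, (text, score)] shape
sample = [
    [[[0, 0], [100, 0], [100, 20], [0, 20]], ('出货量（万件）', 0.98)],
    [[[0, 30], [100, 30], [100, 50], [0, 50]], ('l5.2', 0.41)],
]
print(filter_ocr_lines(sample))  # ['出货量（万件）']
```

Dropping low-confidence lines trades recall for precision. For figures such as shipment numbers, it may be safer to flag those lines for review than to silently discard them.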

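LlamaParse (cloud service, best results)

A cloud PDF parsing service from LlamaIndex. It is built on multimodal models, and of the options I tested it handled complex PDFs best.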

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

def parse_with_llamaparse(pdf_path: str) -> str:
    """
    Handle complex PDFs with LlamaParse.
    Requires LLAMA_CLOUD_API_KEY.
    """
    parser = LlamaParse(
        result_type="markdown",  # Markdown output keeps the structure
        language="ch_sim",  # Simplified Chinese
        verbose=True,
    )

    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
        input_files=[pdf_path],
        file_extractor=file_extractor
    ).load_data()

    # Join the content of all returned documents
    full_text = "\n\n---\n\n".join([doc.text for doc in documents])
    return full_text

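Downsides of LlamaParse: it is a paid service (the free quota is limited), and your data is sent to the cloud, so it is not suitable for confidential documents.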

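A decision tree for choosing an approach

What type is your PDF?
|
├── Native PDF (not scanned)
│   ├── Mostly plain text, simple formatting
│   │   └── pypdf: fast and cheap
│   ├── Tables or complex layout
│   │   └── pdfplumber: more accurate table extraction
│   └── Extremely complex formatting (mixed text and images, complex tables)
│       ├── Data may leave the country → LlamaParse
│       └── Data may not leave the country → process locally with a multimodal model
│
├── Scanned PDF
│   ├── Mostly Chinese
│   │   └── PaddleOCR: good results
│   ├── English or mixed Chinese and English
│   │   └── Tesseract or PaddleOCR
│   └── Complex tables
│       └── Vision model (GPT-4V/Claude Vision)
│
└── Mixed PDF (some pages are scans)
    └── Detect each page's type first, then handle each kind separately
        Native pages → pypdf/pdfplumber
        Scanned pages → OCR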
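The decision tree above can also be written down as a small function, which is handy when the choice has to be made per document inside a pipeline. This is only a sketch: `choose_parser` and its flag names are hypothetical, and the returned strings are labels for the branches of the tree, not calls into any library.

```python
def choose_parser(
    pdf_type: str,
    *,
    has_tables: bool = False,
    very_complex: bool = False,
    data_may_leave_country: bool = False,
    mostly_chinese: bool = True,
) -> str:
    """Return the branch of the decision tree that applies to a document."""
    if pdf_type == 'native':
        if very_complex:
            # Extremely complex formatting: cloud service only if the data may leave
            return 'LlamaParse' if data_may_leave_country else 'local multimodal model'
        if has_tables:
            return 'pdfplumber'
        return 'pypdf'
    if pdf_type == 'scanned':
        if has_tables:
            return 'vision model'
        return 'PaddleOCR' if mostly_chinese else 'Tesseract or PaddleOCR'
    if pdf_type == 'mixed':
        # Mixed documents are routed page by page rather than as a whole
        return 'per-page routing'
    raise ValueError(f'unknown pdf_type: {pdf_type!r}')


print(choose_parser('native', has_tables=True))  # pdfplumber
```

Keeping the rules in one function makes the routing testable and gives you a single place to update when an option is added or dropped.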

Code for detecting the PDF type

from pypdf import PdfReader

def detect_pdf_type(pdf_path: str) -> dict:
    """
    Detect the PDF type to decide which parsing strategy to use.
    """
    reader = PdfReader(pdf_path)
    page_types = []

    for page_num, page in enumerate(reader.pages):
        text = page.extract_text() or ''
        images = page.images

        char_count = len(text.strip())
        has_images = len(images) > 0

        # The thresholds are heuristics; tune them on your own documents
        if char_count > 200:
            page_type = 'native_text'
        elif has_images and char_count < 50:
            page_type = 'scanned'
        elif has_images and char_count >= 50:
            page_type = 'mixed'
        else:
            page_type = 'unknown'

        page_types.append({
            'page': page_num + 1,
            'type': page_type,
            'char_count': char_count,
            'has_images': has_images
        })

    # Count pages of each type
    type_counts = {}
    for pt in page_types:
        t = pt['type']
        type_counts[t] = type_counts.get(t, 0) + 1

    # Decide the overall type (guard against a PDF with no pages)
    total = len(page_types)
    if total == 0:
        overall_type = 'unknown'
    elif type_counts.get('scanned', 0) / total > 0.7:
        overall_type = 'mostly_scanned'
    elif type_counts.get('native_text', 0) / total > 0.7:
        overall_type = 'mostly_native'
    else:
        overall_type = 'mixed'

    return {
        'overall_type': overall_type,
        'total_pages': total,
        'page_types': page_types,
        'type_distribution': type_counts
    }
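For mixed PDFs, the per-page results then need to be turned into work for each parser. A minimal sketch, assuming the `page_types` list has the shape produced above (`page` and `type` keys): it groups consecutive pages that need the same handler into runs. `plan_page_runs` is a hypothetical helper, and sending 'mixed' and 'unknown' pages to OCR is a conservative assumption of this sketch, not something the tests above established.

```python
from itertools import groupby


def plan_page_runs(page_types: list[dict]) -> list[dict]:
    """Group consecutive pages that need the same handler into runs."""

    def handler_for(page_type: str) -> str:
        # Assumption: anything that is not clearly native text goes to OCR
        return 'native' if page_type == 'native_text' else 'ocr'

    runs = []
    for handler, group in groupby(page_types, key=lambda p: handler_for(p['type'])):
        pages = [p['page'] for p in group]
        runs.append({'handler': handler, 'first_page': pages[0], 'last_page': pages[-1]})
    return runs


page_types = [
    {'page': 1, 'type': 'native_text'},
    {'page': 2, 'type': 'native_text'},
    {'page': 3, 'type': 'scanned'},
    {'page': 4, 'type': 'native_text'},
]
for run in plan_page_runs(page_types):
    print(run)
```

Working in runs rather than single pages means only the scanned ranges have to be rendered as images; pdf2image's `convert_from_path` accepts `first_page` and `last_page` arguments for exactly this.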

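Specific advice for RAG systems

If you are parsing PDFs to build a knowledge base, a few more points matter a lot:

Chunking should respect structure

Don't simply split by character count. Keep each table's contents together rather than cutting through them: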

def smart_chunk_pdf_content(pages: list[dict], chunk_size: int = 500) -> list[dict]:
    """
    Structure-aware chunking that keeps tables intact.
    """
    chunks = []
    current_chunk = ''
    current_page = None  # page on which the current text chunk starts

    for page_data in pages:
        # Each table becomes its own chunk (never mixed with other text)
        for table in page_data.get('tables', []):
            table_md = table_to_markdown(table['data'])
            if table_md:
                chunks.append({
                    'content': table_md,
                    'type': 'table',
                    'page': page_data['page'],
                    'table_index': table['table_index']
                })

        # Split the remaining text by size, on paragraph boundaries
        text = page_data.get('text', '')
        paragraphs = text.split('\n\n')

        for para in paragraphs:
            if not para.strip():
                continue
            if len(current_chunk) + len(para) > chunk_size:
                if current_chunk:
                    chunks.append({'content': current_chunk, 'type': 'text', 'page': current_page})
                current_chunk = para
                current_page = page_data['page']
            else:
                if not current_chunk:
                    current_page = page_data['page']
                current_chunk = f'{current_chunk}\n\n{para}' if current_chunk else para

    if current_chunk:
        chunks.append({'content': current_chunk, 'type': 'text', 'page': current_page})

    return chunks

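Keep source information

Every chunk should record which document and which page it came from; otherwise there is no way to trace a citation back to its source.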
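One way to carry that information is a small record type attached to each chunk before it is embedded. This is a sketch under assumptions: `SourcedChunk`, `attach_source`, and the `doc_id` field are hypothetical names, and the input is taken to be a list of dicts with `content`, `type`, and (where available) `page` keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedChunk:
    content: str
    doc_id: str  # hypothetical identifier, e.g. the file name
    page: int    # 0 stands for "unknown" when a chunk carries no page
    chunk_type: str = 'text'

    def citation(self) -> str:
        """Human-readable source string to show next to an answer."""
        return f'{self.doc_id}, p. {self.page}'


def attach_source(chunks: list[dict], doc_id: str) -> list[SourcedChunk]:
    """Wrap plain chunk dicts in records that carry their origin."""
    return [
        SourcedChunk(
            content=c['content'],
            doc_id=doc_id,
            page=c.get('page') or 0,
            chunk_type=c.get('type', 'text'),
        )
        for c in chunks
    ]


chunks = [{'content': '| Quarter | Q1 |', 'type': 'table', 'page': 12}]
sourced = attach_source(chunks, doc_id='annual_report.pdf')
print(sourced[0].citation())  # annual_report.pdf, p. 12
```

Storing the same fields as metadata in the vector index lets the answer cite "annual_report.pdf, p. 12" instead of an anonymous passage.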

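Back to the knowledge-base project from the start of this article: I later switched the parsing pipeline to pdfplumber plus dedicated table-handling logic, and the "shipments" problem was solved.

PDF parsing is easy to start and hard to do well. Papering over it with a single pdf.extract_text() call will run, but the quality of your knowledge base quietly drops a notch. Taking the time to get this part solid is worth it.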