Article 2195: Intelligent Parsing of Document Images: OCR + LLM in Engineering Practice

Lao Zhang | 2026/4/30 | about 7 minutes

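Intended readers: Java engineers who need to process scans and image-based documents | Reading time: about 16 minutes | Key takeaway: a complete engineering approach to dual-engine (OCR + LLM) document parsing, which addresses the weak comprehension of OCR-only pipelines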

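One of our clients is an insurer that processes several thousand claim files every day. Most of them are scans: medical records, diagnosis certificates, and test reports.

They started with traditional OCR (Tesseract / Baidu OCR). The recognition rate was high, but the output was a pile of loose text, so they still had to write a rule engine to extract the key information: record date, diagnosis, and medication. Even with several hundred rules, there were still plenty of misses and false positives.

We then tried a different split: OCR is only responsible for turning images into text, and the LLM is responsible for understanding that text and extracting structure from it. The improvement was obvious, but several engineering pitfalls cost us a good deal of time.

This article lays out how the combined "OCR+LLM" architecture is implemented.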

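1. Why Not Feed Document Images Directly to a VLM

Many people's first instinct is: GPT-4V can read images, so why not just hand it the scans?

The idea is sound, but in engineering terms it runs into several constraints: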

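Constraint 1: Token cost

A single A4 scan is typically 2480x3508 pixels (300 dpi). In high-detail mode, that one image alone consumes roughly 7,000 tokens (the exact count depends on how the model tokenizes images). A 10-page medical record therefore uses about 70,000 tokens for images alone. Add the prompt and the output, and a single call costs more than 0.1 USD. At several thousand documents per day, the cost quickly becomes prohibitive.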
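The arithmetic above can be sketched as a small estimator. The 7,000 tokens per page and 10 pages come from the text. The price of 2.50 USD per million tokens is purely an assumption for illustration, and `VlmCostEstimate` is a hypothetical helper name, so substitute the rates of the model you actually use.

```java
/** Back-of-the-envelope VLM image cost estimate (hypothetical helper). */
final class VlmCostEstimate {

    /** Total image tokens for a document with the given number of pages. */
    static long imageTokens(int pages, int tokensPerPage) {
        return (long) pages * tokensPerPage;
    }

    /** Cost in USD for a token count at an assumed price per million tokens. */
    static double costUsd(long tokens, double usdPerMillionTokens) {
        return tokens / 1_000_000.0 * usdPerMillionTokens;
    }

    /** Daily cost in USD when every document has the same page count. */
    static double dailyCostUsd(int docsPerDay, int pages, int tokensPerPage,
                               double usdPerMillionTokens) {
        return docsPerDay * costUsd(imageTokens(pages, tokensPerPage), usdPerMillionTokens);
    }
}
```

With these assumed numbers, `imageTokens(10, 7000)` gives 70,000 tokens, which is about 0.175 USD per document for the images alone, before the prompt and output tokens are added.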

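Constraint 2: Recognition accuracy

For scans with a lot of handwriting or poor image quality, a dedicated OCR engine (especially one fine-tuned for the domain) usually recognizes text more accurately than a general-purpose VLM. The strength of a VLM is understanding, not recognizing every single character.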

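Constraint 3: Privacy and compliance

Sending medical data to a commercial API carries compliance risk. Running OCR locally and passing only de-identified text to the LLM keeps that risk far more manageable.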
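As a minimal sketch of the de-identification step, the helper below masks two easy-to-match identifiers in OCR text before it leaves the local environment. The class name `OcrTextMasker`, the placeholder tokens, and the two patterns (18-character mainland China resident ID numbers and 11-digit mobile numbers) are assumptions for illustration. Real de-identification of medical text also has to cover names, addresses, and record numbers, which regular expressions alone do not handle reliably.

```java
import java.util.regex.Pattern;

/** Masks a few directly identifying numbers in OCR text (illustrative, not exhaustive). */
final class OcrTextMasker {

    // 17 digits followed by a digit or X, not embedded in a longer digit run
    private static final Pattern ID_CARD =
            Pattern.compile("(?<!\\d)\\d{17}[\\dXx](?!\\d)");

    // 11 digits starting with 1 and a second digit in 3-9, not embedded in a longer digit run
    private static final Pattern MOBILE =
            Pattern.compile("(?<!\\d)1[3-9]\\d{9}(?!\\d)");

    /** Replaces matched identifiers with fixed placeholder tokens. */
    static String mask(String text) {
        String masked = ID_CARD.matcher(text).replaceAll("[ID]");
        return MOBILE.matcher(masked).replaceAll("[PHONE]");
    }
}
```

If a field that must be extracted (for example the patient name) is also sensitive, one option is to replace it with a reversible token locally and map it back after the LLM returns.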

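Best approach: a clear division of labor

OCR converts images to text, the LLM handles understanding and structured extraction, and a VLM is reserved for regions where OCR confidence is low.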

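2. Image Preprocessing: The Key to OCR Accuracy

As a rule of thumb, about 90% of OCR accuracy depends on the quality of the input image, and only about 10% on the engine itself.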

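import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.imgproc.Imgproc;
import org.opencv.photo.Photo;
import org.springframework.stereotype.Component;

@Component
public class DocumentImageEnhancer {
    
    /**
     * Document image enhancement pipeline.
     * Expects a 3-channel BGR image (the default result of Imgcodecs.imread).
     */
    public Mat enhance(Mat inputImage) {
        // Enhance the image with OpenCV
        Mat gray = new Mat();
        Mat denoised = new Mat();
        Mat deskewed;
        Mat binaryImage = new Mat();
        
        // 1. Convert to grayscale
        Imgproc.cvtColor(inputImage, gray, Imgproc.COLOR_BGR2GRAY);
        
        // 2. Denoise (non-local means, which preserves edges)
        Photo.fastNlMeansDenoising(gray, denoised, 10, 7, 21);
        
        // 3. Deskew (detect the text skew angle and rotate)
        deskewed = deskewDocument(denoised);
        
        // 4. Binarize (adaptive threshold, which copes with uneven lighting)
        Imgproc.adaptiveThreshold(deskewed, binaryImage, 255,
            Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C,
            Imgproc.THRESH_BINARY, 11, 2);
        
        return binaryImage;
    }
    
    /**
     * Deskew: detect Hough line segments and derive the rotation angle from them.
     */
    private Mat deskewDocument(Mat gray) {
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 50, 150, 3, false);
        
        // Each row holds one segment as (x1, y1, x2, y2)
        Mat lines = new Mat();
        Imgproc.HoughLinesP(edges, lines, 1, Math.PI / 180, 100, 100, 10);
        
        if (lines.empty()) {
            return gray;
        }
        
        // Compute the angle of every detected segment
        List<Double> angles = new ArrayList<>();
        for (int i = 0; i < lines.rows(); i++) {
            double[] line = lines.get(i, 0);
            double angle = Math.atan2(line[3] - line[1], line[2] - line[0]);
            // Keep only near-horizontal segments (document text is mostly horizontal)
            if (Math.abs(angle) < Math.PI / 4) {
                angles.add(angle);
            }
        }
        
        if (angles.isEmpty()) {
            return gray;
        }
        
        // Use the median angle, which is robust to outlier segments
        Collections.sort(angles);
        double medianAngle = angles.get(angles.size() / 2);
        double degrees = Math.toDegrees(medianAngle);
        
        // Skip correction when the skew is under 0.5 degrees (avoids over-processing)
        if (Math.abs(degrees) < 0.5) {
            return gray;
        }
        
        // Rotate the image around its center
        Point center = new Point(gray.cols() / 2.0, gray.rows() / 2.0);
        Mat rotationMatrix = Imgproc.getRotationMatrix2D(center, degrees, 1.0);
        Mat rotated = new Mat();
        Imgproc.warpAffine(gray, rotated, rotationMatrix, gray.size(),
            Imgproc.INTER_LINEAR, Core.BORDER_REPLICATE);
        
        return rotated;
    }
}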

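Integrating PaddleOCR from Java (via an HTTP API)

In our experience PaddleOCR recognizes Chinese text markedly better than Tesseract, but it runs as a Python service. The practical approach is to deploy a PaddleOCR REST service with Docker and call it from Java over HTTP: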

import java.util.ArrayList;
import java.util.Base64;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class PaddleOCRClient {
    
    private final RestTemplate restTemplate;
    
    @Value("${ocr.paddleocr.url:http://localhost:8866}")
    private String paddleOcrUrl;
    
    public PaddleOCRClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }
    
    public OCRResult recognize(byte[] imageBytes) {
        String base64Image = Base64.getEncoder().encodeToString(imageBytes);
        
        // The request and response shapes follow the OCR service wrapper used here;
        // adjust the path and field names to the service you actually deploy
        Map<String, String> request = Map.of("image", base64Image);
        
        ResponseEntity<Map> response = restTemplate.postForEntity(
            paddleOcrUrl + "/ocr/predict/ocr_system",
            request,
            Map.class
        );
        
        return parseOCRResponse(response.getBody());
    }
    
    @SuppressWarnings("unchecked")
    private OCRResult parseOCRResponse(Map<String, Object> responseBody) {
        List<TextBlock> textBlocks = new ArrayList<>();
        
        // An empty HTTP body is treated the same as "no results"
        List<Map<String, Object>> results = responseBody == null
            ? null
            : (List<Map<String, Object>>) responseBody.get("results");
        
        if (results == null) {
            return new OCRResult(List.of(), 0.0);
        }
        
        double totalConfidence = 0;
        for (Map<String, Object> result : results) {
            String text = (String) result.get("text");
            double confidence = ((Number) result.get("confidence")).doubleValue();
            // Four corner points, each as [x, y]
            List<List<Integer>> bbox = (List<List<Integer>>) result.get("text_region");
            
            textBlocks.add(new TextBlock(text, confidence, bbox));
            totalConfidence += confidence;
        }
        
        double avgConfidence = results.isEmpty() ? 0 : totalConfidence / results.size();
        return new OCRResult(textBlocks, avgConfidence);
    }
    
    public record OCRResult(List<TextBlock> blocks, double avgConfidence) {
        public String toPlainText() {
            return blocks.stream()
                .sorted(Comparator.comparingDouble(b -> b.bbox().get(0).get(1))) // sort by the Y coordinate of the first corner
                .map(TextBlock::text)
                .collect(Collectors.joining("\n"));
        }
    }
    
    public record TextBlock(String text, double confidence, List<List<Integer>> bbox) {}
}
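`toPlainText()` above sorts blocks by Y coordinate only, so two blocks on the same printed line (for example a label and its value) can swap places when their Y values differ by a pixel or two. One possible refinement, sketched below, groups blocks into lines by a vertical tolerance and then sorts each line by X. The `ReadingOrder` class, its simplified `Block` record, and the half-height tolerance are assumptions for illustration, not part of the original client.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Orders OCR blocks top-to-bottom, then left-to-right within a line (illustrative). */
final class ReadingOrder {

    /** Simplified block: text plus the top-left corner and height in pixels. */
    record Block(String text, int x, int y, int height) {}

    static String toText(List<Block> blocks) {
        List<Block> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingInt(Block::y));

        // Group blocks into lines: a block joins the current line when its Y offset
        // from the line's first block is under half of that block's height
        List<List<Block>> lines = new ArrayList<>();
        for (Block b : sorted) {
            List<Block> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current != null
                    && Math.abs(b.y() - current.get(0).y()) < current.get(0).height() / 2.0) {
                current.add(b);
            } else {
                List<Block> line = new ArrayList<>();
                line.add(b);
                lines.add(line);
            }
        }

        // Within each line, order blocks left to right and join them with spaces
        StringBuilder sb = new StringBuilder();
        for (List<Block> line : lines) {
            line.sort(Comparator.comparingInt(Block::x));
            if (sb.length() > 0) {
                sb.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(line.get(i).text());
            }
        }
        return sb.toString();
    }
}
```

This still assumes a single-column layout. Multi-column pages and tables need layout analysis before the text is linearized.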

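3. The LLM Understanding Layer: Turning OCR Output into Structured Data

OCR gives us raw text; the LLM is responsible for understanding it and extracting fields. The key to this layer is prompt design: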

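import java.util.Map;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;
// ResponseFormat also needs an import; its package differs between Spring AI versions

@Service
public class DocumentStructureExtractor {
    
    private final ChatClient chatClient;
    private final ObjectMapper objectMapper;
    
    public DocumentStructureExtractor(ChatClient chatClient, ObjectMapper objectMapper) {
        this.chatClient = chatClient;
        this.objectMapper = objectMapper;
    }
    
    // Extraction templates per document type
    private static final Map<DocumentType, String> EXTRACTION_PROMPTS = Map.of(
        DocumentType.MEDICAL_RECORD, """
            The following text was recognized by OCR from a medical document. Extract structured information from it.
            Notes:
            1. The OCR output may contain recognition errors; infer the correct content from context
            2. Normalize all dates to YYYY-MM-DD
            3. If a field cannot be determined, set its value to null
            4. Return JSON only, with no explanatory text
            
            Fields to extract:
            {
              "patientName": "patient name",
              "visitDate": "visit date",
              "diagnosis": ["list of diagnoses"],
              "medications": [{"name": "drug name", "dosage": "dosage", "frequency": "frequency"}],
              "doctorName": "doctor name",
              "hospitalName": "hospital name",
              "chiefComplaint": "chief complaint"
            }
            
            OCR text:
            {ocrText}
            """,
        DocumentType.INVOICE, """
            The following text was recognized by OCR from an invoice or receipt. Extract structured information from it.
            {ocrText}
            
            Return JSON in the following format (amounts with two decimal places):
            {
              "invoiceNo": "invoice number",
              "invoiceDate": "invoice date",
              "sellerName": "seller name",
              "buyerName": "buyer name",
              "totalAmount": 0.00,
              "taxAmount": 0.00,
              "items": [{"description": "item name", "quantity": 0, "unitPrice": 0.00, "amount": 0.00}]
            }
            """
    );
    
    public <T> T extractStructuredData(String ocrText, DocumentType docType, 
                                        Class<T> targetClass) {
        String promptTemplate = EXTRACTION_PROMPTS.get(docType);
        if (promptTemplate == null) {
            throw new UnsupportedOperationException("Unsupported document type: " + docType);
        }
        
        // Plain string replacement, so the JSON braces in the template need no escaping
        String prompt = promptTemplate.replace("{ocrText}", ocrText);
        
        // Use JSON mode so the output is well-formed JSON.
        // The builder method names below match the Spring AI version used here and
        // differ in other releases; check them against your version
        String jsonResponse = chatClient.prompt()
            .user(prompt)
            .options(OpenAiChatOptions.builder()
                .withResponseFormat(new ResponseFormat(ResponseFormat.Type.JSON_OBJECT))
                .withTemperature(0.0f) // temperature 0 for extraction tasks, for more consistent output
                .build())
            .call()
            .content();
        
        try {
            return objectMapper.readValue(jsonResponse, targetClass);
        } catch (JsonProcessingException e) {
            // On a parse failure, try a repair first (LLMs occasionally emit non-standard JSON)
            String repairedJson = repairJson(jsonResponse);
            try {
                return objectMapper.readValue(repairedJson, targetClass);
            } catch (JsonProcessingException ex) {
                throw new DocumentExtractionException("JSON parsing failed: " + jsonResponse, ex);
            }
        }
    }
    
    /**
     * Simple JSON repair for common LLM output problems.
     */
    private String repairJson(String json) {
        // Strip markdown code-fence markers
        json = json.replaceAll("```json\\s*", "").replaceAll("```\\s*", "");
        // Trim leading and trailing whitespace
        json = json.trim();
        return json;
    }
}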
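`repairJson` above only strips code-fence markers. Another common failure is a model that wraps the JSON in a sentence or two of prose. A slightly broader heuristic, sketched here under the assumption that the response contains exactly one top-level JSON object, keeps only the text between the first `{` and the last `}`. `JsonOutputCleaner` is a hypothetical helper name.

```java
/** Heuristic cleanup for LLM responses that wrap a JSON object in extra text. */
final class JsonOutputCleaner {

    /**
     * Returns the substring from the first '{' to the last '}' inclusive.
     * Falls back to the trimmed input when no such pair exists.
     */
    static String extractJsonObject(String raw) {
        if (raw == null) {
            return null;
        }
        int start = raw.indexOf('{');
        int end = raw.lastIndexOf('}');
        if (start < 0 || end <= start) {
            return raw.trim();
        }
        return raw.substring(start, end + 1);
    }
}
```

The heuristic breaks if the trailing prose itself contains a `}`, so the result should still go through the JSON parser and a failure should be counted in the extraction success metric.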

4. A Confidence-Driven Hybrid Strategy

Not every region of the OCR output can be trusted. Low-confidence regions can be recognized again with a VLM:

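import java.util.List;
import java.util.stream.Collectors;

import org.springframework.stereotype.Service;

@Service
public class HybridOCRService {
    
    private final PaddleOCRClient paddleOCRClient;
    // VisionService, VisionRequest and ImageInput are the project's own VLM wrapper types
    private final VisionService visionService;
    
    // Regions below this confidence are recognized again with the VLM
    private static final double CONFIDENCE_THRESHOLD = 0.7;
    
    public HybridOCRService(PaddleOCRClient paddleOCRClient, VisionService visionService) {
        this.paddleOCRClient = paddleOCRClient;
        this.visionService = visionService;
    }
    
    public String processDocument(byte[] imageBytes) {
        // 1. Full-page recognition with PaddleOCR
        PaddleOCRClient.OCRResult ocrResult = paddleOCRClient.recognize(imageBytes);
        
        // 2. Find the low-confidence regions
        List<PaddleOCRClient.TextBlock> lowConfBlocks = ocrResult.blocks().stream()
            .filter(b -> b.confidence() < CONFIDENCE_THRESHOLD)
            .collect(Collectors.toList());
        
        if (lowConfBlocks.isEmpty()) {
            // Everything is high confidence (or nothing was recognized): use the OCR result as is
            return ocrResult.toPlainText();
        }
        
        // 3. If more than 30% of the blocks are low confidence, re-recognize the whole page with the VLM
        double lowConfRatio = (double) lowConfBlocks.size() / ocrResult.blocks().size();
        if (lowConfRatio > 0.3) {
            return recognizeWithVLM(imageBytes);
        }
        
        // 4. Otherwise crop only the low-confidence regions and supplement them with the VLM
        String baseText = ocrResult.toPlainText();
        // The cropping logic is omitted here; in practice, crop the image by bbox
        // and call the VLM separately for each low-confidence region
        return baseText; // simplified for demonstration
    }
    
    private String recognizeWithVLM(byte[] imageBytes) {
        VisionRequest request = VisionRequest.builder()
            .images(List.of(ImageInput.fromBytes(imageBytes, "image/jpeg")))
            .prompt("""
                Read this document image carefully and extract all of the text it contains.
                Requirements:
                1. Preserve the paragraph structure of the original document
                2. If there are tables, describe their contents in text form
                3. Output only the text in the document, without adding any explanation
                """)
            .build();
        
        return visionService.analyzeImage(request).getContent();
    }
}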
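Step 4 above leaves out the cropping logic. A minimal sketch of that step, assuming the bbox is the four-corner `[x, y]` list returned by the OCR service and the page is available as a `BufferedImage`, could look like the following. `BboxCropper` and the padding parameter are hypothetical; a few pixels of padding tends to help because tight OCR boxes can clip strokes.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.List;

/** Crops a page image to the axis-aligned bounds of a four-corner OCR box (illustrative). */
final class BboxCropper {

    /** Axis-aligned rectangle around the corner points, padded and clamped to the image. */
    static Rectangle toRect(List<List<Integer>> bbox, int padding, int imgWidth, int imgHeight) {
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
        for (List<Integer> point : bbox) {
            minX = Math.min(minX, point.get(0));
            maxX = Math.max(maxX, point.get(0));
            minY = Math.min(minY, point.get(1));
            maxY = Math.max(maxY, point.get(1));
        }
        int x0 = Math.max(0, minX - padding);
        int y0 = Math.max(0, minY - padding);
        int x1 = Math.min(imgWidth, maxX + padding);
        int y1 = Math.min(imgHeight, maxY + padding);
        return new Rectangle(x0, y0, x1 - x0, y1 - y0);
    }

    /** Returns the sub-image for the box; assumes the box overlaps the page. */
    static BufferedImage crop(BufferedImage page, List<List<Integer>> bbox, int padding) {
        Rectangle r = toRect(bbox, padding, page.getWidth(), page.getHeight());
        return page.getSubimage(r.x, r.y, r.width, r.height);
    }
}
```

Each cropped region would then be encoded (for example with `ImageIO.write`) and sent to the VLM, and the returned text would replace the corresponding low-confidence block before the blocks are joined into plain text.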

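5. Engineering Notes for Production

Asynchronous processing architecture

Document processing is a mixed workload: CPU-bound (image processing) plus IO-bound (API calls). Decoupling it with a message queue is recommended: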

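import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class DocumentProcessingWorker {
    
    private static final Logger log = LoggerFactory.getLogger(DocumentProcessingWorker.class);
    
    // TaskStatusService and StorageService are type names assumed from the field names used below
    private final TaskStatusService taskStatusService;
    private final StorageService storageService;
    private final HybridOCRService hybridOCRService;
    private final DocumentStructureExtractor documentStructureExtractor;
    
    public DocumentProcessingWorker(TaskStatusService taskStatusService, StorageService storageService,
            HybridOCRService hybridOCRService, DocumentStructureExtractor documentStructureExtractor) {
        this.taskStatusService = taskStatusService;
        this.storageService = storageService;
        this.hybridOCRService = hybridOCRService;
        this.documentStructureExtractor = documentStructureExtractor;
    }
    
    @RabbitListener(queues = "document.processing")
    public void processDocument(DocumentProcessingMessage message) {
        String taskId = message.getTaskId();
        try {
            // Mark the task as in progress
            taskStatusService.updateStatus(taskId, TaskStatus.PROCESSING);
            
            // Run the processing pipeline
            byte[] imageBytes = storageService.download(message.getStoragePath());
            String ocrText = hybridOCRService.processDocument(imageBytes);
            Object structuredData = documentStructureExtractor.extractStructuredData(
                ocrText, message.getDocumentType(), message.getTargetClass());
            
            // Save the result
            taskStatusService.saveResult(taskId, structuredData);
            taskStatusService.updateStatus(taskId, TaskStatus.COMPLETED);
            
        } catch (Exception e) {
            log.error("Document processing failed: taskId={}", taskId, e);
            taskStatusService.updateStatus(taskId, TaskStatus.FAILED, e.getMessage());
        }
    }
}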

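Monitoring processing metrics

For a document processing system, the following metrics must be monitored:

Average OCR confidence (alert when it drops below 0.7)
Share of VLM calls (if the VLM share is higher than expected, document quality is the likely cause)
Structured extraction success rate (the proportion of JSON parsing failures)
End-to-end processing time (broken down into P50/P95/P99)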
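To make the P50/P95/P99 breakdown concrete, here is a minimal nearest-rank percentile calculation over recorded processing times. In production a metrics library would normally compute this for you; the `LatencyPercentiles` class and the sample values are assumptions for illustration.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over end-to-end processing times in milliseconds (illustrative). */
final class LatencyPercentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For the five samples `{120, 80, 200, 95, 3000}`, P50 is 120 ms while P95 and P99 are both 3000 ms. That gap is why the tail percentiles are tracked separately: a single slow VLM fallback barely moves the median but dominates the tail.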