How can an OCR + LLM document information extraction pipeline be optimized for accuracy?


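Problem Analysis

Pipelines that combine OCR (optical character recognition) with an LLM (large language model) for document information extraction are a core part of the technology stack in enterprise digitalization. Typical applications include invoice recognition, contract parsing, and ID document entry. In practice, however, developers often find that accuracy falls short of requirements: even with a state-of-the-art (SOTA) OCR engine and a strong LLM, end-to-end extraction accuracy can still fall below 80%.

The first source of problems is the OCR layer. Business documents vary in layout far more than expected: nested tables, watermarks, overlapping seals, skewed photos, and low-resolution scans can all cause the OCR output to miss text, scramble its order, or contain characters that are not on the page. In Chinese documents, mixed handwriting and print, misrecognized rare characters, and dropped punctuation are especially common.

The second is the hand-off between OCR and the LLM. Raw OCR text usually carries little structural information: coordinates and layout are discarded, leaving a one-dimensional text stream. The LLM receives data with the visual layout stripped out and cannot tell which fragments belong to the same line or which cells make up one table. This loss of information causes frequent errors on complex layouts.

The third is the limit of the LLM's own extraction ability. Modern LLMs understand language well, but they are trained mainly on natural-language text rather than on parsing structured documents. Faced with noisy OCR output (wrong characters, shuffled order, repeated fragments), an LLM may hallucinate, inventing field values that do not exist or mixing up different fields.

The last is insufficient prompt engineering. Many developers feed OCR text straight to the LLM and expect it to do the extraction on its own. Without explicit field definitions, worked examples, and constraints, the output format is often inconsistent and hard to connect to downstream business systems.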

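Solution Approach

Improving end-to-end extraction accuracy calls for optimizing the pipeline layer by layer:

Layer 1: OCR engine optimization

Choosing a suitable OCR engine is the foundation. For printed Chinese documents, PaddleOCR, EasyOCR, and Baidu AI perform relatively well; handwriting or complex scenes may require custom training. Beyond engine choice, preprocessing matters just as much: denoising, deskewing, and contrast enhancement can all raise the recognition rate.

The key point is to keep the layout coordinates that OCR produces. Modern OCR engines such as PaddleOCR can output a bounding box and a confidence score for each text block. This information is essential for rebuilding table structure later.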
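
As a minimal sketch of what keeping layout coordinates can look like, the snippet below serializes OCR blocks into lines that carry a position and a confidence value, so that a downstream LLM still sees where each fragment sat on the page. The `OcrBlock` type and the `blocks_to_layout_text` helper are hypothetical names used for illustration and are not part of PaddleOCR; the strict top-to-bottom, left-to-right sort is a simplifying assumption that ignores small vertical jitter within a line.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrBlock:
    """Hypothetical container for one OCR text block."""
    text: str
    bbox: List[Tuple[int, int]]  # four corner points
    confidence: float


def _top_left(block: OcrBlock) -> Tuple[int, int]:
    """Return (y, x) of the block's top-left corner, used as the sort key."""
    x = min(p[0] for p in block.bbox)
    y = min(p[1] for p in block.bbox)
    return (y, x)


def blocks_to_layout_text(blocks: List[OcrBlock]) -> str:
    """Render blocks top-to-bottom, left-to-right, keeping position and confidence."""
    lines = []
    for block in sorted(blocks, key=_top_left):
        y, x = _top_left(block)
        lines.append(f"[x={x}, y={y}, conf={block.confidence:.2f}] {block.text}")
    return "\n".join(lines)


demo = [
    OcrBlock("Total: 100.00", [(300, 200), (420, 200), (420, 220), (300, 220)], 0.98),
    OcrBlock("Invoice No. 12345", [(40, 50), (260, 50), (260, 72), (40, 72)], 0.91),
]
layout_text = blocks_to_layout_text(demo)
print(layout_text)
```

Whether the coordinates are passed as a prefix like this or in some other form is a design choice; the point is only that they are not thrown away before the LLM step.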

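Layer 2: Structure reconstruction

Rebuild a two-dimensional structure from the one-dimensional OCR output. The core idea is spatial clustering on the coordinates: text blocks on the same horizontal line are merged into a row, and blocks that fall in the same vertical band are grouped into a column. For table recognition, you can use heuristics or a dedicated table structure recognition model such as Table Transformer.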
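
The row-and-column clustering described above can be sketched as follows. This is a simple gap-based heuristic, not a table recognition model: `cluster_1d` and `to_grid` are hypothetical helpers, boxes are assumed to be axis-aligned `(x_min, y_min, x_max, y_max)` tuples, and the tolerances are illustrative pixel values.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def cluster_1d(values: List[float], tol: float) -> List[int]:
    """Label each value with a cluster index; sorted neighbours within tol share a cluster."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    cluster = 0
    for rank, idx in enumerate(order):
        if rank > 0 and values[idx] - values[order[rank - 1]] > tol:
            cluster += 1
        labels[idx] = cluster
    return labels


def to_grid(items: List[Tuple[str, Box]],
            y_tol: float = 10, x_tol: float = 20) -> List[List[str]]:
    """Place each text block into a row/column grid based on its coordinates."""
    if not items:
        return []
    y_centers = [(box[1] + box[3]) / 2 for _, box in items]
    x_lefts = [box[0] for _, box in items]
    rows = cluster_1d(y_centers, y_tol)   # same horizontal line -> same row
    cols = cluster_1d(x_lefts, x_tol)     # same vertical band -> same column
    grid = [[""] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for (text, _), r, c in zip(items, rows, cols):
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


items = [
    ("Item", (10, 10, 60, 30)), ("Qty", (200, 12, 240, 30)), ("Price", (320, 11, 380, 30)),
    ("Paper", (12, 50, 70, 70)), ("19.90", (322, 51, 380, 70)),
]
grid = to_grid(items)
print(grid)
```

In the sample data the second row has no quantity cell; clustering columns as well as rows keeps that cell as an empty string instead of shifting the price into the wrong column.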

Another approach is layout analysis: first divide the document into regions (title, body text, table, figure), then apply a different parsing strategy to each region type.
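
Once regions have been labeled, applying a different strategy per region type can be as simple as a dispatch table. The sketch below assumes the regions already come from some layout analysis step as dicts with `type` and `lines` keys; that input format, the parser functions, and the rule that table cells are separated by two or more spaces are all illustrative assumptions.

```python
import re
from typing import Any, Callable, Dict, List


def parse_table(lines: List[str]) -> List[List[str]]:
    """Split each line into cells; assumes cells are separated by 2+ spaces."""
    return [re.split(r"\s{2,}", line.strip()) for line in lines]


def parse_text(lines: List[str]) -> str:
    """Join free-text lines into one string."""
    return " ".join(line.strip() for line in lines)


# Map each region type to the parser that should handle it.
PARSERS: Dict[str, Callable[[List[str]], Any]] = {
    "title": parse_text,
    "paragraph": parse_text,
    "table": parse_table,
}


def parse_regions(regions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Route every region to its parser, falling back to plain text."""
    parsed = []
    for region in regions:
        parser = PARSERS.get(region["type"], parse_text)
        parsed.append({"type": region["type"], "content": parser(region["lines"])})
    return parsed


regions = [
    {"type": "title", "lines": ["VAT Invoice"]},
    {"type": "table", "lines": ["Paper  2  19.90", "Pens   10  35.00"]},
]
parsed = parse_regions(regions)
print(parsed)
```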

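Layer 3: LLM prompt optimization

Design a structured prompt with three parts:

Role definition: state explicitly that the LLM acts as an information extraction expert
Field specification: list the exact name, type, and constraints of every field to extract
Few-shot examples: provide annotated examples that show the LLM the expected output format

For complex cases, Chain-of-Thought prompting can be used: have the LLM analyze the document structure first, then extract field by field, and finally check the result for consistency.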
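
To make the three prompt parts and the Chain-of-Thought steps concrete, here is a small prompt builder in plain Python. The field list, the single few-shot example, and the wording of the instructions are invented for illustration; a real prompt would use the fields and annotated samples of the target document type.

```python
import json
from typing import Dict, List

# Hypothetical field specification: name -> type and constraint.
FIELD_SPECS: Dict[str, str] = {
    "invoice_number": "string, digits only",
    "invoice_date": "string, YYYY-MM-DD",
    "total_amount": "number, no currency symbol or thousands separator",
}

# Hypothetical annotated example used as a few-shot demonstration.
FEW_SHOT: List[dict] = [
    {
        "ocr_text": "Invoice No: 00123456\nDate: 2024/03/05\nTotal: ¥1,280.00",
        "output": {"invoice_number": "00123456",
                   "invoice_date": "2024-03-05",
                   "total_amount": 1280.0},
    }
]


def build_extraction_prompt(ocr_text: str) -> str:
    """Assemble role, field spec, reasoning steps, few-shot examples, and the input."""
    parts = ["You are an information extraction expert for financial documents."]
    spec_lines = [f"- {name}: {rule}" for name, rule in FIELD_SPECS.items()]
    parts.append("Fields to extract (use null when a field is absent):\n" + "\n".join(spec_lines))
    parts.append(
        "Work in three steps: (1) describe the document structure, "
        "(2) extract each field, (3) check the fields for consistency. "
        "Output the final JSON object on the last line."
    )
    for i, example in enumerate(FEW_SHOT, 1):
        parts.append(f"Example {i} input:\n{example['ocr_text']}")
        parts.append(f"Example {i} output:\n{json.dumps(example['output'], ensure_ascii=False)}")
    parts.append(f"Input:\n{ocr_text}")
    return "\n\n".join(parts)


prompt = build_extraction_prompt("Invoice No: 999")
print(prompt)
```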

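Layer 4: Post-processing and validation

Apply rule-based validation and cross-checks to the LLM output. For example, date fields must match a date format, amount fields must be numeric, and key fields must not be empty. Suspicious results can trigger manual review or a second LLM verification pass.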
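
The rules listed above can be sketched as a small validator that returns a list of problems; an empty list means the record passed. The choice of required fields and the cross-check that the tax amount should not exceed the total are assumptions made for this example, not rules taken from any invoicing standard.

```python
import re
from datetime import datetime
from typing import Any, Dict, List

REQUIRED_FIELDS = ("invoice_number", "invoice_date", "total_amount")  # assumed key fields


def _is_number(value: Any) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_invoice(fields: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in an extracted invoice record."""
    problems = []

    # Key fields must not be empty.
    for name in REQUIRED_FIELDS:
        if fields.get(name) in (None, ""):
            problems.append(f"{name}: missing")

    # Dates must be real calendar dates in YYYY-MM-DD form.
    date = fields.get("invoice_date")
    if date not in (None, ""):
        ok = isinstance(date, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None
        if ok:
            try:
                datetime.strptime(date, "%Y-%m-%d")
            except ValueError:
                ok = False
        if not ok:
            problems.append("invoice_date: not a valid YYYY-MM-DD date")

    # Amounts must be numeric.
    for name in ("total_amount", "tax_amount"):
        value = fields.get(name)
        if value is not None and not _is_number(value):
            problems.append(f"{name}: not a number")

    # Cross-check between fields (assumed rule for this sketch).
    total, tax = fields.get("total_amount"), fields.get("tax_amount")
    if _is_number(total) and _is_number(tax) and tax > total:
        problems.append("tax_amount: larger than total_amount")

    return problems


good = {"invoice_number": "00123456", "invoice_date": "2024-03-05",
        "total_amount": 1280.0, "tax_amount": 147.26}
bad = {"invoice_number": "", "invoice_date": "2024-02-30",
       "total_amount": "1,280.00", "tax_amount": None}
print(validate_invoice(good))
print(validate_invoice(bad))
```

A non-empty problem list is a natural trigger for the manual review or second LLM pass mentioned above.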

Implementation and Notes

import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import cv2
import numpy as np

# OCR engine import (PaddleOCR as an example)
from paddleocr import PaddleOCR

# LLM imports (LangChain + OpenAI as an example)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field


# ================== Data structures ==================

@dataclass
class TextBlock:
    """A text block recognized by OCR"""
    text: str  # recognized text
    bbox: List[List[int]]  # four corner points [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    confidence: float  # recognition confidence


@dataclass
class DocumentRegion:
    """A document region (layout analysis result)"""
    region_type: str  # 'title', 'paragraph', 'table', 'header', 'footer'
    bbox: List[int]  # [x_min, y_min, x_max, y_max]
    blocks: List[TextBlock]


class ExtractedInvoice(BaseModel):
    """Data model for the extracted invoice fields"""
    # default=None keeps each field optional when the LLM omits it
    invoice_number: Optional[str] = Field(default=None, description="Invoice number")
    invoice_date: Optional[str] = Field(default=None, description="Invoice date")
    buyer_name: Optional[str] = Field(default=None, description="Buyer name")
    buyer_tax_id: Optional[str] = Field(default=None, description="Buyer tax ID")
    seller_name: Optional[str] = Field(default=None, description="Seller name")
    seller_tax_id: Optional[str] = Field(default=None, description="Seller tax ID")
    total_amount: Optional[float] = Field(default=None, description="Total amount including tax")
    tax_amount: Optional[float] = Field(default=None, description="Tax amount")


# ================== Image preprocessing ==================

class ImagePreprocessor:
    """
    Image preprocessor
    Denoises, deskews, and enhances the raw image
    """

    def denoise(self, image: np.ndarray) -> np.ndarray:
        """
        Remove image noise
        Converts to grayscale and applies a bilateral filter
        """
        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Bilateral filtering removes noise while preserving edges
        # d=9 is the neighborhood diameter; sigmaColor/sigmaSpace control filter strength
        denoised = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

        return denoised

    def deskew(self, image: np.ndarray) -> np.ndarray:
        """
        Skew correction
        Detects the text line angle with a Hough transform and rotates to compensate
        """
        # Edge detection
        edges = cv2.Canny(image, 50, 150, apertureSize=3)

        # Probabilistic Hough transform to detect line segments
        lines = cv2.HoughLinesP(
            edges,
            rho=1,
            theta=np.pi/180,
            threshold=100,
            minLineLength=100,
            maxLineGap=10
        )

        if lines is None or len(lines) == 0:
            return image  # no lines detected, skip correction

        # Compute the angle of every detected segment
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            if x2 - x1 != 0:
                angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
                # Keep only segments closer to horizontal than vertical (|angle| < 45°)
                if abs(angle) < 45:
                    angles.append(angle)

        if len(angles) == 0:
            return image

        # Use the median angle (more robust than the mean)
        median_angle = np.median(angles)

        # Skip rotation for very small angles (avoids over-correction)
        if abs(median_angle) < 0.5:
            return image

        # Rotate the image around its center
        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        rotation_matrix = cv2.getRotationMatrix2D(center, median_angle, 1.0)
        rotated = cv2.warpAffine(image, rotation_matrix, (w, h),
                                  flags=cv2.INTER_CUBIC,
                                  borderMode=cv2.BORDER_REPLICATE)

        return rotated

    def enhance_contrast(self, image: np.ndarray) -> np.ndarray:
        """
        Contrast enhancement
        Uses CLAHE (contrast limited adaptive histogram equalization)
        """
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(image)
        return enhanced

    def preprocess(self, image_path: str) -> np.ndarray:
        """
        Full preprocessing pipeline
        """
        # Read the image
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Cannot read image: {image_path}")

        # Run the preprocessing steps in order
        processed = self.denoise(image)
        processed = self.deskew(processed)
        processed = self.enhance_contrast(processed)

        return processed


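# ================== OCR module ==================

class OCREngine:
    """
    OCR engine wrapper
    Intended to put multiple backends behind one output format
    (only PaddleOCR is wired up here)
    """

    def __init__(self, lang: str = 'ch', use_gpu: bool = True):
        """
        Initialize PaddleOCR
        lang: language code; 'ch' handles mixed Chinese and English
        use_gpu: whether to use GPU acceleration
        Note: these keyword arguments follow the PaddleOCR 2.x API;
        newer releases may use different names.
        """
        self.ocr = PaddleOCR(
            use_angle_cls=True,  # enable the text direction classifier
            lang=lang,
            use_gpu=use_gpu,
            show_log=False
        )

    def recognize(self, image: np.ndarray) -> List[TextBlock]:
        """
        Run OCR and return a list of structured text blocks
        """
        results = self.ocr.ocr(image, cls=True)

        # PaddleOCR returns a nested list; the page entry is None when no text is found
        if not results or results[0] is None:
            return []

        blocks = []
        for line in results[0]:
            bbox = line[0]  # four corner points
            text_info = line[1]  # (text, confidence)

            block = TextBlock(
                text=text_info[0],
                bbox=[[int(p[0]), int(p[1])] for p in bbox],
                confidence=float(text_info[1])
            )
            blocks.append(block)

        return blocks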


# ================== Structure reconstruction ==================

class StructureReconstructor:
    """
    Rebuilds document structure from OCR output
    Handles tables, key-value pairs, and similar layouts
    """

    def reconstruct_table(self, blocks: List[TextBlock],
                          y_threshold: int = 10,
                          x_threshold: int = 20) -> List[List[str]]:
        """
        Rebuild table rows
        y_threshold: vertical merge threshold (pixels)
        x_threshold: horizontal merge threshold (pixels); not used by this
                     row-only implementation, kept for column clustering
        """
        if not blocks:
            return []

        # Compute the vertical center and left edge of every block
        blocks_with_center = []
        for block in blocks:
            y_center = sum(p[1] for p in block.bbox) / 4
            x_min = min(p[0] for p in block.bbox)
            blocks_with_center.append({
                'block': block,
                'y_center': y_center,
                'x_min': x_min
            })

        # Sort by vertical position
        blocks_with_center.sort(key=lambda x: x['y_center'])

        # Cluster into rows
        rows = []
        current_row = [blocks_with_center[0]]

        for i in range(1, len(blocks_with_center)):
            if abs(blocks_with_center[i]['y_center'] -
                   blocks_with_center[i-1]['y_center']) < y_threshold:
                # Same row
                current_row.append(blocks_with_center[i])
            else:
                # New row
                rows.append(current_row)
                current_row = [blocks_with_center[i]]

        rows.append(current_row)  # append the last row

        # Sort each row left to right and extract the text
        table_data = []
        for row in rows:
            row.sort(key=lambda x: x['x_min'])
            row_texts = [item['block'].text for item in row]
            table_data.append(row_texts)

        return table_data

    def extract_key_value_pairs(self, blocks: List[TextBlock],
                                 separator: str = ':') -> Dict[str, str]:
        """
        Extract key-value pairs
        Suited to formatted fields such as "发票号码：12345" (invoice number)
        """
        kv_pairs = {}

        for block in blocks:
            # Chinese documents often use the full-width colon; normalize it first
            text = block.text.replace('：', ':')
            # Try to split into key and value
            if separator in text:
                parts = text.split(separator, 1)
                if len(parts) == 2:
                    key = parts[0].strip()
                    value = parts[1].strip()
                    kv_pairs[key] = value

        return kv_pairs


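# ================== LLM extraction ==================

class LLMExtractor:
    """
    Information extraction with an LLM
    Uses structured output to keep the format consistent
    """

    def __init__(self, model_name: str = "gpt-4o"):
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        self.parser = JsonOutputParser(pydantic_object=ExtractedInvoice)

    def build_prompt(self) -> ChatPromptTemplate:
        """
        Build the structured extraction prompt
        """
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a professional information extraction expert for financial documents.
Your task is to extract key invoice information from OCR text.

Extraction requirements:
1. Follow the field definitions strictly; do not omit or add fields
2. If a field does not appear in the text, return null
3. Convert amount fields to plain numbers (remove currency symbols and thousands separators)
4. Convert date fields to YYYY-MM-DD format
5. Where the text contains recognition errors, infer a reasonable correction from context

The output must be strict JSON.

{format_instructions}"""),
            ("human", """Below is the OCR text of an invoice:

{ocr_text}

Please extract the invoice information:""")
        ])

        return prompt.partial(
            format_instructions=self.parser.get_format_instructions()
        )

    def extract(self, ocr_text: str) -> ExtractedInvoice:
        """
        Run the extraction
        """
        prompt = self.build_prompt()
        chain = prompt | self.llm | self.parser

        result = chain.invoke({"ocr_text": ocr_text})
        # JsonOutputParser returns a plain dict; validate it into the Pydantic model
        return ExtractedInvoice.model_validate(result)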


# ================== Full pipeline ==================

class DocumentExtractionPipeline:
    """
    End-to-end document information extraction pipeline
    Combines preprocessing, OCR, structure reconstruction, and LLM extraction
    """

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.ocr_engine = OCREngine(lang='ch', use_gpu=False)
        self.reconstructor = StructureReconstructor()
        self.extractor = LLMExtractor()

    def process(self, image_path: str) -> Dict[str, Any]:
        """
        Run the full pipeline
        """
        # Step 1: image preprocessing
        print(f"[1/4] Preprocessing image: {image_path}")
        processed_image = self.preprocessor.preprocess(image_path)

        # Step 2: OCR
        print("[2/4] Running OCR...")
        blocks = self.ocr_engine.recognize(processed_image)
        print(f"    Recognized {len(blocks)} text blocks")

        # Step 3: structure reconstruction
        print("[3/4] Rebuilding document structure...")
        # Try to extract key-value pairs
        kv_pairs = self.reconstructor.extract_key_value_pairs(blocks)
        # Try to rebuild table rows
        table_data = self.reconstructor.reconstruct_table(blocks)

        # Combine into structured text
        structured_text = self._format_for_llm(blocks, kv_pairs, table_data)

        # Step 4: LLM extraction
        print("[4/4] Extracting fields with the LLM...")
        extracted_info = self.extractor.extract(structured_text)

        return {
            'ocr_blocks': blocks,
            'key_value_pairs': kv_pairs,
            'table_data': table_data,
            'extracted_info': extracted_info.model_dump()
        }

    def _format_for_llm(self, blocks: List[TextBlock],
                        kv_pairs: Dict[str, str],
                        table_data: List[List[str]]) -> str:
        """
        Format the OCR result as text suited to the LLM
        """
        lines = []

        # Key-value section
        if kv_pairs:
            lines.append("=== Fields ===")
            for key, value in kv_pairs.items():
                lines.append(f"{key}: {value}")
            lines.append("")

        # Table section
        if table_data:
            lines.append("=== Table ===")
            for row in table_data:
                lines.append(" | ".join(row))
            lines.append("")

        # Raw text (in the order returned by OCR)
        lines.append("=== Full text ===")
        for block in blocks:
            lines.append(block.text)

        return "\n".join(lines)


# ================== Usage example ==================

if __name__ == "__main__":
    pipeline = DocumentExtractionPipeline()

    # Process an invoice image
    result = pipeline.process("invoice_sample.jpg")

    print("\n" + "=" * 60)
    print("Extraction result:")
    print("=" * 60)
    print(json.dumps(result['extracted_info'], indent=2, ensure_ascii=False))

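Notes on key lines:

cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75): bilateral filtering is a common denoising method. Unlike Gaussian blur, it preserves edge sharpness, which matters for text recognition. d=9 is the diameter of the pixel neighborhood; larger values are slower and smooth more strongly, which can start to blur thin strokes.
cv2.Canny(image, 50, 150): Canny edge detection is the step that precedes the Hough transform. 50 and 150 are the low and high thresholds; a 1:3 ratio between them is a classic setting.
use_angle_cls=True: enables PaddleOCR's text direction classifier. In PaddleOCR 2.x this classifier distinguishes upright text lines from ones rotated by 180° and flips the latter before recognition. Rotated text is common in photos taken with a phone, and enabling the classifier can improve recognition on such images.
JsonOutputParser(pydantic_object=ExtractedInvoice): LangChain's JSON output parser. The Pydantic model is used to generate the format instructions placed in the prompt, and the parser returns the LLM output as a Python dict rather than a model instance; validating that dict against ExtractedInvoice is a separate step. Together these address inconsistent LLM output formats and simplify integration with downstream systems.
y_threshold: int = 10: the vertical threshold for clustering table rows. Two text blocks whose vertical centers differ by less than 10 pixels are treated as the same row. This value should be adjusted to the document resolution; about one third of the average character height is a common starting point.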
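
Rather than hard-coding 10 pixels, the row threshold can be derived from the OCR boxes themselves. The sketch below uses the median height of the text blocks as a stand-in for the average character height, which is an assumption that holds when each block is a single line of text; `estimate_y_threshold` is a hypothetical helper, not part of the listing above.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def estimate_y_threshold(bboxes: Sequence[Sequence[Point]],
                         ratio: float = 1 / 3,
                         fallback: float = 10.0) -> float:
    """Estimate the row-clustering threshold as a fraction of the median block height."""
    heights: List[float] = []
    for box in bboxes:
        ys = [point[1] for point in box]
        height = max(ys) - min(ys)
        if height > 0:
            heights.append(height)
    if not heights:
        return fallback  # no usable boxes: fall back to the fixed default
    return median(heights) * ratio


boxes = [
    [(10, 10), (200, 10), (200, 40), (10, 40)],    # height 30
    [(10, 60), (180, 60), (180, 90), (10, 90)],    # height 30
    [(10, 110), (150, 110), (150, 146), (10, 146)],  # height 36
]
threshold = estimate_y_threshold(boxes)
print(threshold)
```

The result can then be passed as `y_threshold` when rebuilding rows, so the same code adapts to scans of different resolution.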

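Suggestions for improving accuracy:

Multi-model fusion: vote over, or pick the best of, the results from several OCR engines to reduce the error rate of any single model.
Confidence filtering: discard recognition results with confidence below 0.5, or flag them for manual review.
Dictionary constraints: for fields with a fixed format (such as tax IDs and invoice codes), validate with regular expressions or dictionaries and enforce the expected format.
Few-shot fine-tuning: for a specific document type (such as VAT invoices), fine-tune the LLM on a small amount of annotated data to improve domain adaptation.
Human-in-the-loop: for high-risk fields (such as amounts), set a confidence threshold so that low-confidence results automatically trigger a manual review workflow.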
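
Two of the suggestions above, confidence filtering and multi-engine voting, can be sketched in a few lines. Both helpers are hypothetical; the 0.5 threshold comes from the text, while the rule that a tie between engines yields no winner (so the field goes to manual review) is an assumption of this example.

```python
from collections import Counter
from typing import List, Optional, Tuple

Recognized = Tuple[str, float]  # (text, confidence)


def triage_by_confidence(blocks: List[Recognized],
                         threshold: float = 0.5) -> Tuple[List[Recognized], List[Recognized]]:
    """Split results into accepted ones and ones flagged for manual review."""
    accepted = [(text, conf) for text, conf in blocks if conf >= threshold]
    review = [(text, conf) for text, conf in blocks if conf < threshold]
    return accepted, review


def vote_field(candidates: List[Optional[str]]) -> Optional[str]:
    """Plurality vote across engines; returns None when empty or tied."""
    values = [c for c in candidates if c]
    if not values:
        return None
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: leave the decision to manual review
    return counts[0][0]


accepted, review = triage_by_confidence([("12345", 0.97), ("l2E4S", 0.31)])
print(accepted, review)

# Three engines read the same tax ID field; one confuses the digit 0 with the letter O.
winner = vote_field(["91310000MA1K", "91310000MA1K", "9131OOOOMA1K"])
print(winner)
```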
