概述、实施策略与见解

将PDF文件和扫描图像等非结构化文档转换为结构化或半结构化格式是人工智能的关键部分。然而，由于PDF的复杂性和PDF解析任务的复杂性，这一过程显得神秘莫测。

本系列文章旨在揭开PDF解析的神秘面纱。在上一篇文章中，我们介绍了PDF解析的主要任务，对现有的方法进行了分类，并简要介绍了每种方法。

在本文中，我们重点讨论基于管道的方法。我们从概述开始，然后介绍几个代表性的基于管道的PDF解析框架的实施策略，分享我们获得的见解。

概述

基于管道的方法将解析PDF的任务视为一系列模型或算法的管道，如图1所示。

基于管道的方法可以分为以下五个步骤：

对原始PDF文件进行预处理，以修复模糊或倾斜方向等问题。此步骤包括图像增强、图像方向校正等。
进行布局分析，主要包括两个组成部分：视觉结构分析和语义结构分析。前者识别文档的结构并勾勒出相似区域，而后者则对这些区域进行标注，指定文档类型，如文本、标题、列表、表格、图形等。此步骤还涉及分析页面的阅读顺序。
分别处理布局分析中识别出的不同区域。此过程包括理解表格、识别文本以及识别其他组件，如公式、流程图和特殊符号。
整合先前的结果以恢复页面结构。
输出结构化或半结构化的信息，如Markdown、JSON或HTML。

接下来，本文将讨论几个具有代表性的基于管道的PDF解析框架，并分享从中获得的见解。

Marker

Marker 是一个深度学习模型的管道。它能够将 PDF、EPUB 和 MOBI 文档转换为 Markdown 格式。

整体流程

如图2所示，Marker的整体流程分为以下四个步骤：

步骤1： 将页面划分为块，并使用PyMuPDF和OCR提取文本。相应的代码是：

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    doc = pymupdf.open(fname, filetype=filetype)
    if filetype != "pdf":
        conv = doc.convert_to_pdf()
        doc = pymupdf.open("pdf", conv)

    blocks, toc, ocr_stats = get_text_blocks(
        doc,
        tess_lang,
        spell_lang,
        max_pages=max_pages,
        parallel=int(parallel_factor * settings.OCR_PARALLEL_WORKERS)
    )

步骤2： 利用布局分割器对块进行分类，并使用列检测器对块进行排序。相应的代码是：

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # 从列表中解包模型
    texify_model, layoutlm_model, order_model, edit_model = model_lst

    block_types = detect_document_block_types(
        doc,
        blocks,
        layoutlm_model,
        batch_size=int(settings.LAYOUT_BATCH_SIZE * parallel_factor)
    )

    # 查找页眉和页脚
    bad_span_ids = filter_header_footer(blocks)
    out_meta["block_stats"] = {"header_footer": len(bad_span_ids)}

    annotate_spans(blocks, block_types)

    # 如果设置了标志，则转储调试数据
    dump_bbox_debug_data(doc, blocks)

    blocks = order_blocks(
        doc,
        blocks,
        order_model,
        batch_size=int(settings.ORDERER_BATCH_SIZE * parallel_factor)
    )
    ...
    ...

步骤3： 过滤页眉和页脚，修复代码和表格块，并应用Texify模型处理公式。相应的代码是：

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # 修复代码块
    code_block_count = identify_code_blocks(blocks)
    out_meta["block_stats"]["code"] = code_block_count
    indent_blocks(blocks)

    # 修复表格块
    merge_table_blocks(blocks)
    table_count = create_new_tables(blocks)
    out_meta["block_stats"]["table"] = table_count

    for page in blocks:
        for block in page.blocks:
            block.filter_spans(bad_span_ids)
            block.filter_bad_span_types()

    filtered, eq_stats = replace_equations(
        doc,
        blocks,
        block_types,
        texify_model,
        batch_size=int(settings.TEXIFY_BATCH_SIZE * parallel_factor)
    )
    out_meta["block_stats"]["equations"] = eq_stats
    ...
    ...

步骤4： 使用编辑模型进行文本后处理。相应的代码是：

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # 复制以避免更改原始数据
    merged_lines = merge_spans(filtered)
    text_blocks = merge_lines(merged_lines, filtered)
    text_blocks = filter_common_titles(text_blocks)
    full_text = get_full_text(text_blocks)

    # 处理被连接的空块
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    full_text = re.sub(r'(\n\s){3,}', '\n\n', full_text)

    # 用 - 替换项目符号字符
    full_text = replace_bullets(full_text)

    # 使用编辑模型后处理文本
    full_text, edit_stats = edit_full_text(
        full_text,
        edit_model,
        batch_size=settings.EDITOR_BATCH_SIZE * parallel_factor
    )
    out_meta["postprocess_stats"] = {"edit": edit_stats}

    return full_text, out_meta

从Marker获得的见解

到目前为止，我们已经介绍了Marker的整体流程。现在，让我们讨论从Marker获得的见解。

见解 1：布局分析可以分为几个子任务。第一个子任务涉及调用PyMuPDF API以获取页面块。

def ocr_entire_page(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    if settings.OCR_ENGINE == "tesseract":
        return ocr_entire_page_tess(page, lang, spellchecker)
    elif settings.OCR_ENGINE == "ocrmypdf":
        return ocr_entire_page_ocrmp(page, lang, spellchecker)
    else:
        raise ValueError(f"Unknown OCR engine {settings.OCR_ENGINE}")


def ocr_entire_page_tess(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    try:
        full_tp = page.get_textpage_ocr(flags=settings.TEXT_FLAGS, dpi=settings.OCR_DPI, full=True, language=lang)
        blocks = page.get_text("dict", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)["blocks"]
        full_text = page.get_text("text", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)

        if len(full_text) == 0:
            return []

        # Check if OCR worked. If it didn't, return empty list
        # OCR can fail if there is a scanned blank page with some faint text impressions, for example
        if detect_bad_ocr(full_text, spellchecker):
            return []
    except RuntimeError:
        return []
    return blocks

见解 2：对小型多模态预训练模型进行微调，例如LayoutLMv3 ，对于特定任务非常有益。例如，Marker对LayoutLMv3进行了微调，以获取用于检测块类型的布局分割模型。

def load_layout_model():
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        settings.LAYOUT_MODEL_NAME,
        torch_dtype=settings.MODEL_DTYPE,
    ).to(settings.TORCH_DEVICE_MODEL)

    model.config.id2label = {
        0: "Caption",
        1: "Footnote",
        2: "Formula",
        3: "List-item",
        4: "Page-footer",
        5: "Page-header",
        6: "Picture",
        7: "Section-header",
        8: "Table",
        9: "Text",
        10: "Title"
    }

    model.config.label2id = {v: k for k, v in model.config.id2label.items()}
    return model

用于此微调的数据集来自公共数据集DocLayNet 。

见解 3：识别PDF是单列还是双列以确定阅读顺序在PDF解析中至关重要。Marker的方法涉及对LayoutLMv3进行微调，以创建列检测模型。该模型确定页面上的列数，然后应用中点方法。

def add_column_counts(doc, doc_blocks, model, batch_size):
    for i in range(0, len(doc_blocks), batch_size):
        batch = range(i, min(i + batch_size, len(doc_blocks)))
        rgb_images = []
        bboxes = []
        words = []
        for pnum in batch:
            page = doc[pnum]
            rgb_image, page_bboxes, page_words = get_inference_data(page, doc_blocks[pnum])
            rgb_images.append(rgb_image)
            bboxes.append(page_bboxes)
            words.append(page_words)

        predictions = batch_inference(rgb_images, bboxes, words, model)
        for pnum, prediction in zip(batch, predictions):
            doc_blocks[pnum].column_count = prediction


def order_blocks(doc, doc_blocks: List[Page], model, batch_size=settings.ORDERER_BATCH_SIZE):
    add_column_counts(doc, doc_blocks, model, batch_size)

    for page_blocks in doc_blocks:
        if page_blocks.column_count > 1:
            # Resort blocks based on position
            split_pos = page_blocks.x_start + page_blocks.width / 2
            left_blocks = []
            right_blocks = []
            for block in page_blocks.blocks:
                if block.x_start <= split_pos:
                    left_blocks.append(block)
                else:
                    right_blocks.append(block)
            page_blocks.blocks = left_blocks + right_blocks
    return doc_blocks

这与我在高级RAG 02：揭示PDF解析中的方法类似。

见解 4：可以训练专门的模型来处理数学公式。例如，Marker的模型Texify 使用Donut架构。该模型是在使用来自网络的latex图像和相应方程的Donut模型上进行训练的，包括im2latex 数据集。训练在4台A6000上进行，持续约两天，相当于大约6个周期。

见解 5：一个模型也可以用于后处理。主要思想是训练一个T5模型，以接受几乎最终的文本并通过去除伪影、添加空格和插入新行来进行精细化。

def load_editing_model():
    if not settings.ENABLE_EDITOR_MODEL:
        return None

    model = T5ForTokenClassification.from_pretrained(
            settings.EDITOR_MODEL_NAME,
            torch_dtype=settings.MODEL_DTYPE,
        ).to(settings.TORCH_DEVICE_MODEL)
    model.eval()

    model.config.label2id = {
        "equal": 0,
        "delete": 1,
        "newline-1": 2,
        "space-1": 3,
    }
    model.config.id2label = {v: k for k, v in   model.config.label2id.items()}
    return model

目前，关于后处理器的训练和数据集构建尚未找到更多细节。

Marker的缺点

自然，Marker也有一些缺点：

没有训练或微调专门的布局分析模型，而是使用了PyMuPDF的内置功能。这种方法的有效性值得怀疑。
表格识别效果不太好，无法识别表格标题，效果不如Nougat（将在下一篇文章中深入介绍一种无OCR的小模型解决方案）。例如，图3展示了“Attention Is All You Need ”中表3的表格识别结果。左侧显示原始表格，中间显示使用Marker的结果，右侧则展示了Nougat的结果。

仅支持类似英语的语言。像日语和印地语的语言将无法使用。

PaperMage

Papermage 是一个开源框架，用于分析和处理视觉丰富的结构化科学文档。它提供了清晰直观的抽象，以便在文档中无缝表示和操作文本和视觉元素。

Papermage 将各种自然语言处理（NLP）和计算机视觉（CV）模型集成到一个框架中。它为常见的科学文档处理场景提供了现成的解决方案。

接下来，我们将解释 PaperMage 的原理，并结合源代码讨论其整体过程。然后，我们将讨论从 PaperMage 中获得的见解。

组件

Papermage 主要包括三个部分：

Magelib：一个包含用于表示和处理视觉丰富文档的原语和方法的库，作为多模态结构。
Predictors：一个实现，将各种最先进的科学文档分析模型集成到统一的接口中。即使单个模型是用不同的框架编写或以不同模式运行，这也是可能的。
Recipes：它提供了对经过充分测试的单个模块组合的即插即用访问，通常是单一模态，形成复杂且可扩展的多模态管道。

基础数据类

Magelib 提供了三个基础数据类，用于表示视觉丰富、结构化文档的基本元素：Document、Layers 和 Entities。

Document 和 Layers

图 4 说明了 PaperMage 如何创建和表示文档。

一旦通过各种算法或模型提取了文档结构，PaperMage 将其概念化为注释层，用于存储文本和视觉信息。

我们稍后将分析 recipe.run() 函数的具体执行过程及其源代码。

Entities

如图 5 所示，实体表示一个多模态内容单元。

如何管理不连续的单元，例如跨列/页面的句子或被浮动图形/脚注中断的句子？

PaperMage 使用两个成员变量：spans 和 boxes。如图 5 所示，spans 确定句子的文本在所有符号中的位置，而 boxes 则映射其在页面上的视觉坐标。这种方法提供了灵活性，以适应微妙的布局差异。

此外，我们可以以不同的方式灵活访问实体，如图 6 所示。

为了更深入地理解 PaperMage，我们从一个 PDF 解析的具体示例开始，并从那里进行详细阐述。

整体流程和代码分析

测试代码如下。

from papermage.recipes import CoreRecipe

core_recipe = CoreRecipe()

doc = core_recipe.run("YOUR_PDF_PATH")

首先，core_recipe = CoreRecipe() 将进入类 CoreRecipe 的构造函数，在那里将进行相关库和模型的初始化。

class CoreRecipe(Recipe):
    def __init__(
        self,
        ivila_predictor_path: str = "allenai/ivila-row-layoutlm-finetuned-s2vl-v2",
        bio_roberta_predictor_path: str = "allenai/vila-roberta-large-s2vl-internal",
        svm_word_predictor_path: str = "https://ai2-s2-research-public.s3.us-west-2.amazonaws.com/mmda/models/svm_word_predictor.tar.gz",
        dpi: int = 72,
    ):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.dpi = dpi

        self.logger.info("实例化配方中...")
        self.parser = PDFPlumberParser()
        self.rasterizer = PDF2ImageRasterizer()

        # with warnings.catch_warnings():
        #     warnings.simplefilter("ignore")
        #     self.word_predictor = SVMWordPredictor.from_path(svm_word_predictor_path)

        self.publaynet_block_predictor = LPEffDetPubLayNetBlockPredictor.from_pretrained()
        self.ivila_predictor = IVILATokenClassificationPredictor.from_pretrained(ivila_predictor_path)
        self.sent_predictor = PysbdSentencePredictor()
        self.logger.info("完成配方实例化")

由于 class Recipe 是 CoreRecipe 的父类，因此 core_recipe.run() 函数将调用 Recipe::run() 。

class Recipe:
    @abstractmethod
    def run(self, input: Any) -> Document:
        if isinstance(input, Path):
            if input.suffix == ".pdf":
                return self.from_pdf(pdf=input)
            if input.suffix == ".json":
                return self.from_json(doc=input)

            raise NotImplementedError("不支持的文件类型。")

        if isinstance(input, Document):
            return self.from_doc(doc=input)

        if isinstance(input, str):
            if os.path.exists(input):
                input = Path(input)
                return self.run(input=input)
            else:
                return self.from_str(text=input)

        raise NotImplementedError("不支持的文档输入。")

然后将进入 class CoreRecipe:: from_pdf() 和 class CoreRecipe:: from_doc() ：

class CoreRecipe(Recipe):
    ...
    ...
    def from_pdf(self, pdf: Path) -> Document:
        self.logger.info("正在解析文档...")
        doc = self.parser.parse(input_pdf_path=pdf)

        self.logger.info("正在栅格化文档...")
        images = self.rasterizer.rasterize(input_pdf_path=pdf, dpi=self.dpi)
        doc.annotate_images(images=list(images))
        self.rasterizer.attach_images(images=images, doc=doc)
        return self.from_doc(doc=doc)

    def from_doc(self, doc: Document) -> Document:
        # self.logger.info("正在预测单词...")
        # words = self.word_predictor.predict(doc=doc)
        # doc.annotate_layer(name=WordsFieldName, entities=words)

        self.logger.info("正在预测句子...")
        sentences = self.sent_predictor.predict(doc=doc)
        doc.annotate_layer(name=SentencesFieldName, entities=sentences)

        self.logger.info("正在预测区块...")
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            blocks = self.publaynet_block_predictor.predict(doc=doc)
        doc.annotate_layer(name=BlocksFieldName, entities=blocks)

        self.logger.info("正在预测图形和表格...")
        figures = []
        tables = []
        for block in blocks:
            if block.metadata.type == "Figure":
                figure = Entity(boxes=block.boxes)
                figures.append(figure)
            elif block.metadata.type == "Table":
                table = Entity(boxes=block.boxes)
                tables.append(table)
        doc.annotate_layer(name=FiguresFieldName, entities=figures)
        doc.annotate_layer(name=TablesFieldName, entities=tables)

        # self.logger.info("正在预测 vila...")
        vila_entities = self.ivila_predictor.predict(doc=doc)
        doc.annotate_layer(name="vila_entities", entities=vila_entities)

        for entity in vila_entities:
            entity.boxes = [
                Box.create_enclosing_box(
                    [b for t in doc.intersect_by_span(entity, name=TokensFieldName) for b in t.boxes]
                )
            ]
            # entity.text = make_text(entity=entity, document=doc)
        preds = group_by(entities=vila_entities, metadata_field="label", metadata_values_map=VILA_LABELS_MAP)
        doc.annotate(*preds)
        return doc

整体流程如图 7 所示：

图 7 表明 PaperMage 的处理流程也遵循管道方法。

最初，使用 PDFPlumber 库进行布局分析。随后，基于布局分析结果，使用专业算法或模型解析页面上的其他实体。这些实体包括句子、图形、表格、标题等。

接下来，我们将集中讨论三种算法或模型：

句子拆分
布局结构分析
逻辑结构分析。

句子分割

用于句子分割的算法是 PySBD ，这是一个基于规则的句子边界消歧义的 Python 包。

输入由一系列标记组成。输出提供每个句子的范围。

[
Unannotated Entity: {'spans': [[0, 212]]}, 
Unannotated Entity: {'spans': [[212, 367]]},  
…
]

布局结构分析

用于分析页面布局结构的模型是 LPEffDetPubLayNetBlockPredictor。这是一个基于深度学习的高效目标检测模型，由 LayoutParser 提供。其主要功能是将文档分割成视觉块区域。

输入是页面的图像，称为 doc.images。输出是一个 class box 对象及其各个块的相应类型。一个 box 包括左上顶点的 x 坐标、左上顶点的 y 坐标、页面宽度、页面高度和页面编号。

[
Unannotated Entity: {'boxes': [[0.5179840190298606, 0.752760137345049, 0.3682081491355128, 0.15176369855069774, 0]], 'metadata': {'type': 'Text'}}, 
Unannotated Entity: {'boxes': [[0.5145780320135539, 0.5080924136055337, 0.3675624668198144, 0.23725746136663078, 0]], 'metadata': {'type': 'Text'}}, 
…
]

逻辑结构分析

用于分析文档逻辑结构的模型是 IVILATokenClassificationPredictor。它将文档分割成组织单元，例如标题、摘要、正文、脚注、图表等。

提供的输入是以字典形式呈现的页面级数据。

{
        'words': ['word1', 'word2', ...],
        'bbox': [[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
        'block_ids': [0, 0, 0, 1 ...],
        'line_ids': [0, 1, 1, 2 ...],
        'labels': [0, 0, 0, 1 ...], # 可能为空
    }

输出是每个实体的范围。

[
Unannotated Entity: {'spans': [[0, 80]], 'metadata': {'label': 'Title'}}, 
Unannotated Entity: {'spans': [[81, 157]], 'metadata': {'label': 'Author'}}, 
Unannotated Entity: {'spans': [[158, 215]], 'metadata': {'label': 'Paragraph'}}, 
...
]

关于PaperMage的见解与讨论

PDF解析的抽象

对于PDF解析任务，PaperMage提出的抽象方法是有效的，即将整个PDF分为文档、层和实体等类型，这有助于分类和管理。

可扩展性

PaperMage设计了一个易于扩展的框架，使得开发者可以方便地进行后续开发。

例如，要添加一个自定义预测器，我们只需从基类BasePredictor继承并重写_predict()函数。

from .base_predictor import BasePredictor

class YOUR_NEW_Predictor(BasePredictor):
    ...
    ...
    def _predict(self, doc: Document) -> List[YOUR_RET_TYPE]:
    ...
    ...

关于并行化

图7表明通过并行化有潜力改善PaperMage，这是一个可行的增强方向。

尽管当前版本的PaperMage 不包含任何与并行处理相关的代码，但添加并行处理逻辑可以显著提高PDF解析的效率。

Unstructured

Unstructured 是一个开源的非结构化数据预处理工具。我们在上一篇文章中介绍了它的整体流程。

接下来，我们将主要讨论从非结构化框架中获得的见解和经验，特别是它如何帮助我们开发自己的 PDF 解析工具。

关于布局分析

对非结构化框架的布局分析进行了详细研究。

如果我们设置 strategy='hi_res'，它将利用 YOLOX 或 detectron2 等模型进行布局分析。这与 PDFMiner 结合，以进行额外的检测。两者的结果合并以生成最终布局，如图8所示。

图9和图10展示了BERT论文第16页的布局分析结果的可视化，图片中的框代表每个区域的范围。图9中显示的目标检测模型的结果更为准确，集成了更多的表格和图像。相反，图10中显示的PDFMiner的检测结果则将表格和图像的内容分开。

合并布局的具体代码如下，它具有一个双重循环，用于评估PDFMiner检测结果（extracted_layout）和目标检测模型结果（inferred_layout）中每个区域之间的关系，然后决定是否合并。

def merge_inferred_layout_with_extracted_layout(
    inferred_layout: Collection[LayoutElement],
    extracted_layout: Collection[TextRegion],
    page_image_size: tuple,
    same_region_threshold: float = inference_config.LAYOUT_SAME_REGION_THRESHOLD,
    subregion_threshold: float = inference_config.LAYOUT_SUBREGION_THRESHOLD,
) -> List[LayoutElement]:
    """Merge two layouts to produce a single layout."""
    extracted_elements_to_add: List[TextRegion] = []
    inferred_regions_to_remove = []
    w, h = page_image_size
    full_page_region = Rectangle(0, 0, w, h)
    for extracted_region in extracted_layout:
        extracted_is_image = isinstance(extracted_region, ImageTextRegion)
        if extracted_is_image:
            # Skip extracted images for this purpose, we don't have the text from them and they
            # don't provide good text bounding boxes.

            is_full_page_image = region_bounding_boxes_are_almost_the_same(
                extracted_region.bbox,
                full_page_region,
                FULL_PAGE_REGION_THRESHOLD,
            )

            if is_full_page_image:
                continue
        region_matched = False
        for inferred_region in inferred_layout:
            if inferred_region.source in CHIPPER_VERSIONS:
                continue
            ...
            ...

关于自定义

在非结构化框架中有许多中间结果，这使得自定义变得容易。

在上一篇文章中，我们讨论了处理非结构化数据的三个挑战：

解析表格
重新排列检测到的块，特别是对于双列PDF
提取多级标题

最后两个挑战可以通过修改中间结构来解决。作为示例，图11展示了BERT论文第二页的最终布局。

同时，我们可以轻松获得布局分析的可用结果：

[

LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text='word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9517233967781067, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text='• We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9414362907409668, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text='ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-speciﬁc architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.938507616519928, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text='• We show that pre-trained representations reduce the need for many heavily-engineered task- speciﬁc architectures. BERT is the ﬁrst ﬁne- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-speciﬁc architectures. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9461237788200378, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text='• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9213815927505493, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text='2 Related Work ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8663332462310791, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text='There is a long history of pre-training general lan- guage representations, and we brieﬂy review the most widely-used approaches in this section. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.928022563457489, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text='2.2 Unsupervised Fine-tuning Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8346447348594666, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text='As with the feature-based approaches, the ﬁrst works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9344717860221863, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text='2.1 Unsupervised Feature-based Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8317819237709045, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text='Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering signiﬁcant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9450697302818298, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text='More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and ﬁne-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9476840496063232, image_path=None, parent=None)

]

利用上述信息，我们可以轻松执行排序和提取多级标题等任务。

因此，在开发我们自己的PDF解析工具时，我们应尽量保留尽可能多的有用中间信息和元数据。

关于表格检测与识别

在表格检测与识别中，使用了 Table Transformer 作为无结构框架的工具。

Table Transformer 模型在 PubTables-1M: Towards comprehensive table extraction from unstructured documents 中提出。本文介绍了一个新的数据集 PubTables-1M，旨在从非结构化文档中提取表格，并进行表格结构识别和功能分析，如图 12 所示。

Table Transformer 在 PubTables-1M 数据集上进行训练，基于 DETR 模型，执行表格检测和表格结构识别等任务。

有关表格处理效果，请参考我的上一篇文章。

关于公式检测与识别

非结构化框架缺乏专门的公式检测与识别模块，导致性能平庸，如图13所示。

结论

本文概述了基于管道的方法在PDF解析中的应用。它通过三个具有代表性的框架作为例子探讨了这种方法，提供了深入的介绍并分享了从中获得的见解。

总之，

尽管Marker有一些缺点，但它是一个轻量且更快的工具。
PaperMage主要设计用于科学文档，其出色的可扩展性支持未来的发展。
Unstructured是一个全面的基于管道的PDF解析框架。它的优势在于详细的布局分析和强大的可定制性。

一般来说，基于管道的PDF解析方法是可解释的且易于定制，使其成为一种广泛使用的PDF解析方法。然而，它的有效性在很大程度上依赖于过程中使用的每个模型或算法的性能。因此，必须仔细设计每个模型的训练数据和结构。

如果您对PDF解析或文档智能感兴趣，请随时查看我的其他文章。

最后，如果本文中有任何错误或遗漏，或者您有任何想法要分享，请在评论区指出。

Barry's Home

解密 PDF 解析 02基于管道的方法