包含关键点、图表和代码

对于RAG，从文档中提取信息是一个不可避免的场景。确保从源文件中有效提取内容对于提高最终输出的质量至关重要。

这一过程的重要性不容小觑。在实施RAG时，解析过程中信息提取不充分可能导致对PDF文件所含信息的理解和利用受限。

RAG中解析过程的位置如图1所示：

在实际工作中，非结构化数据远比结构化数据丰富。如果这些海量数据无法被解析，其巨大的价值将无法实现。

在非结构化数据中，PDF文档占据了大部分。有效处理PDF文档也能极大地辅助管理其他类型的非结构化文档。

本文主要介绍解析PDF文件的方法。它提供了算法和建议，以有效解析PDF文档并尽可能提取有用信息。

PDF解析的挑战

PDF文档代表了非结构化文档的典型，然而，从PDF文档中提取信息是一个具有挑战性的过程。

与其说PDF是一种数据格式，更准确地描述它是一系列打印指令的集合。一个PDF文件由一系列指令组成，这些指令指导PDF阅读器或打印机在屏幕或纸张上的何处以及如何显示符号。这与HTML和docx等文件格式形成对比，后者使用诸如<p>、<w:p>、<table>和<w:tbl>等标签来组织不同的逻辑结构，如图2所示：

解析PDF文档的挑战在于准确提取整个页面的布局，并将内容（包括表格、标题、段落和图像）转换为文档的文本表示。这一过程涉及处理文本提取的不准确性、图像识别以及表格中行列关系的混淆。

如何解析PDF文档

通常，解析PDF有三种方法：

基于规则的方法：根据文档的组织特征确定每个部分的样式和内容。然而，这种方法的通用性不强，因为PDF的类型和布局繁多，无法用预定义的规则覆盖所有情况。
基于深度学习模型的方法：如结合对象检测和OCR模型的流行解决方案。
基于多模态大型模型解析复杂结构或提取PDF中的关键信息。

基于规则的方法

最具代表性的工具之一是 pypdf ，这是一个广泛使用的基于规则的解析器。它是 LangChain 和 LlamaIndex 中用于解析 PDF 文件的标准方法。

以下是使用 pypdf 尝试解析“Attention Is All You Need ”论文第 6 页的示例。原始页面如图 3 所示。

代码如下：

import PyPDF2
filename = "/Users/Florian/Downloads/1706.03762.pdf"
pdf_file = open(filename, 'rb')

reader = PyPDF2.PdfReader(pdf_file)

page_num = 5
page = reader.pages[page_num]
text = page.extract_text()

print('--------------------------------------------------')
print(text)

pdf_file.close()

执行结果如下（为简洁起见省略其余部分）：

(py) Florian:~ Florian$ pip list | grep pypdf
pypdf                    3.17.4
pypdfium2                4.26.0

(py) Florian:~ Florian$ python /Users/Florian/Downloads/pypdf_test.py
--------------------------------------------------
表 1：不同层类型的最大路径长度、每层复杂度和最小顺序操作数。n 是序列长度，d 是表示维度，k 是卷积核大小，r 是受限自注意力中的邻域大小。
层类型 每层复杂度 顺序操作数 最大路径长度
自注意力 O(n^2·d) O(1) O(1)
循环 O(n·d^2) O(n) O(n)
卷积 O(k·n·d^2) O(1) O(logk(n))
受限自注意力 O(r·n·d) O(1) O(n/r)
3.5 位置编码
由于我们的模型不包含循环和卷积，为了使模型利用序列的顺序，我们必须注入一些关于序列中标记的相对或绝对位置的信息。为此，我们在编码器和解码器堆栈的底部将“位置编码”添加到输入嵌入中。位置编码与嵌入具有相同的维度 dmodel，以便两者可以相加。位置编码有很多选择，可以是学习的或固定的 [9]。
在这项工作中，我们使用不同频率的正弦和余弦函数：
PE(pos,2i)=sin(pos/10000^2i/dmodel)
PE(pos,2i+1)=cos(pos/10000^2i/dmodel)
其中 pos 是位置，i 是维度。也就是说，位置编码的每个维度对应一个正弦波。波长形成从 2π 到 10000·2π 的几何级数。我们选择这个函数是因为我们假设它可以让模型轻松学习通过相对位置进行关注，因为对于任何固定偏移 k，PEpos+k 可以表示为 PEpos 的线性函数。
...
...
...

根据 PyPDF 的检测结果，可以观察到它将 PDF 中的字符序列序列化为一个长序列，而不保留结构信息。换句话说，它将文档的每一行视为由换行符“\n”分隔的序列，这使得无法准确识别段落或表格。

这一限制是基于规则方法的固有特性。

基于深度学习模型的方法

这种方法的优势在于其能够准确识别整个文档的布局，包括表格和段落，甚至能理解表格内部的结构。这意味着它可以将文档分割成定义明确、完整的信息单元，同时保留预期的意义和结构。

然而，也存在一些局限性。目标检测和OCR阶段可能会耗时较长。因此，建议使用GPU或其他加速设备，并采用多进程和多线程进行处理。

这种方法涉及目标检测和OCR模型，我测试了几个具有代表性的开源框架：

Unstructured ：它已被集成到langchain中。使用infer_table_structure=True的hi_res策略进行表格识别效果良好。然而，fast策略表现不佳，因为它不使用目标检测模型，错误地识别了许多图像和表格。
Layout-parser ：如果需要识别复杂的结构化PDF，建议使用最大的模型以获得更高的准确性，尽管可能会稍慢一些。此外，似乎Layout-parser的模型在过去两年内没有更新。
PP-StructureV2 ：使用多种模型组合进行文档分析，性能中等偏上。架构如图4所示：

除了开源工具外，还有如ChatDOC这样的付费工具，采用基于布局的识别+OCR方法来解析PDF文档。

接下来，我们将解释如何使用开源的unstructured 框架解析PDF，解决三个关键挑战。

挑战1：如何从表格和图像中提取数据

这里，我们以unstructured 框架为例。检测到的表格数据可以直接导出为HTML。代码如下：

from unstructured.partition.pdf import partition_pdf

filename = "/Users/Florian/Downloads/Attention_Is_All_You_Need.pdf"

# infer_table_structure=True 自动选择 hi_res 策略
elements = partition_pdf(filename=filename, infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print('--------------------------------------------------')
print(tables[0].metadata.text_as_html)

我追踪了**partition_pdf**函数的内部流程。图5是一个基本流程图。

代码的运行结果如下：

Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 · d) O(n · d2) O(k · n · d2) O(r · n · d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)
--------------------------------------------------
<table><thead><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></thead><tr><td>Self-Attention</td><td>O(n? - d)</td><td>O(1)</td><td>O(1)</td></tr><tr><td>Recurrent</td><td>O(n- d?)</td><td>O(n)</td><td>O(n)</td></tr><tr><td>Convolutional</td><td>O(k-n-d?)</td><td>O(1)</td><td>O(logy(n))</td></tr><tr><td>Self-Attention (restricted)</td><td>O(r-n-d)</td><td>ol)</td><td>O(n/r)</td></tr></table>

复制HTML标签并保存为HTML文件，然后用Chrome打开，如图6所示：

可以看出，unstructured的算法在很大程度上还原了整个表格。

挑战2：如何重新排列检测到的块？特别是对于双栏PDF。

处理双栏PDF时，以论文“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ”为例。阅读顺序如图中的红色箭头所示：

识别布局后，unstructured框架会将每一页分割成多个矩形块，如图8所示。

每个矩形块的详细信息可以以下列格式获取：

[

LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text='word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9517233967781067, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text='• We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9414362907409668, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text='ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-speciﬁc architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.938507616519928, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text='• We show that pre-trained representations reduce the need for many heavily-engineered task- speciﬁc architectures. BERT is the ﬁrst ﬁne- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-speciﬁc architectures. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9461237788200378, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text='• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9213815927505493, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text='2 Related Work ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8663332462310791, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text='There is a long history of pre-training general lan- guage representations, and we brieﬂy review the most widely-used approaches in this section. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.928022563457489, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text='2.2 Unsupervised Fine-tuning Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8346447348594666, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text='As with the feature-based approaches, the ﬁrst works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9344717860221863, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text='2.1 Unsupervised Feature-based Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8317819237709045, image_path=None, parent=None), 

LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text='Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering signiﬁcant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ', source=<Source.YOLOX: 'y

```python
        (x_1, y_1) --------
            |             |
            |             |
            |             |
            ---------- (x_2, y_2)

此时，您可以选择调整页面的阅读顺序。Unstructured 自带一个排序算法，但在处理双栏情况时，我发现排序结果并不十分令人满意。

因此，有必要设计一个算法。最简单的方法是先按左上顶点的水平坐标排序，如果水平坐标相同，则再按垂直坐标排序。其伪代码如下：

layout.sort(key=lambda z: (z.bbox.x1, z.bbox.y1, z.bbox.x2, z.bbox.y2))

然而，我们发现即使是在同一栏中的块，其水平坐标也可能有所不同。如图 9 所示，紫色线条块的水平坐标 bbox.x1 实际上更靠左。排序时，它会排在绿色线条块之前，这显然违反了阅读顺序。

在这种情况下，可以使用以下算法：

首先，对所有左上 x 坐标 **x1** 进行排序，得到 **x1_min**
然后，对所有右下 x 坐标 **x2** 进行排序，得到 **x2_max**
接下来，确定页面中心线的 x 坐标为：

x1_min = min([el.bbox.x1 for el in layout])
x2_max = max([el.bbox.x2 for el in layout])
mid_line_x_coordinate = (x2_max + x1_min) /  2

接下来，**如果 bbox.x1 < mid_line_x_coordinate**，则该块被归类为左栏的一部分。否则，它被视为右栏的一部分。

分类完成后，根据 y 坐标对每栏中的块进行排序。最后，将右栏连接到左栏的右侧。

left_column = []
right_column = []
for el in layout:
    if el.bbox.x1 < mid_line_x_coordinate:
        left_column.append(el)
    else:
        right_column.append(el)

left_column.sort(key = lambda z: z.bbox.y1)
right_column.sort(key = lambda z: z.bbox.y1)
sorted_layout = left_column + right_column

值得一提的是，这一改进同样适用于单栏 PDF。

挑战 3：如何提取多级标题

提取标题，包括多级标题的目的，是为了提高 LLM 回答的准确性。

例如，如果用户想知道图 9 中第 2.1 节的主要内容，通过准确提取第 2.1 节的标题，并将其与相关内容一起作为上下文发送给 LLM，最终答案的准确性将显著提高。

该算法仍然依赖于图 9 中显示的布局块。我们可以提取 **type=’Section-header’** 的块，并计算高度差 (**bbox.y2 — bbox.y1**)。高度差最大的块对应一级标题，其次是二级标题，然后是三级标题。

基于多模态大模型的PDF复杂结构解析

多模态模型爆发后，我们也可以利用多模态模型来解析表格。有几种选择：

检索相关图像（PDF页面）并发送至GPT4-V以响应查询。
将每个PDF页面视为图像，让GPT4-V对每个页面进行图像推理。为图像推理构建文本向量存储索引。针对图像推理向量存储进行查询以获取答案。
使用Table Transformer从检索到的图像中裁剪表格信息，然后将这些裁剪后的图像发送至GPT4-V以响应查询。
对裁剪后的表格图像应用OCR，并将数据发送至GPT4/GPT-3.5以回答查询。

经过测试，确定第三种方法最为有效。

此外，我们可以利用多模态模型从图像（PDF文件可轻松转换为图像）中提取或总结关键信息，如图10所示。

结论

总体而言，非结构化文档提供了高度的灵活性，并需要多种解析技术。然而，目前尚无关于最佳方法的共识。

在这种情况下，建议选择最适合您项目需求的方法。根据不同类型的PDF文件，建议采用特定的处理方式。例如，论文、书籍和财务报表可能基于其特性具有独特的设计。

尽管如此，如果条件允许，仍建议选择基于深度学习或多模态的方法。这些方法能有效将文档分割成定义明确且完整的信息单元，从而最大限度地保留文档的意图和结构。

最后，如果您有任何疑问，欢迎在评论区讨论。

Barry's Home

高级RAG 02揭秘PDF解析