有效的AI驱动检索的关键

聪明的人都是懒惰的。 他们寻找最有效的方法来解决复杂的问题，最小化努力同时最大化结果。

在生成AI应用中，这种效率是通过分块实现的。就像将一本书分成章节使其更易于阅读一样，分块将重要文本分成更小、更易于处理和理解的部分。

在探讨分块的机制之前，了解这一技术所运作的更广泛框架是至关重要的：检索增强生成（Retrieval-Augmented Generation，RAG）。

什么是 RAG？

检索增强生成（RAG）是一种将检索机制与大型语言模型（LLM 模型）相结合的方法。它利用检索到的文档增强 AI 能力，以生成更准确和具有上下文丰富性的响应。

引入分块处理

分块处理是将大段文本拆分成更小、更易管理的部分。这个过程主要分为两个阶段：

数据准备：可靠的数据源被分割成块文档并存储在数据库中。如果在块中生成嵌入，数据库可以是一个向量存储。
检索：当用户提出问题时，系统通过向量搜索、全文搜索或两者的组合在文档块中进行搜索。这个过程识别并检索与用户查询最相关的块。

为什么 Chunking 在 RAG 架构中至关重要

Chunking 在 RAG 架构中是绝对必要的，因为它是决定您的 Gen AI 应用程序准确性的第一个元素。

块应该小以提高准确性： Chunking 使系统能够对较小的文本片段进行索引和搜索，从而提高找到相关文档的准确性。当发出查询时，系统可以快速定位最相关的块，从而提高检索过程的精确度。
块应该大以增强上下文生成： 不是所有块都应该小。通过处理较小的块，生成模型可以更好地理解和利用每个片段提供的上下文。这会导致更连贯和上下文准确的响应，因为模型可以利用特定的、相关的信息，而不是在一个大型、未分割的文档中筛选。
可扩展性和性能： Chunking 允许对大数据集进行更可扩展和高效的处理。它通过将数据分解为可管理的部分来减少计算负担，这些部分可以并行处理，从而提高 RAG 系统的整体性能。然而，应确保可扩展性

Chunking 是一种技术必要性和战略方法，确保强大、高效和可扩展的 RAG 系统。它增强了检索准确性、处理效率和资源利用率，在 RAG 应用的成功中发挥着至关重要的作用。

改进分块的技术

几种技术可以改进分块，从基本方法到高级方法不等：

固定字符大小： 简单明了，将文本分成固定字符数的块。
递归字符文本分割： 使用空格或标点符号等分隔符来创建更具上下文意义的块。
文档特定分割： 根据文档类型（如PDF或Markdown文件）定制分块方法。
语义分割： 使用嵌入根据语义内容对文本进行分块。
代理分割： 利用大型语言模型根据内容和上下文确定最佳分块方式。

通过采用这些技术，RAG系统可以实现更高的性能和更准确的结果，巩固其作为AI中重要工具的角色。

固定字符大小

固定字符大小分块是拆分文本的最基本方法。这种方法涉及将文本分成预定数量的字符块，而不考虑内容。这种方法简单明了，但缺乏对文本结构和上下文的考虑，这可能导致块的意义较低。

优点：

简单性： 实现简单，所需的计算资源极少。
一致性： 生成统一的块，简化后续处理。

缺点：

上下文忽视： 忽略文本的结构和含义，导致信息碎片化。
低效率： 可能会切断重要的上下文，需要额外处理以重新组合有意义的信息。

这里是如何使用之前提供的代码实现固定字符大小分块的示例：

# Sample text to chunk
text = "This is the text I would like to chunk up. It is the example text for this exercise."

# Set the chunk size
chunk_size = 35
# Initialize a list to hold the chunks
chunks = []
# Iterate over the text to create chunks
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
# Display the chunks
print(chunks)
# Output: ['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise']

使用 LangChain 的 CharacterTextSplitter 实现相同的结果：

from langchain.text_splitter import CharacterTextSplitter

# Initialize the text splitter with specified chunk size
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output: 
# This is the text I would like to ch
# unk up. It is the example text for 
# this exercise

固定字符大小分块是一种简单而基础的技术，通常作为在更复杂方法之前的基线。

递归字符文本拆分

递归字符文本拆分是一种更高级的技术，它考虑了文本的结构。它使用一系列分隔符递归地将文本划分为更有意义且上下文相关的块。

在上述示例中，块大小为30个字符，重叠为20个字符，RecursiveCharacterTextSplitter将尝试在保持逻辑边界的同时拆分文本。然而，这也表明由于块大小较小，它仍可能在单词或句子的中间进行拆分，这并不是最优的。

优点：

改善上下文： 该方法通过使用段落或句子等分隔符保留文本的自然结构。
灵活性： 允许不同的块大小和重叠，为块处理过程提供更好的控制。

缺点：

块大小很重要： 它应该是可管理的，但仍然至少包含一个短语或更多。否则，我们需要在检索块时获得精确度。
性能开销： 由于递归拆分和处理多个分隔符，要求更多的计算资源。而且与固定大小的块相比，我们生成的块更多。

以下是如何在 Langchain 中实现递归字符文本拆分的示例：

%pip install -qU langchain-text-splitters

首先安装 long-chain-text-splitters 库，如果您还没有这样做的话。

from langchain_text_splitters import RecursiveCharacterTextSplitter
# Sample text to chunk
text = """
The Olympic Games, originally held in ancient Greece, were revived in 1896 and
have since become the world’s foremost sports competition, bringing together 
athletes from around the globe.
"""
# Initialize the recursive character text splitter with specified chunk size
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=30,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# “The Olympic Games, originally”
# “held in ancient Greece, were”
# “revived in 1896 and have”
# “have since become the world’s”
# “world’s foremost sports”
# “competition, bringing together”
# “together athletes from around”
# “around the globe.”

在此方法中，文本首先按较大的结构（如段落）进行拆分，如果块仍然太大，则使用较小的结构（如句子）进一步拆分。每个块保持有意义的上下文，避免截断重要信息。

递归字符文本拆分在简单性和复杂性之间取得了平衡，提供了一种强大的块处理方法，尊重文本的固有结构。

文档特定拆分

文档特定拆分根据不同的文档类型定制分块过程，例如 Markdown 文件、Python 脚本、JSON 文档或 HTML，确保每种类型以最适合其内容和结构的方式进行拆分。

例如，Markdown 在 GitHub、Medium 和 Confluence 等平台上被广泛使用，使其成为 RAG 系统中摄取的自然选择，在这些系统中，干净、结构化的数据对于生成准确的响应至关重要。

此外，还为各种编程语言提供了特定语言的拆分器，包括 C++、Go、Java、Python 等，确保代码能够有效地进行分析和检索。

优点：

相关性： 使用最合适的方法对不同文档类型进行拆分，保留其逻辑结构。
精确性： 根据每种文档类型的独特特征定制拆分过程。

缺点：

复杂的实现： 需要针对不同文档类型采用不同的分块策略和库。
维护： 由于方法的多样性，维护变得更加复杂。

这里是一个如何实现针对Markdown和Python文件的文档特定拆分的示例：

Markdown 分割

from langchain.text_splitter import MarkdownTextSplitter
# 示例 Markdown 文本
markdown_text = """
# 加州的乐趣
## 驾驶
尝试沿着 1 号公路开车前往圣地亚哥
### 食物
确保在那里的时候吃一个卷饼
## 徒步旅行
去优胜美地
"""
# 初始化 Markdown 文本分割器
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
# 使用文本分割器创建文档
documents = splitter.create_documents([markdown_text])
# 显示创建的文档
for doc in documents:
    print(doc.page_content)
# 输出:
# # 加州的乐趣\n\n## 驾驶
# 尝试沿着 1 号公路开车前往圣地亚哥
# ### 食物
# 确保在那里的时候吃一个卷饼
# ## 徒步旅行\n\n去优胜美地

Python 代码拆分

from langchain.text_splitter import PythonCodeTextSplitter
# Sample Python code
python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
"""
# Initialize the Python code text splitter
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
# Create documents using the text splitter
documents = python_splitter.create_documents([python_text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age
# p1 = Person("John", 36)\n\nfor i in range(10):\n    print(i)

特定文档的拆分保留了文档的逻辑结构，使得块更加有意义且上下文准确。例如，在Markdown文件中，标题和部分是分开的，而在Python代码中使用类和函数。

这种方法通过保持不同文档类型的完整性，增强了系统检索和生成相关响应的能力，从而提高了RAG系统的整体性能和准确性。

语义拆分

与以前的任意长度或语法规则的拆分方法不同，语义拆分通过使用文本的含义来确定块边界，将分块提升到一个新的水平。

该方法利用嵌入将语义相似的内容分组，确保每个块包含上下文一致的信息。

上面的图示说明了语义分块的工作流程，从句子拆分开始，然后生成嵌入，最后根据相似性对句子进行分组。该过程确保块在语义上是一致的，从而增强信息检索的相关性和准确性。

让我们通过一个示例来看一下该方法的输出。

该图提供了一个实际示例，说明如何使用余弦相似度将句子分组为块。主题相关的句子被分组，而意义不同的句子则保持分开。视觉解释清楚地说明了如何应用语义分块以保持文本中的上下文和一致性。

优点：

上下文相关性： 确保内容块包含语义相似的内容，提高信息检索和生成的准确性。
动态适应性： 可以根据意义而非严格规则适应各种文本结构和内容类型。

缺点：

计算开销： 需要额外的计算资源来生成和比较嵌入。
复杂性： 与更简单的拆分方法相比，实现起来更复杂。

以下是如何使用嵌入实现语义拆分的示例。此代码来自 Greg Kamradt 的笔记本：5_Levels_Of_Text_Splitting 。

from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings
import re
# Sample text
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer.
"""
# Splitting the text into sentences
sentences = re.split(r'(?<=[.?!])\s+', text)
sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(sentences)]
# Combine sentences for context
def combine_sentences(sentences, buffer_size=1):
    for i in range(len(sentences)):
        combined_sentence = ''
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]['sentence'] + ' '
        combined_sentence += sentences[i]['sentence']
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += ' ' + sentences[j]['sentence']
        sentences[i]['combined_sentence'] = combined_sentence
    return sentences
sentences = combine_sentences(sentences)
# Generate embeddings
oai_embeds = OpenAIEmbeddings()
embeddings = oai_embeds.embed_documents([x['combined_sentence'] for x in sentences])
# Add embeddings to sentences
for i, sentence in enumerate(sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]
# Calculate cosine distances
def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]['distance_to_next'] = distance
    return distances, sentences
distances, sentences = calculate_cosine_distances(sentences)
# Determine breakpoints and create chunks
import numpy as np
breakpoint_distance_threshold = np.percentile(distances, 95)
indices_above_thresh = [i for i, x in enumerate(distances) if x > breakpoint_distance_threshold]
# Combine sentences into chunks
chunks = []
start_index = 0
for index in indices_above_thresh:
    end_index = index
    group = sentences[start_index:end_index + 1]
    combined_text = ' '.join([d['sentence'] for d in group])
    chunks.append(combined_text)
    start_index = index + 1
if start_index < len(sentences):
    combined_text = ' '.join([d['sentence'] for d in sentences[start_index:]])
    chunks.append(combined_text)
# Display the created chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i+1}:\n{chunk}\n")

语义拆分使用嵌入创建语义相似的块，提高了 RAG 系统中的检索准确性和上下文生成。专注于文本的含义确保每个块包含连贯且相关的信息，从而增强 RAG 应用的性能和可靠性。

代理分割

代理分割利用大型语言模型的能力，根据文本的语义理解动态创建块。

这种先进的方法通过评估内容和上下文来确定最佳块边界，模仿人类的分块方式。

代理分割器并不依赖于预定义的规则或纯粹的统计方法，而是通过动态评估内容来处理文本，类似于一个人阅读文档并根据思想的流动和句子的上下文决定在哪里分割。这种方法增强了结果块的连贯性和相关性。

优点：

高精度： 通过使用复杂的语言模型，提供高度相关和上下文准确的片段。
适应性： 能够处理多种类型的文本，并动态调整分块策略。

缺点：

资源密集和额外的 LLM 成本： 运行大型语言模型需要大量的计算资源。
复杂的实施： 涉及设置和微调语言模型以达到最佳性能。

如何在 LangGraph 中实现代理分割器

了解 LangGraph 中的节点： LangGraph 中的节点代表工作流程中的操作或步骤。每个节点接收输入，处理它，并生成传递给下一个节点的输出。

from langgraph.nodes import InputNode, SentenceSplitterNode, LLMDecisionNode, ChunkingNode

# Step 1: Input Node
input_node = InputNode(name="Document Input")

# Step 2: Sentence Splitting Node
splitter_node = SentenceSplitterNode(input=input_node.output, name="Sentence Splitter")

# Step 3: LLM Decision Node
decision_node = LLMDecisionNode(
    input=splitter_node.output, 
    prompt_template="Does the sentence '{next_sentence}' belong to the same chunk as '{current_chunk}'?", 
    name="LLM Decision"
)

# Step 4: Chunking Node
chunking_node = ChunkingNode(input=decision_node.output, name="Semantic Chunking")

# Run the graph
document = "Your document text here..."
result = chunking_node.run(document=document)
print(result)

结论

总之，分块是优化检索增强生成（RAG）系统的关键策略，可以实现更准确、上下文相关和可扩展的响应。

通过将大型文本拆分为可管理的部分，我们提高了检索的准确性，并改善了人工智能应用的整体效率。

采用先进的分块技术对人工智能驱动解决方案的持续成功和进步至关重要。

Barry's Home

分块艺术提升 RAG 架构中 AI 性能