Building Data Ingestion from Scratch¶

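In this tutorial, we show you how to build a data ingestion pipeline into a vector database.

We use Pinecone as the vector database.

We will show how to do the following:

How to load in documents.
How to use a text splitter to split documents.
How to manually construct nodes from each text chunk.
[Optional] Add metadata to each Node.
How to generate embeddings for each text chunk.
How to insert into a vector database.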

Pinecone¶

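You will need a pinecone.io API key for this tutorial. You can sign up for free to get a Starter account.

If you create a Starter account, you can name your application anything you like.

Once you have an account, navigate to "API Keys" in the Pinecone console. You can use the default key or create a new one for this tutorial.

Save your API key and its environment (gcp_starter for free accounts). You will need them below.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.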

%pip install llama-index-embeddings-openai
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-llms-openai

!pip install llama-index

OpenAI¶

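You will need an OpenAI API key for this tutorial. Log in to your platform.openai.com account, click your profile picture in the upper right corner, and choose "API Keys" from the menu. Create an API key for this tutorial and save it. You will need it below.

Environment¶

First we add our dependencies.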

!pip -q install python-dotenv pinecone-client llama-index pymupdf

Set Environment Variables¶

We create a file for our environment variables. Do not commit this file or share it!

Note: Google Colab will let you create but not open a .env file.

dotenv_path = (
    "env"  # Google Colab will not let you open a .env file, but you can create one
)
with open(dotenv_path, "w") as f:
    f.write('PINECONE_API_KEY="<your api key>"\n')
    f.write('OPENAI_API_KEY="<your api key>"\n')

Set your OpenAI API key and your Pinecone API key and environment in the file we created.

import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=dotenv_path)
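
Before going further, it can help to confirm that both keys were actually loaded and that the placeholder text was replaced. The following is a minimal sketch using only the standard library. The helper name `find_missing_keys` and the placeholder convention (values starting with `<`) are assumptions for illustration and are not part of python-dotenv.

```python
import os


def find_missing_keys(env, required=("PINECONE_API_KEY", "OPENAI_API_KEY")):
    """Return the required keys that are unset, empty, or still placeholders."""
    missing = []
    for key in required:
        value = env.get(key, "").strip()
        # Treat "<your api key>"-style values as not yet filled in.
        if not value or value.startswith("<"):
            missing.append(key)
    return missing


# Hypothetical usage: run after load_dotenv and inspect the result.
missing_keys = find_missing_keys(os.environ)
if missing_keys:
    print("Please set these keys in the env file:", ", ".join(missing_keys))
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching real credentials.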

Setup¶

We build an empty Pinecone index and define the necessary LlamaIndex wrappers/abstractions so that we can start loading data into Pinecone.

Note: Do not save your API keys in the code or add pinecone_env to your repo!

from pinecone import Pinecone, Index, ServerlessSpec

api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

index_name = "llamaindex-rag-fs"

# [Optional] Delete the index before re-running the tutorial.
# pc.delete_index(index_name)

# dimensions are for text-embedding-ada-002
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
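
The index above uses `metric="euclidean"` with 1536 dimensions. OpenAI's documentation describes its embeddings as normalized to unit length. Under that assumption, squared Euclidean distance and cosine similarity are related by d^2 = 2 - 2 cos, so both metrics rank neighbors in the same order. A small standard-library sketch of that relationship follows; the vectors are made-up 3-dimensional examples, not real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 3.0])
candidates = [
    normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, 0.5, 0.1], [0.0, 1.0, 1.0])
]

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
for c in candidates:
    assert math.isclose(
        squared_euclidean(query, c), 2 - 2 * cosine(query, c), abs_tol=1e-9
    )

# Rank candidate positions by each metric (closest / most similar first).
by_distance = sorted(
    range(len(candidates)), key=lambda i: squared_euclidean(query, candidates[i])
)
by_cosine = sorted(range(len(candidates)), key=lambda i: -cosine(query, candidates[i]))
```

If the embeddings were not normalized, the two rankings could differ, and the choice of metric would matter more.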

pinecone_index = pc.Index(index_name)

# [Optional] drop contents in index - will not work on free accounts
pinecone_index.delete(delete_all=True)

Create PineconeVectorStore¶

A simple wrapper abstraction to use in LlamaIndex. Wrap it in a StorageContext so we can easily load in Nodes.

from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

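Build an Ingestion Pipeline from Scratch¶

We show how to build an ingestion pipeline as mentioned in the introduction.

Note that steps (2) and (3) can be handled via our NodeParser abstractions, which handle splitting and node creation.

For the purposes of this tutorial, we show you how to create these objects manually.

1. Load Data¶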

!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

--2023-10-13 01:45:14--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M  7.59MB/s    in 1.7s    

2023-10-13 01:45:16 (7.59 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]

import fitz

file_path = "./data/llama2.pdf"
doc = fitz.open(file_path)

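2. Use a Text Splitter to Split Documents¶

Here we import our SentenceSplitter to split document texts into smaller chunks, while preserving paragraphs/sentences as much as possible.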

from llama_index.core.node_parser import SentenceSplitter

text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
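
The loop above keeps `text_chunks` and `doc_idxs` aligned so that each chunk can later be traced back to its page. To illustrate that bookkeeping without PyMuPDF or LlamaIndex, here is a simplified sketch. `naive_sentence_chunks` is a hypothetical stand-in that packs sentences into chunks by character count, whereas the real `SentenceSplitter` counts tokens and handles overlap. The page texts are made up.

```python
import re


def naive_sentence_chunks(text, max_chars=80):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A sentence longer than max_chars becomes its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pages = [
    "Llama 2 is a collection of models. They range from 7B to 70B parameters.",
    "Safety is discussed at length. Fine-tuning uses human feedback. Results follow.",
]

text_chunks, doc_idxs = [], []
for doc_idx, page_text in enumerate(pages):
    cur_chunks = naive_sentence_chunks(page_text, max_chars=50)
    text_chunks.extend(cur_chunks)
    # One page index per chunk keeps the two lists aligned.
    doc_idxs.extend([doc_idx] * len(cur_chunks))
```

Because the two lists always grow together, `doc_idxs[i]` identifies the page that produced `text_chunks[i]`.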

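3. Manually Construct Nodes from Text Chunks¶

We convert each chunk into a TextNode object, a low-level data abstraction in LlamaIndex that stores content but also allows defining metadata and relationships with other Nodes.

We inject metadata from the document into each node.

This essentially replicates logic in our SentenceSplitter.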

from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    nodes.append(node)
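
The loop above looks up `src_page` for each chunk but does not attach anything from it to the node. One possible way to carry page information along is sketched below. `SimpleNode` is a hypothetical stand-in so the example runs without LlamaIndex installed; with the real class, the same dictionary could be passed as `TextNode(text=..., metadata=...)`. The `page_number` and `source` keys are illustrative choices, not names the tutorial prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleNode:
    """Hypothetical stand-in for a text node: content plus free-form metadata."""

    text: str
    metadata: dict = field(default_factory=dict)


def build_nodes(text_chunks, doc_idxs, source):
    """Create one node per chunk, tagging each with its 1-based page number."""
    built = []
    for text_chunk, src_doc_idx in zip(text_chunks, doc_idxs):
        built.append(
            SimpleNode(
                text=text_chunk,
                metadata={"page_number": src_doc_idx + 1, "source": source},
            )
        )
    return built


example_nodes = build_nodes(
    ["first chunk", "second chunk", "third chunk"],
    [0, 0, 1],
    source="data/llama2.pdf",
)
```

Keeping metadata small and flat (strings and numbers) is a cautious choice here, since vector databases often restrict the types they accept as metadata values.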

print(nodes[0].metadata)

# print a sample node
print(nodes[0].get_content(metadata_mode="all"))
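
With `metadata_mode="all"`, the returned content includes the node's metadata in addition to its text, which is why metadata added later also influences the embedding. The sketch below approximates that composition using one `key: value` line per metadata entry, followed by a blank line and the text. This layout is an assumption modeled on LlamaIndex's default templates, and `compose_content` is a hypothetical helper, not a library function.

```python
def compose_content(text, metadata, include_metadata=True):
    """Join metadata lines and text, roughly like get_content(metadata_mode="all")."""
    if not include_metadata or not metadata:
        return text
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{metadata_str}\n\n{text}"


sample = compose_content(
    "Llama 2 is a collection of pretrained models.",
    {"page_number": 1, "document_title": "Llama 2"},
)
print(sample)
```

A node with empty metadata yields only its text, which matches what you would expect to see before any metadata has been attached.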

[Optional] 4. Extract Metadata from each Node¶

We extract metadata from each Node using our Metadata extractors.

This adds more metadata to each Node.

from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
]

pipeline = IngestionPipeline(
    transformations=extractors,
)
nodes = await pipeline.arun(nodes=nodes, in_place=False)

print(nodes[0].metadata)

5. Generate Embeddings for each Node¶

Generate document embeddings for each Node using our OpenAI embedding model (text-embedding-ada-002).

Store these on the embedding property of each Node.

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
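
The loop above makes one embedding request per node. When there are many nodes, sending texts in batches can reduce the number of round trips; LlamaIndex embedding models also expose `get_text_embedding_batch` for this purpose. The sketch below shows only the batching pattern, with a hypothetical `fake_embed_batch` function standing in for a real embedding call so that it runs offline. The batch size of 2 is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fake_embed_batch(texts):
    """Hypothetical stand-in for an embedding API: one tiny vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


texts = ["alpha beta", "gamma", "delta epsilon zeta", "eta", "theta iota"]
embeddings = []
calls = 0
for batch in batched(texts, batch_size=2):
    # One call per batch instead of one call per text.
    embeddings.extend(fake_embed_batch(batch))
    calls += 1
```

Five texts with a batch size of 2 result in three calls, and the embeddings stay in the same order as the input texts, so they can be assigned back to the nodes by position.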

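6. Load Nodes into a Vector Store¶

We now insert these nodes into our PineconeVectorStore.

Note: We skip the VectorStoreIndex abstraction, which is a higher-level abstraction that handles ingestion as well. We use VectorStoreIndex in the next section to fast-track retrieval/querying.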

vector_store.add(nodes)
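
Conceptually, `add` stores each node's ID, embedding, and content so that the store can later return the nearest entries for a query vector. The toy class below illustrates that idea in memory with Euclidean distance, matching the metric chosen for the index. `ToyVectorStore` and its methods are hypothetical and much simpler than `PineconeVectorStore`, which sends the data to a remote index.

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (id, vector, text), brute-force search."""

    def __init__(self):
        self._rows = {}

    def add(self, rows):
        """Insert or overwrite rows given as (node_id, embedding, text) tuples."""
        for node_id, embedding, text in rows:
            self._rows[node_id] = (list(embedding), text)
        return list(self._rows)

    def query(self, query_embedding, top_k=2):
        """Return the top_k (node_id, distance, text) closest by Euclidean distance."""
        scored = [
            (node_id, math.dist(query_embedding, vector), text)
            for node_id, (vector, text) in self._rows.items()
        ]
        return sorted(scored, key=lambda row: row[1])[:top_k]


store = ToyVectorStore()
store.add(
    [
        ("n1", [0.0, 0.0], "intro"),
        ("n2", [1.0, 1.0], "safety fine-tuning"),
        ("n3", [5.0, 5.0], "appendix"),
    ]
)
results = store.query([0.9, 1.2], top_k=2)
```

Keying rows by node ID means that adding the same ID twice overwrites the earlier entry instead of duplicating it, which is one reasonable design when a tutorial may be re-run.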

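Retrieve and Query from the Vector Store¶

Now that our ingestion is complete, we can retrieve/query this vector store.

Note: We can use our high-level VectorStoreIndex abstraction here. See the next section to see how to define retrieval at a lower level!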

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()

query_str = "Can you tell me about the key concepts for safety finetuning"

response = query_engine.query(query_str)

print(str(response))