PostgresML Managed Index
In this notebook we show how to use PostgresML with LlamaIndex.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index-indices-managed-postgresml
!pip install llama-index
from llama_index.indices.managed.postgresml import PostgresMLIndex
from llama_index.core import SimpleDirectoryReader
# Need this as asyncio can get pretty wild with notebooks and this prevents event loop errors
import nest_asyncio
nest_asyncio.apply()
Load Documents
Load the paul_graham_essay.txt document.
!mkdir data
!curl -o data/paul_graham_essay.txt https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
documents = SimpleDirectoryReader("data").load_data()
print(f"documents loaded into {len(documents)} document objects")
print(f"Document ID of first doc is {documents[0].doc_id}")
Upsert Documents into Your PostgresML Database
First, let's set the URL for our PostgresML database. If you don't have a URL yet, you can create one for free here: https://postgresml.org/signup
# Let's set some secrets we need
from google.colab import userdata
PGML_DATABASE_URL = userdata.get("PGML_DATABASE_URL")
# If you don't have those secrets set, uncomment the lines below and run them instead
# Make sure to replace {REPLACE_ME} with your keys
# PGML_DATABASE_URL = "{REPLACE_ME}"
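The cell above assumes you are running in Colab. Outside of Colab, `google.colab.userdata` is not available; a minimal alternative (an assumption on my part, not part of the original notebook) is to read the connection string from an environment variable:

```python
import os

# Fallback for non-Colab environments: read the PostgresML connection
# string from an environment variable exported beforehand, e.g.
#   export PGML_DATABASE_URL="postgres://user:pass@host:port/db"
PGML_DATABASE_URL = os.environ.get("PGML_DATABASE_URL", "")
```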
index = PostgresMLIndex.from_documents(
documents,
collection_name="llama-index-example-demo",
pgml_database_url=PGML_DATABASE_URL,
)
Query the PostgresML Index
Now we can use the PostgresMLIndex retriever to ask questions.
query = "What did the author write about?"
We can use the retriever to search through our list of documents.
retriever = index.as_retriever()
response = retriever.retrieve(query)
texts = [t.node.text for t in response]
print("The Nodes:")
print(response)
print("\nThe Texts")
print(texts)
PostgresML allows for easy reranking in the same query as the retrieval.
retriever = index.as_retriever(
limit=2, # Limit to returning the 2 most related Nodes
rerank={
"model": "mixedbread-ai/mxbai-rerank-base-v1", # Use the mxbai-rerank-base model for reranking
"num_documents_to_rerank": 100, # Rerank up to 100 results returned from the vector search
},
)
response = retriever.retrieve(query)
texts = [t.node.text for t in response]
print("The Nodes:")
print(response)
print("\nThe Texts")
print(texts)
With as_query_engine(), we can ask questions and get the response in one query.
query_engine = index.as_query_engine()
response = query_engine.query(query)
print("The Response:")
print(response)
print("\nThe Source Nodes:")
print(response.get_formatted_sources())
Note that the "response" object above includes not only the summary text but also the source documents used to provide this response (citations). Note that the source nodes are all from the same document. That is because we only uploaded one document, which PostgresML automatically split before embedding it. All of these parameters can be controlled. See the documentation for more information.
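To make the automatic splitting concrete, here is an illustrative client-side sketch of fixed-size chunking with overlap. This is not the PostgresML implementation (splitting happens server-side, with its own configurable parameters); it is only a minimal stand-in to show the idea:

```python
# Illustrative only: PostgresML splits documents server-side before
# embedding them. This sketch shows fixed-size chunking with overlap,
# one common splitting strategy.
def chunk_text(text, size=200, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

chunks = chunk_text("x" * 500, size=200, overlap=50)
print(len(chunks))  # → 4
```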
Streaming can be enabled by passing streaming=True when creating the query_engine.
NOTE: Streaming is painfully slow on Google Colab due to the network connection.
query_engine = index.as_query_engine(streaming=True)
results = query_engine.query(query)
for text in results.response_gen:
print(text, end="", flush=True)
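The response_gen loop above works because the streaming response exposes a plain Python generator of text chunks. A self-contained sketch of the same consumption pattern, using a stand-in generator (the chunk contents are invented for illustration):

```python
# Stand-in for results.response_gen: a generator that yields text
# chunks as they arrive, which we accumulate and print incrementally.
def fake_response_gen():
    for chunk in ["The author ", "wrote about ", "his work."]:
        yield chunk

streamed = ""
for text in fake_response_gen():
    streamed += text
    print(text, end="", flush=True)
print()
```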