Recursive Retriever + Document Agents
This guide shows how to combine recursive retrieval with "document agents" for advanced decision-making over heterogeneous documents.
There are two motivating factors behind better retrieval solutions:
- Decoupling retrieval embeddings from chunk-based synthesis. Often, fetching a document via its summary returns more relevant context for a query than the raw chunks do; this is something recursive retrieval enables directly (see the sketch after this list).
- Within a document, users may need to dynamically perform tasks beyond fact-based question answering. We introduce the concept of "document agents": agents that have access to both vector search and summary tools for a given document.
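To make the mechanism concrete before diving in, here is a minimal sketch of recursive retrieval with an attached object: an IndexNode embeds only the summary text, while its obj field carries a query engine that the top-level query engine recurses into whenever that node is retrieved. The document text and the query are placeholders; the imports and API calls are the same ones used later in this guide.

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.schema import IndexNode

# Query engine over the raw chunks of one document (placeholder text)
doc_index = VectorStoreIndex.from_documents([Document(text="full article text")])
doc_query_engine = doc_index.as_query_engine()

# Embed the summary for retrieval; answer via the attached query engine
summary_node = IndexNode(
    text="A short summary used only for retrieval.",
    index_id="article",
    obj=doc_query_engine,
)
top_query_engine = VectorStoreIndex(objects=[summary_node]).as_query_engine(
    similarity_top_k=1
)
response = top_query_engine.query("a question about the article")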
Setup and Download Data
In this section, we define our imports and then download Wikipedia articles about different cities. Each article is stored separately.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai
In [ ]
!pip install llama-index
In [ ]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
In [ ]
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
In [ ]
from pathlib import Path
import requests
for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    # write as UTF-8 so non-ASCII article text is preserved on any platform
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)
In [ ]
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
Define the LLM and Global Settings
In [ ]
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
In [ ]
from llama_index.core import Settings
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
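The embedding model is left at its default here (OpenAI's). If you want to pin it explicitly, a minimal sketch follows; it assumes the llama-index-embeddings-openai package that ships with the llama-index bundle, and the model name is only an example.

from llama_index.embeddings.openai import OpenAIEmbedding

# Optional: pin the embedding model rather than relying on the default
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")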
Build Document Agent for each Document
In this section we define a "document agent" for each document.
First we define both a vector index (for semantic search) and a summary index (for summarization) over each document. The two query engines are then converted into tools that are passed to an OpenAI function-calling agent.
This document agent can dynamically choose to perform semantic search or summarization within a given document.
We create a separate document agent for each city.
In [ ]
from llama_index.agent.openai import OpenAIAgent
# Build agents dictionary
agents = {}
for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
    )

    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title],
    )

    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for summarization questions related to"
                    f" {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )
    agents[wiki_title] = agent
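As a quick sanity check (not part of the original flow), you can exercise one of the document agents directly before composing them; the question is illustrative.

# Illustrative: chat with a single document agent directly
response = agents["Boston"].chat("What sports teams are based in Boston?")
print(response)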
Build Composable Retriever over these Agents
Now we define a set of summary nodes, each of which links to the corresponding Wikipedia city article. We then define a composable retriever + query engine on top of these nodes: a query is routed to a given node, which in turn routes it to the relevant document agent.
In [ ]
# define top-level nodes
objects = []
for wiki_title in wiki_titles:
    # define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)
In [ ]
# define top-level retriever
vector_index = VectorStoreIndex(
    objects=objects,
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)
Running Example Queries
In [ ]
# should use Boston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Boston")
Retrieval entering Boston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Boston
Added user message to memory: Tell me about the sports teams in Boston
In [ ]
print(response)
Boston is home to several professional sports teams across different leagues, including a successful baseball team in Major League Baseball, a highly successful American football team in the National Football League, one of the most successful basketball teams in the NBA, a professional ice hockey team in the National Hockey League, and a professional soccer team in Major League Soccer. These teams have a rich history, passionate fan bases, and have achieved great success both locally and nationally.
In [ ]
# should use Houston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Houston")
Retrieval entering Houston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Houston
Added user message to memory: Tell me about the sports teams in Houston
In [ ]
print(response)
Houston is home to several professional sports teams across different leagues, including the Houston Texans in the NFL, the Houston Rockets in the NBA, the Houston Astros in MLB, the Houston Dynamo in MLS, and the Houston Dash in NWSL. These teams compete in football, basketball, baseball, soccer, and women's soccer respectively, and have achieved various levels of success in their respective leagues. Additionally, the city also has minor league baseball, hockey, and other sports teams that cater to sports enthusiasts.
In [ ]
# should use Chicago agent -> summary tool
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)
Retrieval entering Chicago: OpenAIAgent
Retrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago
Added user message to memory: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {
  "input": "positive aspects of Chicago"
}
Got output: Chicago is recognized for its robust economy, acting as a key hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It stands out in the derivatives market and is a top-ranking city in terms of gross domestic product. Chicago is a favored destination for tourists, known for its rich art scene covering visual arts, literature, film, theater, comedy, food, dance, and music. The city hosts prestigious educational institutions and professional sports teams across different leagues.
========================
In [ ]
print(response)
Chicago is known for its strong economy with a focus on finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It is a major player in the derivatives market and boasts a high gross domestic product. The city is a popular tourist destination with a vibrant art scene that includes visual arts, literature, film, theater, comedy, food, dance, and music. Additionally, Chicago is home to prestigious educational institutions and professional sports teams across various leagues.
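Finally, note that each node's summary text tells the retriever not to use a single-city index for multi-city analysis, yet with similarity_top_k=1 only one agent is ever consulted. The sketch below illustrates that limitation; the query is only an example, and raising similarity_top_k lets multiple agents contribute to the synthesized answer.

# Illustrative: a cross-city question. With similarity_top_k=1 only one
# city agent is consulted, so the answer may cover just one city; raise
# similarity_top_k for queries that span multiple cities.
response = query_engine.query("Compare the economies of Chicago and Houston")
print(response)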