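How to Build a Chatbot
LlamaIndex serves as a bridge between your data and large language models (LLMs), providing a toolkit for building query interfaces over your data for tasks such as question answering and summarization.
In this tutorial, we walk you through building a context-augmented chatbot using a data agent. This LLM-powered agent can intelligently carry out tasks over your data. The end result is a chatbot agent equipped with a robust set of data interface tools provided by LlamaIndex for answering queries about your data.
Note: This tutorial builds on earlier work on creating a query interface over SEC 10-K filings - check it out here.
Context
In this guide, we build a "10-K Chatbot" that uses raw UBER 10-K HTML filings hosted on Dropbox. Users can interact with the chatbot to ask questions about the 10-K filings.
Preparation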
%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
%pip install llama-index-question-gen-openai
%pip install unstructured
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# global defaults
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model_name="text-embedding-3-large")
Settings.chunk_size = 512
Settings.chunk_overlap = 64
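To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size windowing with overlap. It is only an illustration of the arithmetic: the splitter LlamaIndex actually applies is more sophisticated (it tries to respect sentence boundaries, for example), and `chunk_tokens` is a hypothetical helper, not a library function.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks


# Hypothetical stand-in for a tokenized document: 1,000 integer "tokens".
tokens = list(range(1000))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
```

With these settings each window advances by 448 tokens, so the last 64 tokens of one chunk reappear at the start of the next.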
Ingest Data
First, let's download the raw 10-K files from 2019-2022.
# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data
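The `!` shell commands above only work inside a notebook. As a rough alternative for a plain Python script, the same download and extraction can be sketched with the standard library. The helper names `download_file` and `extract_zip` are hypothetical; the URL is the one used above.

```python
import urllib.request
import zipfile
from pathlib import Path

UBER_ZIP_URL = "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1"


def download_file(url: str, dest: Path) -> Path:
    """Download url to dest, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_zip(zip_path: Path, dest_dir: Path) -> list:
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (performs a network request, so it is left commented out here):
# archive_path = download_file(UBER_ZIP_URL, Path("data/UBER.zip"))
# extract_zip(archive_path, Path("data"))
```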
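To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can integrate directly with Unstructured, converting any text into a Document format that LlamaIndex can ingest.
The required packages (llama-index-readers-file and unstructured) were installed in the Preparation section above.
We can then use the UnstructuredReader to parse the HTML files into a list of Document objects.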
from llama_index.readers.file import UnstructuredReader
from pathlib import Path
years = [2022, 2021, 2020, 2019]
loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each document for that year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
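The loading loop above assumes that one `UBER_{year}.html` file exists per year. A small pre-flight check can make a missing file easier to diagnose than a reader error. This is only a sketch; `missing_filings` is a hypothetical helper and the directory layout is the one assumed by the loop above.

```python
from pathlib import Path


def missing_filings(data_dir, years):
    """Return the years whose UBER_{year}.html file is absent from data_dir."""
    data_dir = Path(data_dir)
    return [y for y in years if not (data_dir / f"UBER_{y}.html").is_file()]


missing = missing_filings("./data/UBER", [2022, 2021, 2020, 2019])
if missing:
    print(f"Missing 10-K files for: {missing}")
```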
# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")
To load an index from disk, do the following:
# Load indices from disk
from llama_index.core import StorageContext, load_index_from_storage
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
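Since the build cell should be skipped when the indices already exist, the choice between building and loading can be made explicit. The sketch below only inspects the `./storage/{year}` directories used above; `split_by_persisted` is a hypothetical helper, and treating any non-empty directory as a persisted index is an assumption, not a check of the files LlamaIndex writes.

```python
from pathlib import Path


def split_by_persisted(storage_root, years):
    """Partition years into (already persisted, still to build)."""
    storage_root = Path(storage_root)
    persisted, to_build = [], []
    for year in years:
        year_dir = storage_root / str(year)
        # Assumption: a non-empty directory means the index was persisted.
        if year_dir.is_dir() and any(year_dir.iterdir()):
            persisted.append(year)
        else:
            to_build.append(year)
    return persisted, to_build


persisted, to_build = split_by_persisted("./storage", [2022, 2021, 2020, 2019])
print("load from disk:", persisted)
print("build from documents:", to_build)
```

Years in the first list could go through the loading loop, and years in the second through the build loop.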
from llama_index.core.tools import QueryEngineTool
individual_query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=index_set[year].as_query_engine(),
        name=f"vector_index_{year}",
        description=(
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber"
        ),
    )
    for year in years
]
Now we can create the sub question query engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools we defined above.
from llama_index.core.query_engine import SubQuestionQueryEngine
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)
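Conceptually, a sub question query engine breaks one question into per-tool sub-questions, answers each, and combines the results. The toy sketch below mimics only that shape: in the real engine an LLM writes the sub-questions and synthesizes the final answer, and every name here (`decompose`, `run_sub_questions`, the stub engines) is hypothetical rather than part of LlamaIndex.

```python
def decompose(query, tool_names):
    """Toy decomposition: ask the same question of every tool.

    In the real engine an LLM writes the sub-questions; this stand-in
    only mimics the shape of the result.
    """
    return [(name, f"{query} (according to {name})") for name in tool_names]


def run_sub_questions(query, engines):
    """Answer each sub-question with its engine, then combine the answers."""
    sub_questions = decompose(query, list(engines))
    answers = [(name, engines[name](sub_q)) for name, sub_q in sub_questions]
    # Toy synthesis step: join the partial answers instead of calling an LLM.
    return "\n".join(f"[{name}] {answer}" for name, answer in answers)


# Hypothetical stub engines standing in for the per-year query engines.
engines = {
    "vector_index_2021": lambda q: "answer from the 2021 filing",
    "vector_index_2020": lambda q: "answer from the 2020 filing",
}
combined = run_sub_questions("How did revenue change?", engines)
print(combined)
```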
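Setting up the Chatbot Agent
We use a LlamaIndex data agent to set up the outer chatbot agent, which has access to a set of tools. Specifically, we will use a FunctionAgent, which relies on the function-calling capability of the OpenAI API. We want to use the separate tools we defined earlier for each index (one per year), as well as a tool for the sub question query engine defined above.
First, we define a QueryEngineTool for the sub question query engine: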
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="sub_question_query_engine",
    description=(
        "useful for when you want to answer queries that require analyzing"
        " multiple SEC 10-K documents for Uber"
    ),
)
Then, we combine the tools defined above into a single list of tools for the agent:
tools = individual_query_engine_tools + [query_engine_tool]
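Because the agent refers to each tool by its name, it can be worth confirming that the combined list contains no duplicate names. The sketch below works on plain strings that mirror the naming pattern used above; with the real tool objects the names would have to be read from each tool (assumed here to be available as `tool.metadata.name`). `duplicate_names` is a hypothetical helper.

```python
from collections import Counter


def duplicate_names(names):
    """Return the tool names that appear more than once, in sorted order."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


years = [2022, 2021, 2020, 2019]
# Names mirror the tutorial's pattern; plain strings stand in for tool objects.
tool_names = [f"vector_index_{year}" for year in years] + [
    "sub_question_query_engine"
]
print(duplicate_names(tool_names))  # []
```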
Finally, we call FunctionAgent to create the agent, passing in the list of tools we defined above.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))
from llama_index.core.workflow import Context
# Setup the context for this specific interaction
ctx = Context(agent)
response = await agent.run("hi, i am bob", ctx=ctx)
print(str(response))
Hello Bob! How can I assist you today?
If we test it with a query about the 10-K of a given year, the agent will use the relevant vector index tool.
response = await agent.run(
    "What were some of the biggest risk factors in 2020 for Uber?", ctx=ctx
)
print(str(response))
In 2020, some of the biggest risk factors for Uber included:
1. **Legal and Regulatory Risks**: Extensive government regulation and oversight could adversely impact operations and future prospects.
2. **Data Privacy and Security Risks**: Risks related to data collection, use, and processing could lead to investigations, litigation, and negative publicity.
3. **Economic Impact of COVID-19**: The pandemic adversely affected business operations, demand for services, and financial condition due to governmental restrictions and changes in consumer behavior.
4. **Market Volatility**: Volatility in the market price of common stock could affect investors' ability to resell shares at favorable prices.
5. **Safety Incidents**: Criminal or dangerous activities on the platform could harm the ability to attract and retain drivers and consumers.
6. **Investment Risks**: Substantial investments in new technologies and offerings carry inherent risks, with no guarantee of realizing expected benefits.
7. **Dependence on Metropolitan Areas**: A significant portion of gross bookings comes from large metropolitan areas, which may be negatively impacted by various external factors.
8. **Talent Retention**: Attracting and retaining high-quality personnel is crucial, and issues with attrition or succession planning could adversely affect the business.
9. **Cybersecurity Threats**: Cyberattacks and data breaches could harm reputation and operational results.
10. **Capital Requirements**: The need for additional capital to support growth may not be met on reasonable terms, impacting business expansion.
11. **Acquisition Challenges**: Difficulty in identifying and integrating suitable businesses could harm operating results and future prospects.
12. **Operational Limitations**: Potential restrictions in certain jurisdictions may require modifications to the business model, affecting service delivery.
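Finally, if we test it with a query that compares and contrasts risk factors across years, the agent will use the sub question query engine tool.
cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)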
response = await agent.run(cross_query_str, ctx=ctx)
print(str(response))
Here's a comparison of the risk factors for Uber across the years 2020, 2021, and 2022:
- **COVID-19 Impact**:
  - **2020**: The pandemic significantly affected business operations, demand, and financial condition.
  - **2021**: Continued impact of the pandemic was a concern, affecting various parts of the business.
  - **2022**: The pandemic's impact was less emphasized, with more focus on operational and competitive risks.
- **Driver Classification**:
  - **2020**: Not specifically highlighted.
  - **2021**: Potential reclassification of Drivers as employees could alter the business model.
  - **2022**: Continued risk of reclassification impacting operational costs.
- **Competition**:
  - **2020**: Not specifically highlighted.
  - **2021**: Intense competition with low barriers to entry and well-capitalized competitors.
  - **2022**: Competitive landscape challenges due to established alternatives and low barriers to entry.
- **Financial Concerns**:
  - **2020**: Market volatility and capital requirements were major concerns.
  - **2021**: Historical losses and increased operating expenses raised profitability concerns.
  - **2022**: Significant losses and rising expenses continued to raise profitability concerns.
- **User and Personnel Retention**:
  - **2020**: Talent retention was crucial, with risks from attrition.
  - **2021**: Attracting and retaining a critical mass of users and personnel was essential.
  - **2022**: Continued emphasis on retaining Drivers, consumers, and high-quality personnel.
- **Brand and Reputation**:
  - **2020**: Safety incidents and cybersecurity threats could harm reputation.
  - **2021**: Maintaining and enhancing brand reputation was critical, with past negative publicity being a concern.
  - **2022**: Brand and reputation were under scrutiny, with negative media coverage potentially harming prospects.
- **Operational Challenges**:
  - **2020**: Operational limitations and acquisition challenges were highlighted.
  - **2021**: Challenges in managing growth and optimizing organizational structure.
  - **2022**: Historical workplace culture and the need for organizational optimization were critical.
- **Safety and Liability**:
  - **2020**: Safety incidents and liability claims were significant risks.
  - **2021**: Safety incidents and liability claims, especially with vulnerable road users, were concerns.
  - **2022**: Safety incidents and public reporting could impact reputation and financial results.
Overall, while some risk factors remained consistent across the years, such as competition, financial concerns, and safety, the emphasis shifted slightly with the evolving business environment and external factors like the pandemic.
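Setting up the Chatbot Loop
Now that the chatbot is set up, it takes only a few more steps to create a basic interactive loop for chatting with our SEC-augmented chatbot!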
agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))
ctx = Context(agent)
while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = await agent.run(text_input, ctx=ctx)
    print(f"Agent: {response}")
# User: What were some of the legal proceedings against Uber in 2022?
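The loop above relies on top-level `await`, which works in a notebook but not in an ordinary script. One way to run the same loop from a script is to wrap it in a coroutine and start it with `asyncio.run`. The sketch below uses a hypothetical `echo_agent` in place of the real agent so that it runs on its own; with the real agent, one would presumably pass something like `lambda text: agent.run(text, ctx=ctx)` instead, which is an assumption rather than something shown in this tutorial.

```python
import asyncio


async def echo_agent(text):
    """Hypothetical stand-in for `agent.run(text, ctx=ctx)`."""
    return f"you said: {text}"


async def chat_loop(run_agent, read_input=input, write=print):
    """Read user turns until 'exit' (or end of input) and return the turn count."""
    turns = 0
    while True:
        try:
            text_input = read_input("User: ").strip()
        except EOFError:
            break
        if text_input.lower() == "exit":
            break
        if not text_input:
            continue  # ignore empty lines instead of sending them to the agent
        response = await run_agent(text_input)
        write(f"Agent: {response}")
        turns += 1
    return turns


# Scripted input so the sketch runs without a terminal.
scripted = iter(["hello", "", "exit"])
transcript = []
turns = asyncio.run(
    chat_loop(
        echo_agent,
        read_input=lambda prompt: next(scripted),
        write=transcript.append,
    )
)
print(turns, transcript)
```

Passing the input and output functions as parameters keeps the loop testable; in interactive use the defaults (`input` and `print`) apply.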