How to Build a Chatbot¶

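LlamaIndex acts as a bridge between your data and large language models (LLMs), providing a toolkit for building query interfaces over your data for tasks such as question answering and summarization.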

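In this tutorial, we walk through building a context-augmented chatbot using a data agent. This LLM-powered agent can intelligently carry out tasks over your data. The end result is a chatbot agent equipped with a robust set of data interface tools provided by LlamaIndex for answering queries about your data.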

Note: This tutorial builds on initial work on creating a query interface over SEC 10-K filings - see it here.

Context¶

In this guide, we build a "10-K chatbot" that uses the raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions about the 10-K filings.

Preparation¶

%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
%pip install llama-index-question-gen-openai
%pip install unstructured

import os

os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# global defaults
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model_name="text-embedding-3-large")
Settings.chunk_size = 512
Settings.chunk_overlap = 64
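
The last two settings control how documents are split before embedding. As a rough illustration only (not LlamaIndex's actual splitter, which works on tokenizer tokens and tries to respect sentence boundaries), a sliding window with the same numbers behaves as sketched below. The function name `chunk_with_overlap` and the synthetic token list are assumptions made for this example.

```python
def chunk_with_overlap(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reached the end
    return chunks


# Synthetic "tokens" standing in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens)
print([len(c) for c in chunks])
```

Each window repeats the final 64 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.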

Ingest Data¶

First, let's download the raw 10-K files for 2019-2022.

# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data
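
The shell commands above only work inside a notebook. Outside one, the same download-and-extract step could be sketched with the Python standard library, as below. The helper names `download_file` and `extract_archive` are hypothetical; the URL is the one used above.

```python
import urllib.request
import zipfile
from pathlib import Path

UBER_ZIP_URL = "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1"


def download_file(url, zip_path):
    """Download url to zip_path, creating parent directories as needed."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Usage (requires network access):
# zip_path = download_file(UBER_ZIP_URL, "data/UBER.zip")
# extract_archive(zip_path, "data")
```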

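To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can integrate directly with Unstructured and convert any text into a Document format that LlamaIndex can ingest.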

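The required packages (unstructured and llama-index-readers-file) were already installed in the first cell of this tutorial.

We can then use UnstructuredReader to parse the HTML files into a list of Document objects.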

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
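
The loop above assumes that one HTML file per year exists under ./data/UBER. A small pre-flight check, sketched here with hypothetical helper names (`filing_path`, `missing_years`), can report missing filings before any parsing starts.

```python
from pathlib import Path


def filing_path(year, data_dir="./data/UBER"):
    """Return the path where the filing for the given year is expected."""
    return Path(data_dir) / f"UBER_{year}.html"


def missing_years(years, data_dir="./data/UBER"):
    """Return the years whose 10-K HTML file is not present in data_dir."""
    return [y for y in years if not filing_path(y, data_dir).is_file()]


missing = missing_years([2022, 2021, 2020, 2019])
if missing:
    print(f"Missing filings for: {missing}")
```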

Setting Up Vector Indices for Each Year¶

We first set up a vector index for each year. Each vector index lets us ask questions about the 10-K filing for a given year.

We build each index and save it to disk.

# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext


index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

To load an index from disk, do the following:

# Load indices from disk
from llama_index.core import StorageContext, load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
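
Because the build cell should be skipped when the indices are already persisted, it can help to decide per year whether to build or load. The sketch below uses only the standard library and assumes that a non-empty ./storage/{year} directory means the index was persisted; it does not validate the directory contents. The helper names are hypothetical.

```python
from pathlib import Path


def has_persisted_index(year, storage_root="./storage"):
    """True if the per-year storage directory exists and is not empty."""
    year_dir = Path(storage_root) / str(year)
    return year_dir.is_dir() and any(year_dir.iterdir())


def split_years(years, storage_root="./storage"):
    """Partition years into (to_load, to_build) based on what is on disk."""
    to_load = [y for y in years if has_persisted_index(y, storage_root)]
    to_build = [y for y in years if y not in to_load]
    return to_load, to_build


to_load, to_build = split_years([2022, 2021, 2020, 2019])
print(f"load from disk: {to_load}; build from documents: {to_build}")
```

Years in `to_build` would go through the build-and-persist cell, and years in `to_load` through the load cell.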

Setting Up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings¶

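Since we have access to documents from 4 years, we may want to ask not only questions about the 10-K document of a given year, but also questions that require analysis across all 10-K filings.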

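To address this, we can use a Sub Question Query Engine. It decomposes a query into sub-queries, each answered by an individual vector index, and then synthesizes the results to answer the overall query.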
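
The decompose, answer, and synthesize pattern can be illustrated with a toy sketch. This is not the library's implementation: in the real engine an LLM generates the sub-questions and writes the final answer, whereas here both steps are replaced by deterministic stand-ins, and the per-year "engines" are plain functions. All names below are hypothetical.

```python
def decompose(query, years):
    """Stand-in for the LLM step: produce one sub-question per year."""
    return [(year, f"{query} (fiscal year {year})") for year in years]


def answer_with_sub_questions(query, engines):
    """engines maps year -> callable(question) returning an answer string."""
    sub_questions = decompose(query, sorted(engines))
    # Each sub-question is routed to the engine for its own year.
    partial = [(year, engines[year](question)) for year, question in sub_questions]
    # Stand-in for synthesis: concatenate the per-year answers.
    return "\n".join(f"{year}: {answer}" for year, answer in partial)


# Stub engines standing in for the per-year vector index query engines.
engines = {
    2021: lambda question: "answer from the 2021 index",
    2022: lambda question: "answer from the 2022 index",
}
result = answer_with_sub_questions("What were the main risk factors?", engines)
print(result)
```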

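LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First, we define a QueryEngineTool for each vector index. Each tool has a name and a description; the LLM agent uses these to decide which tool to choose.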

from llama_index.core.tools import QueryEngineTool

individual_query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=index_set[year].as_query_engine(),
        name=f"vector_index_{year}",
        description=(
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber"
        ),
    )
    for year in years
]
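
Since the agent picks tools based on their names and descriptions, it is worth checking what the f-strings above actually produce. The sketch below rebuilds the same strings as plain dictionaries (a stand-in for the tool objects, so it runs without LlamaIndex) and prints them; `tool_metadata` is a hypothetical helper.

```python
years = [2022, 2021, 2020, 2019]


def tool_metadata(year):
    """Build the same name and description strings used for each tool."""
    return {
        "name": f"vector_index_{year}",
        "description": (
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber"
        ),
    }


tool_specs = [tool_metadata(year) for year in years]
for spec in tool_specs:
    print(f'{spec["name"]}: {spec["description"]}')
```

Each name is unique and each description names its year, which is what lets the agent route a year-specific question to the right index.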

Now we can create the Sub Question Query Engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools defined above.

from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)

Setting Up the Chatbot Agent¶

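We use a LlamaIndex data agent to set up the outer chatbot agent, which has access to a set of tools. Specifically, we will use a FunctionAgent, which takes advantage of the function-calling support in the OpenAI API. We want to use the separate tools we defined earlier for each index (corresponding to a given year), as well as a tool for the Sub Question Query Engine defined above.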

First, we define a QueryEngineTool for the Sub Question Query Engine:

query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="sub_question_query_engine",
    description=(
        "useful for when you want to answer queries that require analyzing"
        " multiple SEC 10-K documents for Uber"
    ),
)

Then, we combine the tools defined above into a single list of tools for the agent:

tools = individual_query_engine_tools + [query_engine_tool]

Finally, we call FunctionAgent to create the agent, passing in the list of tools defined above.

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))

Testing the Agent¶

We can now test the agent with various queries.

If we test it with a simple greeting, the agent does not use any tools.

from llama_index.core.workflow import Context

# Setup the context for this specific interaction
ctx = Context(agent)

response = await agent.run("hi, i am bob", ctx=ctx)
print(str(response))

Hello Bob! How can I assist you today?

If we test it with a query about the 10-K of a given year, the agent will use the relevant vector index tool.

response = await agent.run(
    "What were some of the biggest risk factors in 2020 for Uber?", ctx=ctx
)
print(str(response))

In 2020, some of the biggest risk factors for Uber included:

1. **Legal and Regulatory Risks**: Extensive government regulation and oversight could adversely impact operations and future prospects.
2. **Data Privacy and Security Risks**: Risks related to data collection, use, and processing could lead to investigations, litigation, and negative publicity.
3. **Economic Impact of COVID-19**: The pandemic adversely affected business operations, demand for services, and financial condition due to governmental restrictions and changes in consumer behavior.
4. **Market Volatility**: Volatility in the market price of common stock could affect investors' ability to resell shares at favorable prices.
5. **Safety Incidents**: Criminal or dangerous activities on the platform could harm the ability to attract and retain drivers and consumers.
6. **Investment Risks**: Substantial investments in new technologies and offerings carry inherent risks, with no guarantee of realizing expected benefits.
7. **Dependence on Metropolitan Areas**: A significant portion of gross bookings comes from large metropolitan areas, which may be negatively impacted by various external factors.
8. **Talent Retention**: Attracting and retaining high-quality personnel is crucial, and issues with attrition or succession planning could adversely affect the business.
9. **Cybersecurity Threats**: Cyberattacks and data breaches could harm reputation and operational results.
10. **Capital Requirements**: The need for additional capital to support growth may not be met on reasonable terms, impacting business expansion.
11. **Acquisition Challenges**: Difficulty in identifying and integrating suitable businesses could harm operating results and future prospects.
12. **Operational Limitations**: Potential restrictions in certain jurisdictions may require modifications to the business model, affecting service delivery.

Finally, if we test it with a query that compares/contrasts risk factors across years, the agent will use the Sub Question Query Engine tool.

cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)

response = await agent.run(cross_query_str, ctx=ctx)
print(str(response))

Here's a comparison of the risk factors for Uber across the years 2020, 2021, and 2022:

- **COVID-19 Impact**:
  - **2020**: The pandemic significantly affected business operations, demand, and financial condition.
  - **2021**: Continued impact of the pandemic was a concern, affecting various parts of the business.
  - **2022**: The pandemic's impact was less emphasized, with more focus on operational and competitive risks.

- **Driver Classification**:
  - **2020**: Not specifically highlighted.
  - **2021**: Potential reclassification of Drivers as employees could alter the business model.
  - **2022**: Continued risk of reclassification impacting operational costs.

- **Competition**:
  - **2020**: Not specifically highlighted.
  - **2021**: Intense competition with low barriers to entry and well-capitalized competitors.
  - **2022**: Competitive landscape challenges due to established alternatives and low barriers to entry.

- **Financial Concerns**:
  - **2020**: Market volatility and capital requirements were major concerns.
  - **2021**: Historical losses and increased operating expenses raised profitability concerns.
  - **2022**: Significant losses and rising expenses continued to raise profitability concerns.

- **User and Personnel Retention**:
  - **2020**: Talent retention was crucial, with risks from attrition.
  - **2021**: Attracting and retaining a critical mass of users and personnel was essential.
  - **2022**: Continued emphasis on retaining Drivers, consumers, and high-quality personnel.

- **Brand and Reputation**:
  - **2020**: Safety incidents and cybersecurity threats could harm reputation.
  - **2021**: Maintaining and enhancing brand reputation was critical, with past negative publicity being a concern.
  - **2022**: Brand and reputation were under scrutiny, with negative media coverage potentially harming prospects.

- **Operational Challenges**:
  - **2020**: Operational limitations and acquisition challenges were highlighted.
  - **2021**: Challenges in managing growth and optimizing organizational structure.
  - **2022**: Historical workplace culture and the need for organizational optimization were critical.

- **Safety and Liability**:
  - **2020**: Safety incidents and liability claims were significant risks.
  - **2021**: Safety incidents and liability claims, especially with vulnerable road users, were concerns.
  - **2022**: Safety incidents and public reporting could impact reputation and financial results.

Overall, while some risk factors remained consistent across the years, such as competition, financial concerns, and safety, the emphasis shifted slightly with the evolving business environment and external factors like the pandemic.

Setting Up the Chatbot Loop¶

Now that we have the chatbot set up, it only takes a few more steps to set up a basic interactive loop to chat with our SEC-augmented chatbot!

agent = FunctionAgent(tools=tools, llm=OpenAI(model="gpt-4o"))
ctx = Context(agent)

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = await agent.run(text_input, ctx=ctx)
    print(f"Agent: {response}")

# User: What were some of the legal proceedings against Uber in 2022?
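
The loop above relies on top-level `await`, which works in a notebook but not in a plain Python script. One way to run the same loop from a script is sketched below, under the assumption that the agent call is wrapped in an async function. Here `echo_agent` is a hypothetical stub standing in for `agent.run(text, ctx=ctx)`, and `chat_loop` is a hypothetical helper that takes its inputs from any iterable so it can be tested without a terminal.

```python
import asyncio


async def chat_loop(run_agent, inputs, exit_word="exit"):
    """Send each input to run_agent until exit_word is seen; return the transcript."""
    transcript = []
    for text_input in inputs:
        if text_input.strip().lower() == exit_word:
            break
        response = await run_agent(text_input)
        print(f"Agent: {response}")
        transcript.append((text_input, str(response)))
    return transcript


async def echo_agent(text):
    # Hypothetical stand-in for: await agent.run(text, ctx=ctx)
    return f"(stub reply to: {text})"


# asyncio.run starts the event loop that a notebook would otherwise provide.
transcript = asyncio.run(chat_loop(echo_agent, ["hi", "exit", "ignored"]))
```

For interactive use from a terminal, the list of inputs could be replaced with `iter(lambda: input("User: "), None)`, which keeps prompting until the user types the exit word.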