Observability with Arize Phoenix - Tracing and Evaluating a LlamaIndex Application
LlamaIndex provides high-level APIs that enable users to build powerful applications in a few lines of code. However, it can be difficult to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications *observable* by visualizing the underlying structure of each call to your query engine and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics.
In this tutorial, you will:
- Build a simple query engine using LlamaIndex that uses retrieval-augmented generation to answer questions over the Paul Graham essay,
- Record trace data in OpenInference tracing format using the global `arize_phoenix` handler,
- Inspect the traces and spans of your application to identify sources of latency and cost,
- Export your trace data as a pandas DataFrame and run an LLM evaluation.

ℹ️ This notebook requires an OpenAI API key.
1. Install Dependencies and Import Libraries¶
Install Phoenix, LlamaIndex, and OpenAI.
!pip install llama-index
!pip install llama-index-callbacks-arize-phoenix
!pip install arize-phoenix[evals]
!pip install "openinference-instrumentation-llama-index>=1.0.0"
import json
import os
from getpass import getpass
from urllib.request import urlopen
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
from llama_index.core import (
Settings,
set_global_handler,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
RelevanceEvaluator,
run_evals,
)
from phoenix.session.evaluation import (
get_qa_with_reference,
get_retrieved_documents,
)
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm
nest_asyncio.apply()
pd.set_option("display.max_colwidth", 1000)
2. Launch Phoenix¶
session = px.launch_app()
🌍 To view the Phoenix app in your browser, visit https://jfgzmj4xrg3-496ff2e9c6d22116-6006-colab.googleusercontent.com/ 📺 To view the Phoenix app in a notebook, run `px.active_session().view()` 📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
3. Configure Your OpenAI API Key¶
Set your OpenAI API key if it is not already set as an environment variable.
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
4. Build Your LlamaIndex Application¶
Download Data¶
!wget -O "paul_graham_essay.txt" "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
--2024-04-26 03:09:56-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘paul_graham_essay.txt’ paul_graham_essay.t 100%[===================>] 73.28K --.-KB/s in 0.01s 2024-04-26 03:09:56 (5.58 MB/s) - ‘paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["paul_graham_essay.txt"]
).load_data()
Set Up Phoenix Tracing¶
Enable Phoenix tracing within LlamaIndex by setting `arize_phoenix` as the global handler. This will mount Phoenix's OpenInferenceTraceCallback as the global handler. Phoenix uses OpenInference traces - an open-source standard for capturing and storing LLM application traces that enables LLM applications to seamlessly integrate with LLM observability solutions such as Phoenix.
set_global_handler("arize_phoenix")
Set Up the LLM and Embedding Model¶
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
embed_model = OpenAIEmbedding()
Settings.llm = llm
Settings.embed_model = embed_model
Create the Index¶
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Create the Query Engine¶
query_engine = index.as_query_engine(similarity_top_k=5)
5. Run Your Query Engine and View Your Traces in Phoenix¶
queries = [
"what did paul graham do growing up?",
"why did paul graham start YC?",
]
for query in tqdm(queries):
query_engine.query(query)
100%|██████████| 2/2 [00:07<00:00, 3.81s/it]
print(query_engine.query("Who is Paul Graham?"))
Paul Graham is a writer, entrepreneur, and investor known for his involvement in various projects and ventures. He has written essays on diverse topics, founded companies like Viaweb and Y Combinator, and has a strong presence in the startup and technology industry.
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")
🚀 Open the Phoenix UI if you haven't already: https://jfgzmj4xrg4-496ff2e9c6d22116-6006-colab.googleusercontent.com/
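Before moving on to evaluation, it can be useful to eyeball which spans dominate latency and token spend. The sketch below works on a toy DataFrame whose columns (`name`, `start_time`, `end_time`, `attributes.llm.token_count.total`) are stand-ins assumed to mirror a Phoenix spans export; the real data would come from Phoenix's client export rather than being constructed by hand.

```python
import pandas as pd

# Toy spans DataFrame; values and column names are illustrative stand-ins
# for the fields a Phoenix spans export typically carries.
spans_df = pd.DataFrame(
    {
        "name": ["query", "retrieve", "llm"],
        "start_time": pd.to_datetime(
            ["2024-04-26 03:10:00", "2024-04-26 03:10:00", "2024-04-26 03:10:01"]
        ),
        "end_time": pd.to_datetime(
            ["2024-04-26 03:10:04", "2024-04-26 03:10:01", "2024-04-26 03:10:04"]
        ),
        "attributes.llm.token_count.total": [float("nan"), float("nan"), 512.0],
    }
)

# Per-span latency in seconds.
spans_df["latency_s"] = (
    spans_df["end_time"] - spans_df["start_time"]
).dt.total_seconds()

# Slowest spans first, plus total token spend across LLM spans.
slowest = spans_df.sort_values("latency_s", ascending=False)
total_tokens = spans_df["attributes.llm.token_count.total"].sum()
print(slowest[["name", "latency_s"]])
print(f"total LLM tokens: {total_tokens:.0f}")
```

The same sort-and-sum pattern applies directly once you swap the toy frame for a real spans export.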
6. Export and Evaluate Your Trace Data¶
You can export your trace data as a pandas DataFrame for further analysis and evaluation.
In this case, we will export our `retriever` spans into two separate DataFrames:
- `queries_df`, in which the retrieved documents for each query are concatenated into a single column,
- `retrieved_documents_df`, in which each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation.
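To make the distinction between the two DataFrames concrete, here is a toy pandas sketch (column names and contents invented for illustration) of the same retrieval result stored both ways:

```python
import pandas as pd

# One row per query, with the retrieved documents held as a list.
df = pd.DataFrame(
    {
        "query": ["why did paul graham start YC?"],
        "retrieved_documents": [
            ["doc about YC", "doc about Viaweb", "doc about essays"]
        ],
    }
)

# queries_df-style: concatenate the documents into a single reference column.
concatenated = df.assign(
    reference=df["retrieved_documents"].apply("\n\n".join)
).drop(columns="retrieved_documents")

# retrieved_documents_df-style: one row per query-document pair.
exploded = df.explode("retrieved_documents").reset_index(drop=True)

print(len(concatenated), len(exploded))  # 1 row vs. 3 rows
```

The concatenated form suits response-level evals (hallucination, Q&A correctness), while the exploded form lets each query-document pair be judged independently.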
This will enable us to compute multiple kinds of evaluations, including:
- Relevance: Are the retrieved documents relevant to the query?
- Q&A Correctness: Are your application's responses grounded in the retrieved context?
- Hallucination: Is your application making up false information?
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())
Next, define your evaluation model and your evaluators.
Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., providing a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations, using our battle-tested evaluation templates.
eval_model = OpenAIModel(
model="gpt-4",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)
hallucination_eval_df, qa_correctness_eval_df = run_evals(
dataframe=queries_df,
evaluators=[hallucination_evaluator, qa_correctness_evaluator],
provide_explanation=True,
)
relevance_eval_df = run_evals(
dataframe=retrieved_documents_df,
evaluators=[relevance_evaluator],
provide_explanation=True,
)[0]
px.Client().log_evaluations(
SpanEvaluations(
eval_name="Hallucination", dataframe=hallucination_eval_df
),
SpanEvaluations(
eval_name="QA Correctness", dataframe=qa_correctness_eval_df
),
DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)
run_evals | | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s
run_evals | | 0/15 (0.0%) | ⏳ 00:00<? | ?it/s
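Beyond inspecting the logged evaluations in the Phoenix UI, a quick way to summarize them is to tabulate the labels in pandas. The sketch below uses a hand-made stand-in for an eval DataFrame with made-up results; the `label` and `score` columns are assumed to match the shape of the DataFrames returned by `run_evals`.

```python
import pandas as pd

# Stand-in for a hallucination eval DataFrame; rows and labels are made up.
hallucination_eval_df = pd.DataFrame(
    {
        "label": ["factual", "factual", "hallucinated"],
        "score": [1, 1, 0],
    }
)

# Tally the labels and compute the share of responses judged factual.
counts = hallucination_eval_df["label"].value_counts()
factual_rate = hallucination_eval_df["score"].mean()
print(counts.to_dict())
print(f"factual rate: {factual_rate:.2f}")
```

The same one-liner summaries apply to the Q&A correctness and relevance DataFrames.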
For more details on Phoenix, LLM tracing, and LLM evals, check out the documentation.