Observability with Arize Phoenix - Tracing and Evaluating a LlamaIndex Application
LlamaIndex provides high-level APIs that enable users to build powerful applications in a few lines of code. However, it can be difficult to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications *observable* by visualizing the underlying structure of each call to your query engine and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics.
In this tutorial, you will:
- Build a simple query engine using LlamaIndex that uses retrieval-augmented generation to answer questions over the Paul Graham essay,
- Record trace data in OpenInference tracing format using the global `arize_phoenix` handler,
- Inspect the traces and spans of your application to identify sources of latency and cost,
- Export your trace data as a pandas DataFrame and run an LLM evaluation.

ℹ️ This notebook requires an OpenAI API key.
1. Install Dependencies and Import Libraries¶
Install Phoenix, LlamaIndex, and OpenAI.
!pip install llama-index
!pip install llama-index-callbacks-arize-phoenix
!pip install arize-phoenix[evals]
!pip install "openinference-instrumentation-llama-index>=1.0.0"
import json
import os
from getpass import getpass
from urllib.request import urlopen
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
from llama_index.core import (
Settings,
set_global_handler,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
RelevanceEvaluator,
run_evals,
)
from phoenix.session.evaluation import (
get_qa_with_reference,
get_retrieved_documents,
)
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm
nest_asyncio.apply()
pd.set_option("display.max_colwidth", 1000)
2. Launch Phoenix¶
session = px.launch_app()
🌍 To view the Phoenix app in your browser, visit https://jfgzmj4xrg3-496ff2e9c6d22116-6006-colab.googleusercontent.com/ 📺 To view the Phoenix app in a notebook, run `px.active_session().view()` 📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
3. Configure Your OpenAI API Key¶
Set your OpenAI API key if it is not already set as an environment variable.
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
4. Build Your LlamaIndex Application¶
Download Data¶
!wget -O "paul_graham_essay.txt" "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
--2024-04-26 03:09:56-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘paul_graham_essay.txt’ paul_graham_essay.t 100%[===================>] 73.28K --.-KB/s in 0.01s 2024-04-26 03:09:56 (5.58 MB/s) - ‘paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["paul_graham_essay.txt"]
).load_data()
Set Up Phoenix Tracing¶
Enable Phoenix tracing within LlamaIndex by setting `arize_phoenix` as the global handler. This will mount Phoenix's OpenInferenceTraceCallback as the global handler. Phoenix uses OpenInference traces - an open-source standard for capturing and storing LLM application traces that enables LLM applications to seamlessly integrate with LLM observability solutions such as Phoenix.
set_global_handler("arize_phoenix")
Set Up the LLM and Embedding Model¶
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)
embed_model = OpenAIEmbedding()
Settings.llm = llm
Settings.embed_model = embed_model
Create the Index¶
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Create the Query Engine¶
query_engine = index.as_query_engine(similarity_top_k=5)
5. Run Your Query Engine and View Your Traces in Phoenix¶
queries = [
"what did paul graham do growing up?",
"why did paul graham start YC?",
]
for query in tqdm(queries):
query_engine.query(query)
100%|██████████| 2/2 [00:07<00:00, 3.81s/it]
print(query_engine.query("Who is Paul Graham?"))
Paul Graham is a writer, entrepreneur, and investor known for his involvement in various projects and ventures. He has written essays on diverse topics, founded companies like Viaweb and Y Combinator, and has a strong presence in the startup and technology industry.
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")
🚀 Open the Phoenix UI if you haven't already: https://jfgzmj4xrg4-496ff2e9c6d22116-6006-colab.googleusercontent.com/
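Before moving on to evaluation, it can be useful to eyeball which spans dominate latency and token spend. The sketch below works on a toy DataFrame whose columns (`name`, `start_time`, `end_time`, `attributes.llm.token_count.total`) are stand-ins assumed to mirror a Phoenix spans export; the real data would come from Phoenix's client export rather than being constructed by hand.

```python
import pandas as pd

# Toy spans DataFrame; values and column names are illustrative stand-ins
# for the fields a Phoenix spans export typically carries.
spans_df = pd.DataFrame(
    {
        "name": ["query", "retrieve", "llm"],
        "start_time": pd.to_datetime(
            ["2024-04-26 03:10:00", "2024-04-26 03:10:00", "2024-04-26 03:10:01"]
        ),
        "end_time": pd.to_datetime(
            ["2024-04-26 03:10:04", "2024-04-26 03:10:01", "2024-04-26 03:10:04"]
        ),
        "attributes.llm.token_count.total": [float("nan"), float("nan"), 512.0],
    }
)

# Per-span latency in seconds.
spans_df["latency_s"] = (
    spans_df["end_time"] - spans_df["start_time"]
).dt.total_seconds()

# Slowest spans first, plus total token spend across LLM spans.
slowest = spans_df.sort_values("latency_s", ascending=False)
total_tokens = spans_df["attributes.llm.token_count.total"].sum()
print(slowest[["name", "latency_s"]])
print(f"total LLM tokens: {total_tokens:.0f}")
```

The same sort-and-sum pattern applies directly once you swap the toy frame for a real spans export.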
6. Export and Evaluate Your Trace Data¶
You can export your trace data as a pandas DataFrame for further analysis and evaluation.
In this case, we will export our `retriever` spans into two separate DataFrames:
- `queries_df`, in which the retrieved documents for each query are concatenated into a single column,
- `retrieved_documents_df`, in which each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation.
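To make the distinction between the two DataFrames concrete, here is a toy pandas sketch (column names and contents invented for illustration) of the same retrieval result stored both ways:

```python
import pandas as pd

# One row per query, with the retrieved documents held as a list.
df = pd.DataFrame(
    {
        "query": ["why did paul graham start YC?"],
        "retrieved_documents": [
            ["doc about YC", "doc about Viaweb", "doc about essays"]
        ],
    }
)

# queries_df-style: concatenate the documents into a single reference column.
concatenated = df.assign(
    reference=df["retrieved_documents"].apply("\n\n".join)
).drop(columns="retrieved_documents")

# retrieved_documents_df-style: one row per query-document pair.
exploded = df.explode("retrieved_documents").reset_index(drop=True)

print(len(concatenated), len(exploded))  # 1 row vs. 3 rows
```

The concatenated form suits response-level evals (hallucination, Q&A correctness), while the exploded form lets each query-document pair be judged independently.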
This will enable us to compute multiple kinds of evaluations, including:
- Relevance: Are the retrieved documents relevant to the query?
- Q&A Correctness: Are your application's responses grounded in the retrieved context?
- Hallucination: Is your application making up false information?
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())
Next, define your evaluation model and your evaluators.
Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., providing a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations, using our battle-tested evaluation templates.
eval_model = OpenAIModel(
model="gpt-4",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)
hallucination_eval_df, qa_correctness_eval_df = run_evals(
dataframe=queries_df,
evaluators=[hallucination_evaluator, qa_correctness_evaluator],
provide_explanation=True,
)
relevance_eval_df = run_evals(
dataframe=retrieved_documents_df,
evaluators=[relevance_evaluator],
provide_explanation=True,
)[0]
px.Client().log_evaluations(
SpanEvaluations(
eval_name="Hallucination", dataframe=hallucination_eval_df
),
SpanEvaluations(
eval_name="QA Correctness", dataframe=qa_correctness_eval_df
),
DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)
run_evals | | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s
run_evals | | 0/15 (0.0%) | ⏳ 00:00<? | ?it/s
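Beyond inspecting the logged evaluations in the Phoenix UI, a quick way to summarize them is to tabulate the labels in pandas. The sketch below uses a hand-made stand-in for an eval DataFrame with made-up results; the `label` and `score` columns are assumed to match the shape of the DataFrames returned by `run_evals`.

```python
import pandas as pd

# Stand-in for a hallucination eval DataFrame; rows and labels are made up.
hallucination_eval_df = pd.DataFrame(
    {
        "label": ["factual", "factual", "hallucinated"],
        "score": [1, 1, 0],
    }
)

# Tally the labels and compute the share of responses judged factual.
counts = hallucination_eval_df["label"].value_counts()
factual_rate = hallucination_eval_df["score"].mean()
print(counts.to_dict())
print(f"factual rate: {factual_rate:.2f}")
```

The same one-liner summaries apply to the Q&A correctness and relevance DataFrames.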
For more details on Phoenix, LLM tracing, and LLM evals, check out the documentation.