Tonic Validate#
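What is Tonic Validate#
Tonic Validate is a tool for people developing retrieval-augmented generation (RAG) systems to evaluate the performance of their system. You can use Tonic Validate for one-off spot checks of your LlamaIndex setup's performance, or you can run it inside an existing CI/CD system such as GitHub Actions. Tonic Validate has two parts:
- Open-source SDK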
- Web UI
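You can use the SDK without the Web UI if you prefer. The SDK contains all of the tools needed to evaluate your RAG system. The Web UI adds a layer on top of the SDK for visualizing your results, which gives you a better sense of your system's performance than looking at raw numbers alone.
If you want to use the Web UI, you can sign up for a free account.
How to use Tonic Validate#
Setting up Tonic Validate#
You can install Tonic Validate with the following command: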
pip install tonic-validate
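To use Tonic Validate, you need to provide an OpenAI key, because the score calculations use an LLM on the backend. You can provide the key by setting the OPENAI_API_KEY environment variable to your OpenAI API key.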
import os
os.environ["OPENAI_API_KEY"] = "put-your-openai-api-key-here"
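If you plan to upload your results to the UI, make sure to set the Tonic Validate API key that you received when you set up your Web UI account. If you have not yet set up a Web UI account, you can set one up first. Once you have the API key, you can set it through the TONIC_VALIDATE_API_KEY environment variable.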
import os
os.environ["TONIC_VALIDATE_API_KEY"] = "put-your-validate-api-key-here"
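Because both keys are read from environment variables, it can help to confirm they are present before starting an evaluation. The following is a minimal sketch; `missing_keys` is a hypothetical helper written for this example and is not part of the Tonic Validate SDK.

```python
import os


# Hypothetical helper, not part of tonic_validate: report which required
# environment variables are unset or empty.
def missing_keys(env, upload_to_ui=False):
    required = ["OPENAI_API_KEY"]
    if upload_to_ui:
        # Only needed when results are uploaded to the Web UI.
        required.append("TONIC_VALIDATE_API_KEY")
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ, upload_to_ui=True)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment in as an argument keeps the check easy to exercise with a plain dictionary.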
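Single question usage example#
This example uses a question whose reference (correct) answer does not match the LLM's response. There are two retrieved context chunks, one of which contains the correct answer.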
question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
"Sam Altman is a good founder. He is very smart.",
"What makes Sam Altman such a good founder is his great force of will.",
]
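The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this example the two do not match exactly, so the answer similarity score is not a perfect 5.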
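# Evaluators used in this section.
from llama_index.evaluation.tonic_validate import (
    AnswerConsistencyEvaluator,
    AnswerSimilarityEvaluator,
    AugmentationAccuracyEvaluator,
    AugmentationPrecisionEvaluator,
    RetrievalPrecisionEvaluator,
    TonicValidateEvaluator,
)

answer_similarity_evaluator = AnswerSimilarityEvaluator()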
score = await answer_similarity_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=4.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
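The snippets in this section call `aevaluate` with a top-level `await`, which works in a notebook. In a plain Python script, a coroutine is usually driven with `asyncio.run` instead. The sketch below shows that pattern; `fake_aevaluate` is a hypothetical stand-in that returns a fixed score so the example runs without an API key, and it is not the real evaluator.

```python
import asyncio


# Hypothetical stand-in for an evaluator's aevaluate method, so that the
# sketch runs without an API key. It returns a fixed score.
async def fake_aevaluate(question, llm_answer, contexts):
    await asyncio.sleep(0)  # yield to the event loop, as a real call would
    return 4.0


async def main():
    return await fake_aevaluate(
        "What makes Sam Altman a good founder?",
        "He is a good founder because he is smart.",
        ["Sam Altman is a good founder. He is very smart."],
    )


# asyncio.run creates an event loop, runs main() to completion, and closes it.
score = asyncio.run(main())
print(score)
```

In a script, the real `aevaluate` call would take the place of `fake_aevaluate` inside `main`.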
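The answer consistency score is between 0.0 and 1.0 and measures whether the answer contains information that does not appear in the retrieved context. In this example, everything in the answer does appear in the retrieved context, so the score is 1.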
answer_consistency_evaluator = AnswerConsistencyEvaluator()
score = await answer_consistency_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
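Augmentation accuracy measures the percentage of the retrieved context that appears in the answer. In this example, one of the two retrieved contexts appears in the answer, so the score is 0.5.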
augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()
score = await augmentation_accuracy_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
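Augmentation precision measures whether the relevant retrieved context makes it into the answer. Both retrieved contexts are relevant, but only one makes it into the answer, so the score is 0.5.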
augmentation_precision_evaluator = AugmentationPrecisionEvaluator()
score = await augmentation_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
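Retrieval precision measures the percentage of the retrieved context that is relevant to answering the question. In this example, both retrieved contexts are relevant to answering the question, so the score is 1.0.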
retrieval_precision_evaluator = RetrievalPrecisionEvaluator()
score = await retrieval_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
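The three context-based scores above can be read as simple ratios. The sketch below reproduces the arithmetic for this example under that assumption. The relevance and usage flags are labelled by hand here, whereas Tonic Validate relies on an LLM to make those judgements, so this illustrates the numbers and is not the library's implementation.

```python
# Hand-labelled judgements for the two retrieved contexts in the example.
# In Tonic Validate an LLM makes these judgements; here they are hard-coded.
contexts = [
    {"relevant": True, "used_in_answer": True},   # "... He is very smart."
    {"relevant": True, "used_in_answer": False},  # "... great force of will."
]

relevant = [c for c in contexts if c["relevant"]]

# Share of all retrieved contexts that are relevant to the question.
retrieval_precision = len(relevant) / len(contexts)

# Share of all retrieved contexts that show up in the answer.
augmentation_accuracy = sum(c["used_in_answer"] for c in contexts) / len(contexts)

# Share of relevant contexts that show up in the answer.
augmentation_precision = sum(c["used_in_answer"] for c in relevant) / len(relevant)

print(retrieval_precision, augmentation_accuracy, augmentation_precision)
```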
The TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
print(scores.score_dict)
# >> {
# 'answer_consistency': 1.0,
# 'answer_similarity': 4.0,
# 'augmentation_accuracy': 0.5,
# 'augmentation_precision': 0.5,
# 'retrieval_precision': 1.0
# }
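Since Tonic Validate can run inside a CI/CD system, a score dictionary like the one above could be compared against minimum values to fail a build. The following is a sketch under stated assumptions: the thresholds and the `failing_metrics` helper are hypothetical and are not provided by the SDK.

```python
# Scores copied from the example output above.
score_dict = {
    "answer_consistency": 1.0,
    "answer_similarity": 4.0,
    "augmentation_accuracy": 0.5,
    "augmentation_precision": 0.5,
    "retrieval_precision": 1.0,
}

# Hypothetical minimum acceptable values; choose these for your own system.
thresholds = {
    "answer_consistency": 0.9,
    "answer_similarity": 3.5,
    "retrieval_precision": 0.8,
}


def failing_metrics(scores, minimums):
    """Return the metrics whose score is below its minimum (or missing)."""
    return {
        name: scores.get(name)
        for name, minimum in minimums.items()
        if scores.get(name) is None or scores[name] < minimum
    }


failures = failing_metrics(score_dict, thresholds)
if failures:
    # A non-zero exit status is what makes a CI job fail.
    raise SystemExit(f"Metrics below threshold: {failures}")
print("All thresholds met")
```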
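Evaluating multiple questions at once#
You can also use TonicValidateEvaluator to evaluate multiple queries and responses at once. This returns a tonic_validate Run object that can be logged to the Tonic Validate UI.
To do this, put the questions, LLM answers, retrieved context lists, and reference answers into lists and call evaluate_run (or its async counterpart, aevaluate_run, which is used below).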
questions = ["What is the capital of France?", "What is the capital of Spain?"]
reference_answers = ["Paris", "Madrid"]
llm_answers = ["Paris", "Madrid"]
retrieved_context_lists = [
[
"Paris is the capital and most populous city of France.",
"Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture.",
],
[
"Madrid is the capital and largest city of Spain.",
"Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro.",
],
]
tonic_validate_evaluator = TonicValidateEvaluator()
# Pass the lists directly; each list has one entry per question.
scores = await tonic_validate_evaluator.aevaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
print(scores.run_data[0].scores)
# >> {
# 'answer_consistency': 1.0,
# 'answer_similarity': 3.0,
# 'augmentation_accuracy': 0.5,
# 'augmentation_precision': 0.5,
# 'retrieval_precision': 1.0
# }
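A batch evaluation relies on the four lists being parallel, with one entry per question in each. The sketch below checks this before any LLM calls are made; `check_parallel` is a hypothetical helper written for this example and is not part of the SDK.

```python
def check_parallel(**named_lists):
    """Raise ValueError unless every list has the same length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"List lengths differ: {lengths}")
    # Return the shared length (0 if no lists were given).
    return next(iter(lengths.values()), 0)


questions = ["What is the capital of France?", "What is the capital of Spain?"]
reference_answers = ["Paris", "Madrid"]
llm_answers = ["Paris", "Madrid"]
retrieved_context_lists = [
    ["Paris is the capital and most populous city of France."],
    ["Madrid is the capital and largest city of Spain."],
]

n = check_parallel(
    questions=questions,
    llm_answers=llm_answers,
    retrieved_context_lists=retrieved_context_lists,
    reference_answers=reference_answers,
)
print(f"{n} questions ready for evaluation")
```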
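Uploading results to the UI#
If you want to upload your scores to the UI, you can use the Tonic Validate API. Before doing so, make sure TONIC_VALIDATE_API_KEY is set as described in the Setting up Tonic Validate section. You also need to have created a project in the Tonic Validate UI and copied its project ID. Once the API key and project are set up, you can initialize the Validate API and upload the results.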
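from tonic_validate import ValidateApi

# Reads the API key from the TONIC_VALIDATE_API_KEY environment variable.
validate_api = ValidateApi()
project_id = "your-project-id"
validate_api.upload_run(project_id, scores)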
You can now view your results in the Tonic Validate UI!
End-to-end example#
Here we show how to use Tonic Validate end to end with LlamaIndex. First, use the LlamaIndex CLI to download a dataset for LlamaIndex to run on.
llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data
Now create a Python file named llama.py and put the following code in it.
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
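This code loads the dataset files, builds a vector index over them, and creates a query engine.
The LlamaIndex CLI also downloads a list of questions and answers that you can use to test against the example dataset. If you want to use those questions and answers, you can use the following code.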
from llama_index.core.llama_dataset import LabelledRagDataset
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
# We are only going to do 10 questions as running through the full data set takes too long
questions = [item.query for item in rag_dataset.examples][:10]
reference_answers = [item.reference_answer for item in rag_dataset.examples][
:10
]
Now we can query LlamaIndex for responses.
llm_answers = []
retrieved_context_lists = []
for question in questions:
    response = query_engine.query(question)
    # Keep the text of each source node as the retrieved context.
    context_list = [x.text for x in response.source_nodes]
    retrieved_context_lists.append(context_list)
    llm_answers.append(response.response)
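To see the shape of the data this loop collects without building an index or calling an LLM, the loop can be dry-run against a stand-in query engine. The sketch below is for illustration only: `FakeNode`, `FakeResponse`, and `FakeQueryEngine` are hypothetical classes that mimic just the attributes the loop reads (`response`, `source_nodes`, and `text`), and they are not LlamaIndex classes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    text: str


@dataclass
class FakeResponse:
    response: str
    source_nodes: list = field(default_factory=list)


class FakeQueryEngine:
    """Hypothetical stand-in that returns canned answers and contexts."""

    def query(self, question):
        return FakeResponse(
            response=f"answer to: {question}",
            source_nodes=[FakeNode("context A"), FakeNode("context B")],
        )


query_engine = FakeQueryEngine()
questions = ["q1", "q2"]

llm_answers = []
retrieved_context_lists = []
for question in questions:
    response = query_engine.query(question)
    retrieved_context_lists.append([node.text for node in response.source_nodes])
    llm_answers.append(response.response)

print(llm_answers)
print(retrieved_context_lists)
```

The result is one answer string and one list of context strings per question, which is the shape the evaluator consumes in the next step.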
To score the responses, we can do the following:
from tonic_validate.metrics import AnswerSimilarityMetric
from llama_index.evaluation.tonic_validate import TonicValidateEvaluator
tonic_validate_evaluator = TonicValidateEvaluator(
metrics=[AnswerSimilarityMetric()], model_evaluator="gpt-4-1106-preview"
)
# Argument order: questions, LLM answers, retrieved contexts, reference answers.
scores = tonic_validate_evaluator.evaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
print(scores.overall_scores)
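If you want to upload your scores to the UI, you can use the Tonic Validate API. Before doing so, make sure TONIC_VALIDATE_API_KEY is set as described in the Setting up Tonic Validate section. You also need to have created a project in the Tonic Validate UI and copied its project ID. Once the API key and project are set up, you can initialize the Validate API and upload the results.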
from tonic_validate import ValidateApi

validate_api = ValidateApi()
project_id = "your-project-id"
# Upload the Run object returned by evaluate_run above.
validate_api.upload_run(project_id, scores)
In addition to the documentation here, you can visit Tonic Validate's GitHub page for more documentation on how to interact with our API to upload results.