Tonic Validate Evaluators
This notebook shows some basic usage examples of Tonic Validate's RAG metrics within LlamaIndex. To use these evaluators, you need to have tonic_validate installed, which you can do with pip install tonic-validate.
%pip install llama-index-evaluation-tonic-validate
import json
import pandas as pd
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.evaluation.tonic_validate import (
    AnswerConsistencyEvaluator,
    AnswerSimilarityEvaluator,
    AugmentationAccuracyEvaluator,
    AugmentationPrecisionEvaluator,
    RetrievalPrecisionEvaluator,
    TonicValidateEvaluator,
)
Single Question Usage Example
For this example, we have a question with a reference correct answer that does not match the LLM's response. There are two retrieved context chunks, one of which contains the correct answer.
question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
    "Sam Altman is a good founder. He is very smart.",
    "What makes Sam Altman such a good founder is his great force of will.",
]
The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this case, the two answers do not match perfectly, so the answer similarity score is not a perfect 5.
answer_similarity_evaluator = AnswerSimilarityEvaluator()
score = await answer_similarity_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=4.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
The answer consistency score is between 0.0 and 1.0 and measures whether the answer contains information that does not appear in the retrieved context. In this case, the answer does appear in the retrieved context, so the score is 1.
answer_consistency_evaluator = AnswerConsistencyEvaluator()
score = await answer_consistency_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
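To build intuition for the consistency score, here is a toy, hand-labelled sketch of the underlying idea. This is NOT Tonic Validate's actual implementation, which uses an LLM judge to extract claims from the answer and check them against the context; the claims and labels below are supplied by hand for illustration only.

```python
# Toy sketch of answer consistency: the fraction of the answer's claims
# that are supported by the retrieved contexts.
claims = ["He is a good founder", "he is smart"]
supported = [True, True]  # both claims appear in the retrieved contexts above

consistency = sum(supported) / len(claims)
print(consistency)  # 1.0
```

Because every claim is backed by the context, the toy score matches the 1.0 reported above.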
Augmentation accuracy measures the percentage of the retrieved context that appears in the answer. In this case, one of the two retrieved contexts appears in the answer, so this score is 0.5.
augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()
score = await augmentation_accuracy_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
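The arithmetic behind this 0.5 can be sketched with a hand-labelled toy; the real metric asks an LLM judge whether each context's information made it into the answer, so the labels below are illustrative assumptions, not the library's logic.

```python
# Toy sketch of augmentation accuracy: the fraction of retrieved contexts
# whose information made it into the answer.
retrieved_contexts = [
    "Sam Altman is a good founder. He is very smart.",
    "What makes Sam Altman such a good founder is his great force of will.",
]
# Hand-labelled: only the first context ("he is smart") appears in the answer.
used_in_answer = [True, False]

accuracy = sum(used_in_answer) / len(retrieved_contexts)
print(accuracy)  # 0.5
```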
Augmentation precision measures whether the relevant retrieved contexts make it into the answer. Both retrieved contexts are relevant, but only one makes it into the answer, so this score is 0.5.
augmentation_precision_evaluator = AugmentationPrecisionEvaluator()
score = await augmentation_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
Retrieval precision measures the percentage of retrieved contexts that are relevant to answering the question. In this case, both retrieved contexts are relevant to the question, so the score is 1.0.
retrieval_precision_evaluator = RetrievalPrecisionEvaluator()
score = await retrieval_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
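As with the other ratio metrics, the 1.0 here reduces to a simple fraction once relevance has been judged. In Tonic Validate an LLM makes that relevance call per context; the hand-supplied judgments below are an illustrative stand-in.

```python
# Toy sketch of retrieval precision: the fraction of retrieved contexts
# judged relevant to the question (judgments supplied by hand here).
relevance_judgments = [True, True]  # both contexts help answer the question

precision = sum(relevance_judgments) / len(relevance_judgments)
print(precision)  # 1.0
```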
TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
scores.score_dict
{'answer_consistency': 1.0, 'answer_similarity': 4.0, 'augmentation_accuracy': 0.5, 'augmentation_precision': 0.5, 'retrieval_precision': 1.0}
You can also use TonicValidateEvaluator to evaluate multiple queries and responses at once, returning a tonic_validate Run object that can be logged to the Tonic Validate UI (validate.tonic.ai).
To do this, you put the questions, LLM answers, retrieved context lists, and reference answers into lists and call evaluate_run.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate_run(
    [question], [llm_answer], [retrieved_context_list], [reference_answer]
)
scores.run_data[0].scores
{'answer_consistency': 1.0, 'answer_similarity': 3.0, 'augmentation_accuracy': 0.5, 'augmentation_precision': 0.5, 'retrieval_precision': 1.0}
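When evaluating a batch this way, the four inputs are parallel, index-aligned lists with one entry per question. A minimal standalone sketch of assembling such a batch, with made-up placeholder values:

```python
# Minimal sketch: the batched call takes parallel, index-aligned lists.
# All values below are made-up placeholders, not real evaluation data.
questions = ["Q1", "Q2"]
llm_answers = ["A1", "A2"]
retrieved_context_lists = [["ctx1a", "ctx1b"], ["ctx2a"]]  # one list per question
reference_answers = ["R1", "R2"]

# Every list must have one entry per question, in the same order.
assert (
    len(questions)
    == len(llm_answers)
    == len(retrieved_context_lists)
    == len(reference_answers)
)
```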
Labelled RAG Dataset Example
Let's use the dataset EvaluatingLlmSurveyPaperDataset and evaluate the default LlamaIndex RAG system with Tonic Validate's answer similarity score. EvaluatingLlmSurveyPaperDataset is a LabelledRagDataset, so it contains a reference correct answer for each question. The dataset contains 276 questions and reference answers about the paper Evaluating Large Language Models: A Comprehensive Survey.
We'll use TonicValidateEvaluator with the answer similarity score metric to evaluate the responses from the default RAG system on this dataset.
!llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import VectorStoreIndex
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data(
    num_workers=4
)  # parallel loading
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
predictions_dataset = rag_dataset.make_predictions_with(query_engine)
questions, retrieved_context_lists, reference_answers, llm_answers = zip(
    *[
        (e.query, e.reference_contexts, e.reference_answer, p.response)
        for e, p in zip(rag_dataset.examples, predictions_dataset.predictions)
    ]
)
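The zip(*...) idiom above transposes a list of per-example tuples into per-field tuples. A small standalone illustration of the same pattern, with placeholder values:

```python
# Transpose a list of (question, answer) pairs into separate per-field tuples,
# the same zip(*...) idiom used above with the dataset examples.
pairs = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
questions, answers = zip(*pairs)

print(questions)  # ('q1', 'q2', 'q3')
print(answers)    # ('a1', 'a2', 'a3')
```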
from tonic_validate.metrics import AnswerSimilarityMetric
tonic_validate_evaluator = TonicValidateEvaluator(
    metrics=[AnswerSimilarityMetric()], model_evaluator="gpt-4-1106-preview"
)
scores = await tonic_validate_evaluator.aevaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
overall_scores gives the average scores over the 276 questions in the dataset.
scores.overall_scores
{'answer_similarity': 2.2644927536231885}
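The overall score is just the mean of the per-question scores held in run_data. A standalone sketch of that averaging, using made-up scores rather than the real run's data:

```python
# Toy illustration: averaging per-question scores into an overall score,
# mirroring what overall_scores reports. Scores below are made up.
run_scores = [{"answer_similarity": s} for s in [0.0, 2.0, 4.0, 3.0]]

overall = sum(r["answer_similarity"] for r in run_scores) / len(run_scores)
print(overall)  # 2.25
```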
Using pandas and matplotlib, we can plot a histogram of the similarity scores.
import matplotlib.pyplot as plt
import pandas as pd
score_list = [x.scores["answer_similarity"] for x in scores.run_data]
value_counts = pd.Series(score_list).value_counts()
fig, ax = plt.subplots()
ax.bar(list(value_counts.index), list(value_counts))
ax.set_title("Answer Similarity Score Value Counts")
plt.show()
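The same value counts that drive the bar chart can be inspected directly without plotting, using only the standard library; made-up scores stand in for the real run's data here:

```python
# Count how often each discrete similarity score occurs, the same
# tally the bar chart above visualizes. Scores below are made up.
from collections import Counter

score_list = [0.0, 0.0, 2.0, 4.0, 0.0, 3.0]
value_counts = Counter(score_list)

print(value_counts.most_common(1))  # [(0.0, 3)]
```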
Since 0 is the most common score, there is much room for improvement. This makes sense, as we are using the default parameters. We could improve these results by tuning the many possible RAG parameters to optimize this score.