Prometheus-2 Guide¶
This notebook demonstrates how to use Prometheus 2, an open-source language model built specifically for evaluating other language models.
Paper abstract:¶
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor whose judgements closely mirror those of humans and GPT-4. It handles both direct assessment and pairwise ranking formats, combined with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with human and proprietary LM judges among all tested open evaluator LMs.
Note: The base models used to build Prometheus-2 are Mistral-7B and Mixtral-8x7B.¶
Here we demonstrate how to use Prometheus-2 as the judge for the following LlamaIndex evaluators:
- Pairwise Evaluator - Assesses whether the LLM prefers one of two responses produced by two different query engines.
- Faithfulness Evaluator - Determines whether the answer stays faithful to the retrieved context, i.e., whether it is free of hallucination.
- Correctness Evaluator - Determines whether the generated answer matches the reference answer provided for the query (requires labels).
- Relevancy Evaluator - Evaluates the relevancy of the retrieved context and the response to the query.
If you are unfamiliar with the evaluators above, please refer to our evaluation guide for more information.
The prompts used in this demonstration are partly inspired by / taken from the prometheus-eval repository.
Installation¶
!pip install llama-index
!pip install llama-index-llms-huggingface-api
Set up API keys¶
import os
os.environ["OPENAI_API_KEY"] = "sk-" # OPENAI API KEY
# attach to the same event-loop
import nest_asyncio
nest_asyncio.apply()
from typing import Tuple, Optional
from IPython.display import Markdown, display
Download Data¶
For the demonstration, we will use the PaulGrahamEssay dataset along with a sample query and reference answer.
from llama_index.core.llama_dataset import download_llama_dataset
paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
"PaulGrahamEssayDataset", "./data/paul_graham"
)
Get the query and reference (ground-truth) answer needed for the demonstration.
query = paul_graham_rag_dataset[0].query
reference = paul_graham_rag_dataset[0].reference_answer
Set up the LLM and embedding model.¶
You need to either deploy the model on Hugging Face or load it locally; here we deploy it with HF Inference Endpoints (a local-loading sketch follows the endpoint setup below).
We will use the OpenAI embedding model and LLM to build the index, and the Prometheus LLM for evaluation.
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = "YOUR HF ENDPOINT URL"
prometheus_llm = HuggingFaceInferenceAPI(
model_name=HF_ENDPOINT_URL,
token=HF_TOKEN,
temperature=0.0,
do_sample=True,
top_p=0.95,
top_k=40,
repetition_penalty=1.1,
num_output=1024,
)
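If you would rather load Prometheus-2 locally than call an Inference Endpoint, a minimal sketch along these lines should work. It assumes the llama-index-llms-huggingface package, enough GPU memory for the 7B checkpoint, and the prometheus-eval/prometheus-7b-v2.0 model id; treat it as an illustration rather than the notebook's tested path.
# Hypothetical local alternative to the Inference Endpoint above (not run in this notebook).
# Assumes: `pip install llama-index-llms-huggingface` and a GPU with enough memory for the 7B model.
from llama_index.llms.huggingface import HuggingFaceLLM

prometheus_llm_local = HuggingFaceLLM(
    model_name="prometheus-eval/prometheus-7b-v2.0",  # assumed HF model id
    tokenizer_name="prometheus-eval/prometheus-7b-v2.0",
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.1, "do_sample": True, "top_p": 0.95},
    device_map="auto",
)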
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI()
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Pairwise Evaluation¶
Build two QueryEngines for the pairwise evaluation.¶
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
dataset_path = "./data/paul_graham"
rag_dataset = LabelledRagDataset.from_json(f"{dataset_path}/rag_dataset.json")
documents = SimpleDirectoryReader(
input_dir=f"{dataset_path}/source_files"
).load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine1 = index.as_query_engine(similarity_top_k=1)
query_engine2 = index.as_query_engine(similarity_top_k=2)
response1 = str(query_engine1.query(query))
response2 = str(query_engine2.query(query))
response1
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data.'
response2
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data, leading to a lack of meaningful programming experiences on the IBM 1401.'
ABS_SYSTEM_PROMPT = "You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."
REL_SYSTEM_PROMPT = "You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort."
prometheus_pairwise_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.
###Instruction:
Your task is to compare response A and Response B and give Feedback and score [RESULT] based on Rubric for the following query.
{query}
###Response A:
{answer_1}
###Response B:
{answer_2}
###Score Rubric:
A: If Response A is better than Response B.
B: If Response B is better than Response A.
###Feedback: """
def parser_function(
outputs: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
parts = outputs.split("[RESULT]")
if len(parts) == 2:
feedback, result = parts[0].strip(), parts[1].strip()
if result == "A":
return True, 0.0, feedback
elif result == "B":
return True, 1.0, feedback
return None, None, None
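A quick sanity check of the parser on an illustrative output string (not real model output) shows the convention: a preferred Response B maps to a score of 1.0.
# Illustrative check of the parser; the string below is made up.
sample_output = "Feedback: Response B covers the challenges in more depth. [RESULT] B"
passing, score, feedback = parser_function(sample_output)
print(passing, score)  # True 1.0 -> Response B preferred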
from llama_index.core.evaluation import PairwiseComparisonEvaluator
prometheus_pairwise_evaluator = PairwiseComparisonEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
enforce_consensus=False,
eval_template=REL_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_pairwise_eval_prompt_template,
)
pairwise_result = await prometheus_pairwise_evaluator.aevaluate(
query,
response=response1,
second_response=response2,
)
pairwise_result
EvaluationResult(query='In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.', contexts=None, response="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", passing=True, feedback="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", score=1.0, pairwise_source='original', invalid_result=False, invalid_reason=None)
pairwise_result.score
1.0
display(Markdown(f"<b>{pairwise_result.feedback}</b>"))
Observation:¶
Based on the feedback, the second response is better than the first; per our parser function, this corresponds to a score of 1.0.
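The evaluator above was constructed with enforce_consensus=False, so the verdict comes from a single ordering of the two responses. Pairwise judges can be sensitive to position bias; if you want to guard against that, here is a sketch with consensus enforcement turned back on (at the cost of extra LLM calls, since the flipped ordering is also judged):
# Optional: enforce consensus so the flipped ordering of the responses is judged as well.
consensus_pairwise_evaluator = PairwiseComparisonEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    enforce_consensus=True,
    eval_template=REL_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_pairwise_eval_prompt_template,
)

consensus_result = await consensus_pairwise_evaluator.aevaluate(
    query,
    response=response1,
    second_response=response2,
)
print(consensus_result.score, consensus_result.pairwise_source)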
Correctness Evaluation¶
prometheus_correctness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should only look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.
###Instruction:
Your task is to evaluate the generated answer and reference answer for the following query:
{query}
###Generate answer to evaluate:
{generated_answer}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and reference answer.
Score 2: If the generated answer is according to reference answer but not relevant to user query.
Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
###Feedback:"""
from typing import Tuple
import re
def parser_function(output_str: str) -> Tuple[float, str]:
# Print result to backtrack
display(Markdown(f"<b>{output_str}</b>"))
# Pattern to match the feedback and response
# This pattern looks for any text ending with '[RESULT]' followed by a number
pattern = r"(.+?) \[RESULT\] (\d)"
# Using regex to find all matches
matches = re.findall(pattern, output_str)
# Check if any match is found
if matches:
# Assuming there's only one match in the text, extract feedback and response
feedback, score = matches[0]
score = float(score.strip()) if score is not None else score
return score, feedback.strip()
else:
return None, None
from llama_index.core.evaluation import (
CorrectnessEvaluator,
FaithfulnessEvaluator,
RelevancyEvaluator,
)
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
# CorrectnessEvaluator with Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_correctness_eval_prompt_template,
)
correctness_result = prometheus_correctness_evaluator.evaluate(
query=query,
response=response1,
reference=reference,
)
display(Markdown(f"<b>{correctness_result.score}</b>"))
4.0
display(Markdown(f"<b>{correctness_result.passing}</b>"))
True
display(Markdown(f"<b>{correctness_result.feedback}</b>"))
The generated answer is relevant to the user query and the reference answer, as it correctly identifies the IBM 1401 as the first computer used for programming, an early version of Fortran as the language, and the challenge of limited input options. However, the response lacks the depth and detail found in the reference answer. For example, it does not mention the author's specific age when he started using the IBM 1401, nor does it give an example of the kind of program he could not create due to the lack of input data. These omissions make the response less comprehensive than the reference answer. So, while the generated answer is accurate and relevant, it is not as thorough as the reference answer. So the overall score is 4.
Observation:¶
Based on the feedback, the generated answer is relevant to the user query and matches the key facts of the reference answer, but it is not as complete, hence a score of 4.0. Given the threshold, the answer is still considered passing (True).
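The passing flag comes from comparing the parsed score against the evaluator's score threshold; CorrectnessEvaluator defaults to a threshold of 4.0, which is why a score of 4.0 passes here. If you want a stricter judge, the threshold can be raised at construction time; a sketch (assuming the score_threshold argument available in recent LlamaIndex releases):
# A stricter correctness judge: only a score of 5 counts as passing.
strict_correctness_evaluator = CorrectnessEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    score_threshold=5.0,  # assumed constructor argument; default is 4.0
    eval_template=ABS_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_correctness_eval_prompt_template,
)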
Faithfulness Evaluator¶
prometheus_faithfulness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.
###Information:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the given piece of information is supported by context.
Score NO: If the given piece of information is not supported by context
###Feedback:"""
prometheus_faithfulness_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a information, a context information, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.
###Existing answer:
{existing_answer}
###Information:
{query_str}
###Context:
{context_msg}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the Information is present in the context.
Score NO: If the existing answer is NO and If the Information is not present in the context.
###Feedback: """
# FaithfulnessEvaluator with Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_refine_prompt_template,
)
response_vector = query_engine1.query(query)
faithfulness_result = prometheus_faithfulness_evaluator.evaluate_response(
response=response_vector
)
faithfulness_result.score
1.0
faithfulness_result.passing
True
Observation:¶
The score and passing status indicate that no hallucination was observed.
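evaluate_response above pulls the contexts from the response object's source nodes. If you instead want to check a plain answer string against contexts you supply yourself, the evaluator's evaluate method can be used directly; a sketch reusing the objects defined earlier:
# Faithfulness check of a plain answer string against explicitly supplied contexts.
manual_faithfulness_result = prometheus_faithfulness_evaluator.evaluate(
    query=query,
    response=response1,
    contexts=[node.get_content() for node in response_vector.source_nodes],
)
print(manual_faithfulness_result.passing)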
Relevancy Evaluator¶
prometheus_relevancy_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the response for the query is in line with the context information provided.
Score NO: If the response for the query is not in line with the context information provided.
###Feedback: """
prometheus_relevancy_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the response for the query is in line with the context information provided.
Score NO: If the existing answer is NO and If the response for the query is not in line with the context information provided.
###Feedback: """
# RelevancyEvaluator with Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_refine_prompt_template,
)
relevancy_result = prometheus_relevancy_evaluator.evaluate_response(
query=query, response=response_vector
)
relevancy_result.score
1.0
relevancy_result.passing
True
display(Markdown(f"<b>{relevancy_result.feedback}</b>"))
Observation:¶
The feedback indicates that the response to the query is in line with the provided context information, hence a score of 1.0 and a passing status of True.
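To move beyond a single query, the same evaluators can be run across the whole dataset with BatchEvalRunner; a minimal sketch (restricted to the first few queries to keep the run small):
from llama_index.core.evaluation import BatchEvalRunner

# Run the faithfulness and relevancy evaluators over a handful of dataset queries.
runner = BatchEvalRunner(
    {
        "faithfulness": prometheus_faithfulness_evaluator,
        "relevancy": prometheus_relevancy_evaluator,
    },
    workers=4,
    show_progress=True,
)

queries = [example.query for example in rag_dataset.examples][:5]
eval_results = await runner.aevaluate_queries(query_engine1, queries=queries)

faithfulness_pass_rate = sum(
    result.passing for result in eval_results["faithfulness"]
) / len(eval_results["faithfulness"])
print(f"Faithfulness pass rate: {faithfulness_pass_rate:.2f}")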