Prometheus-2 Guide¶
This notebook demonstrates how to use Prometheus 2, an open-source language model built specifically for evaluating other language models.
Paper abstract:¶
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor whose judgements closely mirror those of humans and GPT-4. It handles both direct assessment and pairwise ranking formats, combined with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with human and proprietary LM judges among all tested open evaluator LMs.
Note: The base models used to build Prometheus-2 are Mistral-7B and Mixtral-8x7B.¶
Here we demonstrate how to use Prometheus-2 as the judge for the following LlamaIndex evaluators:
- Pairwise Evaluator - Assesses whether the LLM prefers one of two responses produced by two different query engines.
- Faithfulness Evaluator - Determines whether the answer stays faithful to the retrieved context, i.e., whether it is free of hallucination.
- Correctness Evaluator - Determines whether the generated answer matches the reference answer provided for the query (requires labels).
- Relevancy Evaluator - Evaluates the relevancy of the retrieved context and the response to the query.
If you are unfamiliar with the evaluators above, please refer to our evaluation guide for more information.
The prompts used in this demonstration are partly inspired by / taken from the prometheus-eval repository.
Installation¶
!pip install llama-index
!pip install llama-index-llms-huggingface-api
Set up API keys¶
import os
os.environ["OPENAI_API_KEY"] = "sk-" # OPENAI API KEY
# attach to the same event-loop
import nest_asyncio
nest_asyncio.apply()
from typing import Tuple, Optional
from IPython.display import Markdown, display
Download Data¶
For the demonstration, we will use the PaulGrahamEssay dataset along with a sample query and reference answer.
from llama_index.core.llama_dataset import download_llama_dataset
paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
"PaulGrahamEssayDataset", "./data/paul_graham"
)
Get the query and reference (ground-truth) answer needed for the demonstration.
query = paul_graham_rag_dataset[0].query
reference = paul_graham_rag_dataset[0].reference_answer
Set up the LLM and embedding model.¶
You need to either deploy the model on Hugging Face or load it locally; here we deploy it with HF Inference Endpoints (a local-loading sketch follows the endpoint setup below).
We will use the OpenAI embedding model and LLM to build the index, and the Prometheus LLM for evaluation.
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = "YOUR HF ENDPOINT URL"
prometheus_llm = HuggingFaceInferenceAPI(
model_name=HF_ENDPOINT_URL,
token=HF_TOKEN,
temperature=0.0,
do_sample=True,
top_p=0.95,
top_k=40,
repetition_penalty=1.1,
num_output=1024,
)
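If you would rather load Prometheus-2 locally than call an Inference Endpoint, a minimal sketch along these lines should work. It assumes the llama-index-llms-huggingface package, enough GPU memory for the 7B checkpoint, and the prometheus-eval/prometheus-7b-v2.0 model id; treat it as an illustration rather than the notebook's tested path.
# Hypothetical local alternative to the Inference Endpoint above (not run in this notebook).
# Assumes: `pip install llama-index-llms-huggingface` and a GPU with enough memory for the 7B model.
from llama_index.llms.huggingface import HuggingFaceLLM

prometheus_llm_local = HuggingFaceLLM(
    model_name="prometheus-eval/prometheus-7b-v2.0",  # assumed HF model id
    tokenizer_name="prometheus-eval/prometheus-7b-v2.0",
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.1, "do_sample": True, "top_p": 0.95},
    device_map="auto",
)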
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI()
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Pairwise Evaluation¶
Build two QueryEngines for the pairwise evaluation.¶
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
dataset_path = "./data/paul_graham"
rag_dataset = LabelledRagDataset.from_json(f"{dataset_path}/rag_dataset.json")
documents = SimpleDirectoryReader(
input_dir=f"{dataset_path}/source_files"
).load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine1 = index.as_query_engine(similarity_top_k=1)
query_engine2 = index.as_query_engine(similarity_top_k=2)
response1 = str(query_engine1.query(query))
response2 = str(query_engine2.query(query))
response1
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data.'
response2
'The author mentions using the IBM 1401 computer for programming in his early experiences. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that required specific input data, leading to a lack of meaningful programming experiences on the IBM 1401.'
ABS_SYSTEM_PROMPT = "You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."
REL_SYSTEM_PROMPT = "You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort."
prometheus_pairwise_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.
###Instruction:
Your task is to compare response A and Response B and give Feedback and score [RESULT] based on Rubric for the following query.
{query}
###Response A:
{answer_1}
###Response B:
{answer_2}
###Score Rubric:
A: If Response A is better than Response B.
B: If Response B is better than Response A.
###Feedback: """
def parser_function(
outputs: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
parts = outputs.split("[RESULT]")
if len(parts) == 2:
feedback, result = parts[0].strip(), parts[1].strip()
if result == "A":
return True, 0.0, feedback
elif result == "B":
return True, 1.0, feedback
return None, None, None
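A quick sanity check of the parser on an illustrative output string (not real model output) shows the convention: a preferred Response B maps to a score of 1.0.
# Illustrative check of the parser; the string below is made up.
sample_output = "Feedback: Response B covers the challenges in more depth. [RESULT] B"
passing, score, feedback = parser_function(sample_output)
print(passing, score)  # True 1.0 -> Response B preferred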
from llama_index.core.evaluation import PairwiseComparisonEvaluator
prometheus_pairwise_evaluator = PairwiseComparisonEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
enforce_consensus=False,
eval_template=REL_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_pairwise_eval_prompt_template,
)
pairwise_result = await prometheus_pairwise_evaluator.aevaluate(
query,
response=response1,
second_response=response2,
)
pairwise_result
EvaluationResult(query='In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.', contexts=None, response="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", passing=True, feedback="\nBoth responses accurately describe the first computer the author used for programming, the language he used, and the challenges he faced. However, Response B provides a more comprehensive understanding of the challenges faced by the author. It not only mentions the limited input options but also connects this limitation to the author's lack of meaningful programming experiences on the IBM 1401. This additional context in Response B enhances the reader's understanding of the author's experiences and the impact of the challenges he faced. Therefore, based on the score rubric, Response B is better than Response A as it offers a more detailed and insightful analysis of the author's early programming experiences. \n[RESULT] B", score=1.0, pairwise_source='original', invalid_result=False, invalid_reason=None)
pairwise_result.score
1.0
display(Markdown(f"<b>{pairwise_result.feedback}</b>"))
Observation:¶
Based on the feedback, the second response is better than the first; per our parser function, this corresponds to a score of 1.0.
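The evaluator above was constructed with enforce_consensus=False, so the verdict comes from a single ordering of the two responses. Pairwise judges can be sensitive to position bias; if you want to guard against that, here is a sketch with consensus enforcement turned back on (at the cost of extra LLM calls, since the flipped ordering is also judged):
# Optional: enforce consensus so the flipped ordering of the responses is judged as well.
consensus_pairwise_evaluator = PairwiseComparisonEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    enforce_consensus=True,
    eval_template=REL_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_pairwise_eval_prompt_template,
)

consensus_result = await consensus_pairwise_evaluator.aevaluate(
    query,
    response=response1,
    second_response=response2,
)
print(consensus_result.score, consensus_result.pairwise_source)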
Correctness Evaluation¶
prometheus_correctness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should only look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.
###Instruction:
Your task is to evaluate the generated answer and reference answer for the following query:
{query}
###Generate answer to evaluate:
{generated_answer}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and reference answer.
Score 2: If the generated answer is according to reference answer but not relevant to user query.
Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
###Feedback:"""
from typing import Tuple
import re
def parser_function(output_str: str) -> Tuple[float, str]:
# Print result to backtrack
display(Markdown(f"<b>{output_str}</b>"))
# Pattern to match the feedback and response
# This pattern looks for any text ending with '[RESULT]' followed by a number
pattern = r"(.+?) \[RESULT\] (\d)"
# Using regex to find all matches
matches = re.findall(pattern, output_str)
# Check if any match is found
if matches:
# Assuming there's only one match in the text, extract feedback and response
feedback, score = matches[0]
score = float(score.strip()) if score is not None else score
return score, feedback.strip()
else:
return None, None
from llama_index.core.evaluation import (
CorrectnessEvaluator,
FaithfulnessEvaluator,
RelevancyEvaluator,
)
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
# CorrectnessEvaluator with Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
llm=prometheus_llm,
parser_function=parser_function,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_correctness_eval_prompt_template,
)
correctness_result = prometheus_correctness_evaluator.evaluate(
query=query,
response=response1,
reference=reference,
)
display(Markdown(f"<b>{correctness_result.score}</b>"))
4.0
display(Markdown(f"<b>{correctness_result.passing}</b>"))
True
display(Markdown(f"<b>{correctness_result.feedback}</b>"))
The generated answer is relevant to the user query and the reference answer, as it correctly identifies the IBM 1401 as the first computer used for programming, an early version of Fortran as the language, and the challenge of limited input options. However, the response lacks the depth and detail found in the reference answer. For example, it does not mention the author's specific age when he started using the IBM 1401, nor does it give an example of the kind of program he could not create due to the lack of input data. These omissions make the response less comprehensive than the reference answer. So, while the generated answer is accurate and relevant, it is not as thorough as the reference answer. So the overall score is 4.
Observation:¶
Based on the feedback, the generated answer is relevant to the user query and matches the key facts of the reference answer, but it is not as complete, hence a score of 4.0. Given the threshold, the answer is still considered passing (True).
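The passing flag comes from comparing the parsed score against the evaluator's score threshold; CorrectnessEvaluator defaults to a threshold of 4.0, which is why a score of 4.0 passes here. If you want a stricter judge, the threshold can be raised at construction time; a sketch (assuming the score_threshold argument available in recent LlamaIndex releases):
# A stricter correctness judge: only a score of 5 counts as passing.
strict_correctness_evaluator = CorrectnessEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    score_threshold=5.0,  # assumed constructor argument; default is 4.0
    eval_template=ABS_SYSTEM_PROMPT
    + "\n\n"
    + prometheus_correctness_eval_prompt_template,
)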
Faithfulness Evaluator¶
prometheus_faithfulness_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.
###Information:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the given piece of information is supported by context.
Score NO: If the given piece of information is not supported by context
###Feedback:"""
prometheus_faithfulness_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a information, a context information, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of information, context information and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.
###Existing answer:
{existing_answer}
###Information:
{query_str}
###Context:
{context_msg}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the Information is present in the context.
Score NO: If the existing answer is NO and If the Information is not present in the context.
###Feedback: """
# FaithfulnessEvaluator with Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_faithfulness_refine_prompt_template,
)
response_vector = query_engine1.query(query)
faithfulness_result = prometheus_faithfulness_evaluator.evaluate_response(
response=response_vector
)
faithfulness_result.score
1.0
faithfulness_result.passing
True
Observation:¶
The score and passing status indicate that no hallucination was observed.
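evaluate_response above pulls the contexts from the response object's source nodes. If you instead want to check a plain answer string against contexts you supply yourself, the evaluator's evaluate method can be used directly; a sketch reusing the objects defined earlier:
# Faithfulness check of a plain answer string against explicitly supplied contexts.
manual_faithfulness_result = prometheus_faithfulness_evaluator.evaluate(
    query=query,
    response=response1,
    contexts=[node.get_content() for node in response_vector.source_nodes],
)
print(manual_faithfulness_result.passing)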
Relevancy Evaluator¶
prometheus_relevancy_eval_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the response for the query is in line with the context information provided.
Score NO: If the response for the query is not in line with the context information provided.
###Feedback: """
prometheus_relevancy_refine_prompt_template = """###Task Description:
An instruction (might include an Input inside it), a query with response, context, an existing answer, and a score rubric representing a evaluation criteria are given.
1. You are provided with evaluation task with the help of a query with response and context and an existing answer.
2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.
###Query and Response:
{query_str}
###Context:
{context_str}
###Score Rubrics:
Score YES: If the existing answer is already YES or If the response for the query is in line with the context information provided.
Score NO: If the existing answer is NO and If the response for the query is not in line with the context information provided.
###Feedback: """
# RelevancyEvaluator with Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
llm=prometheus_llm,
eval_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_eval_prompt_template,
refine_template=ABS_SYSTEM_PROMPT
+ "\n\n"
+ prometheus_relevancy_refine_prompt_template,
)
relevancy_result = prometheus_relevancy_evaluator.evaluate_response(
query=query, response=response_vector
)
relevancy_result.score
1.0
relevancy_result.passing
True
display(Markdown(f"<b>{relevancy_result.feedback}</b>"))
Observation:¶
The feedback indicates that the response to the query is in line with the provided context information, hence a score of 1.0 and a passing status of True.
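To move beyond a single query, the same evaluators can be run across the whole dataset with BatchEvalRunner; a minimal sketch (restricted to the first few queries to keep the run small):
from llama_index.core.evaluation import BatchEvalRunner

# Run the faithfulness and relevancy evaluators over a handful of dataset queries.
runner = BatchEvalRunner(
    {
        "faithfulness": prometheus_faithfulness_evaluator,
        "relevancy": prometheus_relevancy_evaluator,
    },
    workers=4,
    show_progress=True,
)

queries = [example.query for example in rag_dataset.examples][:5]
eval_results = await runner.aevaluate_queries(query_engine1, queries=queries)

faithfulness_pass_rate = sum(
    result.passing for result in eval_results["faithfulness"]
) / len(eval_results["faithfulness"])
print(f"Faithfulness pass rate: {faithfulness_pass_rate:.2f}")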