使用 Prometheus 模型进行评估¶

评估是迭代 RAG（检索增强生成）管道的关键方面。此过程严重依赖于 GPT-4。然而，最近出现了一个名为 Prometheus 的新开源模型，作为评估目的的替代方案。

在本笔记本中，我们将演示如何利用 Prometheus 模型进行评估，并将其与 LlamaIndex 抽象集成。

如果您不熟悉 Prometheus 模型，可能会发现 Andrei 准备的论文摘要很有帮助。需要注意的是，为了有效评估，该模型要求在提示中包含评分标准分数。有关更多详细信息，您可以参考笔记本中概述的具体提示。

Prometheus Paper Card

我们将使用 Llama Datasets 中的两个数据集，演示如何使用 Prometheus 模型进行正确性评估。如果您还没有探索过 Llama Datasets，我建议您花点时间在此处阅读相关信息。

Paul Graham 文章
Llama2

注意：此处展示的是用于分析的原始 Prometheus 模型。您可以使用模型的量化版本重新运行分析。¶

In [ ]

已复制！

%pip install llama-index-llms-openai
%pip install llama-index-llms-huggingface-api
%pip install llama-index-llms-openai %pip install llama-index-llms-huggingface-api

In [ ]

已复制！

# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()
# attach to the same event-loop import nest_asyncio nest_asyncio.apply()

下载数据集¶

In [ ]

已复制！

from llama_index.core.llama_dataset import download_llama_dataset

paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data/paul_graham"
)

llama2_rag_dataset, llama2_documents = download_llama_dataset(
    "Llama2PaperDataset", "./data/llama2"
)
from llama_index.core.llama_dataset import download_llama_dataset paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset( "PaulGrahamEssayDataset", "./data/paul_graham" ) llama2_rag_dataset, llama2_documents = download_llama_dataset( "Llama2PaperDataset", "./data/llama2" )

定义托管在 HuggingFace 上的 Prometheus LLM。¶

我们使用 Nvidia A10G GPU 将模型托管在 HF 推理端点上。

In [ ]

已复制！





from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = (
    "https://q3yljc2cypyrvw3i.us-east-1.aws.endpoints.huggingface.cloud"
)

prometheus_llm = HuggingFaceInferenceAPI(
    model_name=HF_ENDPOINT_URL,
    token=HF_TOKEN,
    temperature=0.1,
    do_sample=True,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI HF_TOKEN = "YOUR HF TOKEN" HF_ENDPOINT_URL = ( "https://q3yljc2cypyrvw3i.us-east-1.aws.endpoints.huggingface.cloud" ) prometheus_llm = HuggingFaceInferenceAPI( model_name=HF_ENDPOINT_URL, token=HF_TOKEN, temperature=0.1, do_sample=True, top_p=0.95, top_k=40, repetition_penalty=1.1, )

/opt/homebrew/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

提示模板。¶

我们将对 Prometheus 模型和 GPT-4 使用相同的提示，以进行一致的性能比较。

正确性评估提示¶

In [ ]

已复制！





prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given. 
			1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general. 
			2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric. 
			3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)" 
			4. Please do not generate any other opening, closing, and explanations. 
            5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.

			###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}
			
            ###Generate answer to evaluate: {generated_answer} 

            ###Reference Answer (Score 5): {reference_answer}
            
    		###Score Rubrics: 
            Score 1: If the generated answer is not relevant to the user query and reference answer.
            Score 2: If the generated answer is according to reference answer but not relevant to user query.
            Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
    		Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
            Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
    
    		###Feedback:"""
prometheus_correctness_eval_prompt_template = """###任务描述：给定一条指令（可能包含一个输入）、一个查询、一个待评估的响应、一个得分为 5 的参考答案以及一个代表评估标准的评分规则。 1. 严格根据给定的评分规则，而不是一般性地，撰写详细的反馈来评估响应的质量。 2. 写完反馈后，写一个分数，可以是 1、2、3、4 或 5。您应该参考评分规则。 3. 输出格式应如下所示：“反馈：(根据标准撰写反馈) [结果] (1、2、3、4 或 5)” 4. 请不要生成任何其他开头、结尾和解释。 5. 仅评估生成答案和参考答案之间的共同点。不要评估参考答案中有而生成答案中没有的内容。 ###要评估的指令：您的任务是评估针对查询 {query} 的生成答案和参考答案。 ###要评估的生成答案：{generated_answer} ###参考答案（得分 5）：{reference_answer} ###评分规则： 得分 1：如果生成的答案与用户查询和参考答案不相关。 得分 2：如果生成的答案符合参考答案，但与用户查询不相关。 得分 3：如果生成的答案与用户查询和参考答案相关，但包含错误。 得分 4：如果生成的答案与用户查询相关，并且与参考答案具有完全相同的指标，但不够简洁。 得分 5：如果生成的答案与用户查询相关，并且根据参考答案完全正确。 ###反馈："""

In [ ]

已复制！





prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given. 
			1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general. 
			2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric. 
			3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)" 
			4. Please do not generate any other opening, closing, and explanations. 
            5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.

			###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}
			
            ###Generate answer to evaluate: {generated_answer} 

            ###Reference Answer (Score 5): {reference_answer}
            
    		###Score Rubrics: 
            Score 1: If the generated answer is not relevant to the user query and reference answer.
            Score 2: If the generated answer is correct according to reference answer but not relevant to user query.
            Score 3: If the generated answer is relevant to the user query and correct according to reference answer but has some mistakes in facts.
    		Score 4: If the generated answer is relevant to the user query and has the exact same metrics and correct as the reference answer, but it is not as concise.
            Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.
    
    		###Feedback:"""
prometheus_correctness_eval_prompt_template = """###任务描述：给定一条指令（可能包含一个输入）、一个查询、一个待评估的响应、一个得分为 5 的参考答案以及一个代表评估标准的评分规则。 1. 严格根据给定的评分规则，而不是一般性地，撰写详细的反馈来评估响应的质量。 2. 写完反馈后，写一个分数，可以是 1、2、3、4 或 5。您应该参考评分规则。 3. 输出格式应如下所示：“反馈：(根据标准撰写反馈) [结果] (1、2、3、4 或 5)” 4. 请不要生成任何其他开头、结尾和解释。 5. 仅评估生成答案和参考答案之间的共同点。不要评估参考答案中有而生成答案中没有的内容。 ###要评估的指令：您的任务是评估针对查询 {query} 的生成答案和参考答案。 ###要评估的生成答案：{generated_answer} ###参考答案（得分 5）：{reference_answer} ###评分规则： 得分 1：如果生成的答案与用户查询和参考答案不相关。 得分 2：如果生成的答案符合参考答案，但与用户查询不相关。 得分 3：如果生成的答案与用户查询相关，并根据参考答案是正确的，但在事实方面有一些错误。 得分 4：如果生成的答案与用户查询相关，并且与参考答案具有完全相同的指标并正确，但不够简洁。 得分 5：如果生成的答案与用户查询相关，并且根据参考答案完全正确。 ###反馈："""

忠实度评估提示¶

In [ ]

已复制！





prometheus_faithfulness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given. 
	        1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
            4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)” 
            5. Please do not generate any other opening, closing, and explanations. 

        ###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.

        ###Information: {query_str} 

        ###Context: {context_str}
            
        ###Score Rubrics: 
        Score YES: If the given piece of information is supported by context.
        Score NO: If the given piece of information is not supported by context
    
        ###Feedback: """

prometheus_faithfulness_refine_prompt_template = """###Task Description: An instruction (might include an Input inside it), a information, a context information, an existing answer, and a score rubric representing a evaluation criteria are given. 
			1. You are provided with evaluation task with the help of information, context information and an existing answer.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general.
			3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
			4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)" 
			5. Please do not generate any other opening, closing, and explanations. 

			###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.

			###Existing answer: {existing_answer} 

            ###Information: {query_str}

            ###Context: {context_msg}
            
    		###Score Rubrics: 
            Score YES: If the existing answer is already YES or If the Information is present in the context.
            Score NO: If the existing answer is NO and If the Information is not present in the context.
    
    		###Feedback: """
prometheus_faithfulness_eval_prompt_template = """###任务描述：给定一条指令（可能包含一个输入）、一条信息、一段上下文以及一个代表评估标准的评分规则。 1. 根据评分规则，借助信息和上下文信息，您将执行评估任务并给出结果。 2. 严格根据评估任务和给定的评分规则，而不是一般性地，撰写详细的反馈。 3. 写完反馈后，写一个分数，可以是 YES 或 NO。您应该参考评分规则。 4. 输出格式应如下所示：“反馈：(根据标准撰写反馈) [结果] (YES 或 NO)” 5. 请不要生成任何其他开头、结尾和解释。 ###要评估的指令：您的任务是评估给定的信息是否得到上下文的支持。 ###信息：{query_str} ###上下文：{context_str} ###评分规则： 得分 YES：如果给定的信息得到上下文的支持。 得分 NO：如果给定的信息未得到上下文的支持。 ###反馈： """ prometheus_faithfulness_refine_prompt_template = """###任务描述：给定一条指令（可能包含一个输入）、一条信息、一段上下文信息、一个现有答案以及一个代表评估标准的评分规则。 1. 借助信息、上下文信息和现有答案，您将执行评估任务。 2. 严格根据评估任务和给定的评分规则，而不是一般性地，撰写详细的反馈。 3. 写完反馈后，写一个分数，可以是 YES 或 NO。您应该参考评分规则。 4. 输出格式应如下所示：“反馈：(根据标准撰写反馈) [结果] (YES 或 NO)” 5. 请不要生成任何其他开头、结尾和解释。 ###要评估的指令：如果信息存在于上下文且提供了现有答案。 ###现有答案：{existing_answer} ###信息：{query_str} ###上下文：{context_msg} ###评分规则： 得分 YES：如果现有答案已经是 YES 或信息存在于上下文。 得分 NO：如果现有答案是 NO 且信息不存在于上下文。 ###反馈： """

定义解析器函数¶

它将用于正确性评估器。

In [ ]

已复制！





from typing import Tuple
import re


def parser_function(output_str: str) -> Tuple[float, str]:
    # Pattern to match the feedback and response
    # This pattern looks for any text ending with '[RESULT]' followed by a number
    pattern = r"(.+?) \[RESULT\] (\d)"

    # Using regex to find all matches
    matches = re.findall(pattern, output_str)

    # Check if any match is found
    if matches:
        # Assuming there's only one match in the text, extract feedback and response
        feedback, score = matches[0]
        score = float(score.strip()) if score is not None else score
        return score, feedback.strip()
    else:
        return None, None
from typing import Tuple import re def parser_function(output_str: str) -> Tuple[float, str]: # Pattern to match the feedback and response # This pattern looks for any text ending with '[RESULT]' followed by a number pattern = r"(.+?) \[RESULT\] (\d)" # Using regex to find all matches matches = re.findall(pattern, output_str) # Check if any match is found if matches: # Assuming there's only one match in the text, extract feedback and response feedback, score = matches[0] score = float(score.strip()) if score is not None else score return score, feedback.strip() else: return None, None

定义正确性、忠实度和相关性评估器¶

In [ ]

已复制！





from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
import tiktoken


# CorrectnessEvaluator with Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    eval_template=prometheus_correctness_eval_prompt_template,
)

# FaithfulnessEvaluator with Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
    llm=prometheus_llm,
    eval_template=prometheus_faithfulness_eval_prompt_template,
    refine_template=prometheus_faithfulness_refine_prompt_template,
)

# RelevancyEvaluator with Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
    llm=prometheus_llm,
    eval_template=prometheus_relevancy_eval_prompt_template,
    refine_template=prometheus_relevancy_refine_prompt_template,
)

# Set the encoding model to `gpt-4` for token counting.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)

callback_manager = CallbackManager([token_counter])
gpt4_llm.callback_manager = callback_manager

# CorrectnessEvaluator with GPT-4 model
gpt4_correctness_evaluator = CorrectnessEvaluator(
    llm=gpt4_llm,
    # parser_function=parser_function,
)

# FaithfulnessEvaluator with GPT-4 model
gpt4_faithfulness_evaluator = FaithfulnessEvaluator(
    llm=gpt4_llm,
    eval_template=prometheus_faithfulness_eval_prompt_template,
    refine_template=prometheus_faithfulness_refine_prompt_template,
)

# RelevancyEvaluator with GPT-4 model
gpt4_relevancy_evaluator = RelevancyEvaluator(
    llm=gpt4_llm,
    eval_template=prometheus_relevancy_eval_prompt_template,
    refine_template=prometheus_relevancy_refine_prompt_template,
)

# create a dictionary of evaluators
prometheus_evaluators = {
    "correctness": prometheus_correctness_evaluator,
    "faithfulness": prometheus_faithfulness_evaluator,
    "relevancy": prometheus_relevancy_evaluator,
}

gpt4_evaluators = {
    "correctness": gpt4_correctness_evaluator,
    "faithfulness": gpt4_faithfulness_evaluator,
    "relevancy": gpt4_relevancy_evaluator,
}
from llama_index.core.evaluation import ( CorrectnessEvaluator, FaithfulnessEvaluator, RelevancyEvaluator, ) from llama_index.core.callbacks import CallbackManager, TokenCountingHandler import tiktoken # CorrectnessEvaluator with Prometheus model prometheus_correctness_evaluator = CorrectnessEvaluator( llm=prometheus_llm, parser_function=parser_function, eval_template=prometheus_correctness_eval_prompt_template, ) # FaithfulnessEvaluator with Prometheus model prometheus_faithfulness_evaluator = FaithfulnessEvaluator( llm=prometheus_llm, eval_template=prometheus_faithfulness_eval_prompt_template, refine_template=prometheus_faithfulness_refine_prompt_template, ) # RelevancyEvaluator with Prometheus model prometheus_relevancy_evaluator = RelevancyEvaluator( llm=prometheus_llm, eval_template=prometheus_relevancy_eval_prompt_template, refine_template=prometheus_relevancy_refine_prompt_template, ) # Set the encoding model to `gpt-4` for token counting. token_counter = TokenCountingHandler( tokenizer=tiktoken.encoding_for_model("gpt-4").encode ) callback_manager = CallbackManager([token_counter]) gpt4_llm.callback_manager = callback_manager # CorrectnessEvaluator with GPT-4 model gpt4_correctness_evaluator = CorrectnessEvaluator( llm=gpt4_llm, # parser_function=parser_function, ) # FaithfulnessEvaluator with GPT-4 model gpt4_faithfulness_evaluator = FaithfulnessEvaluator( llm=gpt4_llm, eval_template=prometheus_faithfulness_eval_prompt_template, refine_template=prometheus_faithfulness_refine_prompt_template, ) # RelevancyEvaluator with GPT-4 model gpt4_relevancy_evaluator = RelevancyEvaluator( llm=gpt4_llm, eval_template=prometheus_relevancy_eval_prompt_template, refine_template=prometheus_relevancy_refine_prompt_template, ) # create a dictionary of evaluators prometheus_evaluators = { "correctness": prometheus_correctness_evaluator, "faithfulness": prometheus_faithfulness_evaluator, "relevancy": prometheus_relevancy_evaluator, } gpt4_evaluators = { "correctness": gpt4_correctness_evaluator, "faithfulness": gpt4_faithfulness_evaluator, "relevancy": gpt4_relevancy_evaluator, }

让我们创建一个函数来为不同的数据集创建 `query_engine` 和 `rag_dataset`。¶

In [ ]

已复制！





from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex


def create_query_engine_rag_dataset(dataset_path):
    rag_dataset = LabelledRagDataset.from_json(
        f"{dataset_path}/rag_dataset.json"
    )
    documents = SimpleDirectoryReader(
        input_dir=f"{dataset_path}/source_files"
    ).load_data()

    index = VectorStoreIndex.from_documents(documents=documents)
    query_engine = index.as_query_engine()

    return query_engine, rag_dataset
from llama_index.core.llama_dataset import LabelledRagDataset from llama_index.core import SimpleDirectoryReader, VectorStoreIndex def create_query_engine_rag_dataset(dataset_path): rag_dataset = LabelledRagDataset.from_json( f"{dataset_path}/rag_dataset.json" ) documents = SimpleDirectoryReader( input_dir=f"{dataset_path}/source_files" ).load_data() index = VectorStoreIndex.from_documents(documents=documents) query_engine = index.as_query_engine() return query_engine, rag_dataset

在定义的评估器上运行批量评估的函数¶

In [ ]

已复制！





from llama_index.core.evaluation import BatchEvalRunner


async def batch_eval_runner(
    evaluators, query_engine, questions, reference=None, num_workers=8
):
    batch_runner = BatchEvalRunner(
        evaluators, workers=num_workers, show_progress=True
    )

    eval_results = await batch_runner.aevaluate_queries(
        query_engine, queries=questions, reference=reference
    )

    return eval_results
from llama_index.core.evaluation import BatchEvalRunner async def batch_eval_runner( evaluators, query_engine, questions, reference=None, num_workers=8 ): batch_runner = BatchEvalRunner( evaluators, workers=num_workers, show_progress=True ) eval_results = await batch_runner.aevaluate_queries( query_engine, queries=questions, reference=reference ) return eval_results

检查分数分布的函数¶

In [ ]

已复制！





from collections import Counter
from typing import List, Dict


def get_scores_distribution(scores: List[float]) -> Dict[str, float]:
    # Counting the occurrences of each score
    score_counts = Counter(scores)

    # Total number of scores
    total_scores = len(scores)

    # Calculating the percentage distribution
    percentage_distribution = {
        score: (count / total_scores) * 100
        for score, count in score_counts.items()
    }

    return percentage_distribution
from collections import Counter from typing import List, Dict def get_scores_distribution(scores: List[float]) -> Dict[str, float]: # Counting the occurrences of each score score_counts = Counter(scores) # Total number of scores total_scores = len(scores) # Calculating the percentage distribution percentage_distribution = { score: (count / total_scores) * 100 for score, count in score_counts.items() } return percentage_distribution

检查正确性、忠实度和相关性评估分数的函数¶

In [ ]

已复制！





def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Score: {round(score, 2)}")
    return score
def get_eval_results(key, eval_results): results = eval_results[key] correct = 0 for result in results: if result.passing: correct += 1 score = correct / len(results) print(f"{key} 分数: {round(score, 2)}") return score

计算 `Hamming Distance` 的函数。¶

In [ ]

已复制！

def hamming_distance(list1, list2):
    if len(list1) != len(list2):
        raise ValueError("Lists must be of the same length")
    return sum(el1 != el2 for el1, el2 in zip(list1, list2))
def hamming_distance(list1, list2): if len(list1) != len(list2): raise ValueError("Lists must be of the same length") return sum(el1 != el2 for el1, el2 in zip(list1, list2))

对 Paul Graham 文章文本进行评估¶

In [ ]

已复制！

query_engine, rag_dataset = create_query_engine_rag_dataset(
    "./data/paul_graham"
)
query_engine, rag_dataset = create_query_engine_rag_dataset( "./data/paul_graham" ) # 获取用于评估的问题 questions = [example.query for example in rag_dataset.examples] # 获取用于评估的参考答案 reference = [[example.reference_answer] for example in rag_dataset.examples]

In [ ]

已复制！

# Get questions for evaluation
questions = [example.query for example in rag_dataset.examples]

# Get reference answers for evaluation
reference = [[example.reference_answer] for example in rag_dataset.examples]
Compute Correctness, Faithfulness and Relevancy Evaluation¶

Compute Correctness, Faithfulness and Relevancy Evaluation¶

In [ ]

已复制！

prometheus_eval_results = await batch_eval_runner(
    prometheus_evaluators, query_engine, questions, reference
)
prometheus_eval_results = await batch_eval_runner( prometheus_evaluators, query_engine, questions, reference )

100%|██████████| 44/44 [00:30<00:00,  1.43it/s]
100%|██████████| 132/132 [01:56<00:00,  1.13it/s]

In [ ]

已复制！

gpt4_eval_results = await batch_eval_runner(
    gpt4_evaluators, query_engine, questions, reference
)
gpt4_eval_results = await batch_eval_runner( gpt4_evaluators, query_engine, questions, reference )

100%|██████████| 44/44 [00:26<00:00,  1.66it/s]
100%|██████████| 132/132 [02:32<00:00,  1.16s/it]

使用 Prometheus 评估器的正确性评估分数分布。¶

In [ ]

已复制！

prometheus_scores = [
    result.score for result in prometheus_eval_results["correctness"]
]
get_scores_distribution(prometheus_scores)
prometheus_scores = [ result.score for result in prometheus_eval_results["correctness"] ] get_scores_distribution(prometheus_scores)

Out[ ]

{3.0: 50.0,
 1.0: 43.18181818181818,
 5.0: 2.272727272727273,
 4.0: 4.545454545454546}

使用 GPT-4 评估器的正确性评估分数分布。¶

In [ ]

已复制！

gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]]
get_scores_distribution(gpt4_scores)
gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]] get_scores_distribution(gpt4_scores)

Out[ ]

{4.5: 50.0,
 5.0: 34.090909090909086,
 2.5: 9.090909090909092,
 4.0: 2.272727272727273,
 3.5: 4.545454545454546}

Prometheus 和 GPT-4 之间的反馈比较。¶

In [ ]

已复制！





query = prometheus_eval_results["correctness"][0].query
response = prometheus_eval_results["correctness"][0].response
reference_answer = reference[0][0]

# prometheus feedback and score
prometheus_feedback = prometheus_eval_results["correctness"][0].feedback
prometheus_score = prometheus_eval_results["correctness"][0].score

# GPT4 feedback and score
gpt4_feedback = gpt4_eval_results["correctness"][0].feedback
gpt4_score = gpt4_eval_results["correctness"][0].score
query = prometheus_eval_results["correctness"][0].query response = prometheus_eval_results["correctness"][0].response reference_answer = reference[0][0] # prometheus 反馈和分数 prometheus_feedback = prometheus_eval_results["correctness"][0].feedback prometheus_score = prometheus_eval_results["correctness"][0].score # GPT4 反馈和分数 gpt4_feedback = gpt4_eval_results["correctness"][0].feedback gpt4_score = gpt4_eval_results["correctness"][0].score

In [ ]

已复制！





print(f"Query: {query} \n\n")
print(f"Generated Answer: {response} \n\n")
print(f"Reference Answer: {reference_answer} \n\n")
print(
    f"Prometheus Feedback: {prometheus_feedback} \n\n {prometheus_score} \n\n"
)
print(f"GPT-4 Feedback: {gpt4_feedback} \n\n {gpt4_score}")
print(f"查询: {query} \n\n") print(f"生成答案: {response} \n\n") print(f"参考答案: {reference_answer} \n\n") print( f"Prometheus 反馈: {prometheus_feedback} \n\n {prometheus_score} \n\n" ) print(f"GPT-4 反馈: {gpt4_feedback} \n\n {gpt4_score}")

Query: In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.

Generated Answer: The author mentions that the first computer he used for programming was the IBM 1401, which was located in the basement of his junior high school. He used an early version of Fortran as the programming language. The author faced challenges in figuring out what to do with the computer, as the only form of input was data stored on punched cards, and he didn't have any. Additionally, he didn't know enough math to do anything interesting with the computer.

Reference Answer: The first computer the author used for programming was the IBM 1401, which was used by his school district for data processing. He started using it in 9th grade, around the age of 13 or 14. The programming language he used was an early version of Fortran. The author faced several challenges while using this computer. The only form of input to programs was data stored on punched cards, and he didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but he didn't know enough math to do anything interesting of that type. Therefore, he couldn't figure out what to do with it and in retrospect, he believes there's not much he could have done with it.

Prometheus Feedback: The generated response is relevant to the user query and correctly describes the first computer the author used for programming, the programming language he used, and the challenges he faced. However, it has some inaccuracies in the details. The author did not use the IBM 1401 in the basement of his junior high school, but rather in 9th grade, around the age of 13 or 14. The author did not have any data stored on punched cards, but the only form of input was data stored on punched cards. The author did not know enough math to do anything interesting with the computer, but he didn't know enough math to do anything interesting of that type. So the overall score is 3.

3.0

GPT-4 Feedback: The generated answer is highly relevant and almost completely accurate. It correctly identifies the first computer the author used (IBM 1401), the programming language (Fortran), and the challenges he faced (lack of input data and insufficient math knowledge). However, it omits the detail about the author's age and grade level when he started programming, which was included in the reference answer.

4.5

观察：¶

Prometheus 的反馈更详细，指出生成响应中遗漏了某些具体信息，导致得分为 3.0。相反，GPT-4 的反馈更宽泛且不够具体，尽管缺少一些细节，但给出了 5.0 的分数。

Prometheus 忠实度和相关性评估分数。¶

In [ ]

已复制！

_ = get_eval_results("faithfulness", prometheus_eval_results)

_ = get_eval_results("relevancy", prometheus_eval_results)
_ = get_eval_results("faithfulness", prometheus_eval_results) _ = get_eval_results("relevancy", prometheus_eval_results)

faithfulness Score: 0.75
relevancy Score: 0.86

GPT-4 忠实度和相关性评估分数。¶

In [ ]

已复制！

_ = get_eval_results("faithfulness", gpt4_eval_results)

_ = get_eval_results("relevancy", gpt4_eval_results)
_ = get_eval_results("faithfulness", gpt4_eval_results) _ = get_eval_results("relevancy", gpt4_eval_results)

faithfulness Score: 0.98
relevancy Score: 0.95

Prometheus 和 GPT-4 之间的 Hamming Distance 比较。¶

(越低越好)

In [ ]

已复制！





prometheus_faithfulness_scores = [
    result.score for result in prometheus_eval_results["faithfulness"]
]
prometheus_relevancy_scores = [
    result.score for result in prometheus_eval_results["relevancy"]
]

gpt4_faithfulness_scores = [
    result.score for result in gpt4_eval_results["faithfulness"]
]
gpt4_relevancy_scores = [
    result.score for result in gpt4_eval_results["relevancy"]
]

faithfulness_hamming_distance = hamming_distance(
    prometheus_faithfulness_scores, gpt4_faithfulness_scores
)
relevancy_hamming_distance = hamming_distance(
    prometheus_relevancy_scores, gpt4_relevancy_scores
)

print(f"Faithfulness Hamming Distance: {faithfulness_hamming_distance}")
print(f"Relevancy Hamming Distance: {relevancy_hamming_distance}")
prometheus_faithfulness_scores = [ result.score for result in prometheus_eval_results["faithfulness"] ] prometheus_relevancy_scores = [ result.score for result in prometheus_eval_results["relevancy"] ] gpt4_faithfulness_scores = [ result.score for result in gpt4_eval_results["faithfulness"] ] gpt4_relevancy_scores = [ result.score for result in gpt4_eval_results["relevancy"] ] faithfulness_hamming_distance = hamming_distance( prometheus_faithfulness_scores, gpt4_faithfulness_scores ) relevancy_hamming_distance = hamming_distance( prometheus_relevancy_scores, gpt4_relevancy_scores ) print(f"忠实度 Hamming Distance: {faithfulness_hamming_distance}") print(f"相关性 Hamming Distance: {relevancy_hamming_distance}")

Faithfulness Hamming Distance: 10
Relevancy Hamming Distance: 8

观察：¶

比较结果显示，Prometheus 和 GPT-4 的评估中，在 忠实度 和 相关性 方面，约有 77% 和 81% 的分数是相同的。这表明 Prometheus 和 GPT-4 模型在忠实度和相关性评分方面具有良好的相关性。

GPT-4 成本分析¶

In [ ]

已复制！

prompt_token_count = token_counter.prompt_llm_token_count
completion_token_count = token_counter.completion_llm_token_count

total_cost_paul_graham_essay = (
    prompt_token_count * 0.03 + completion_token_count * 0.06
) / 1000

token_counter.reset_counts()
prompt_token_count = token_counter.prompt_llm_token_count completion_token_count = token_counter.completion_llm_token_count total_cost_paul_graham_essay = ( prompt_token_count * 0.03 + completion_token_count * 0.06 ) / 1000 token_counter.reset_counts()

使用 Llama2 论文进行评估¶

In [ ]

已复制！

query_engine, rag_dataset = create_query_engine_rag_dataset("./data/llama2")
query_engine, rag_dataset = create_query_engine_rag_dataset("./data/llama2")

In [ ]

已复制！

questions = [example.query for example in rag_dataset.examples]
questions = [example.query for example in rag_dataset.examples]

In [ ]

已复制！

reference = [[example.reference_answer] for example in rag_dataset.examples]
reference = [[example.reference_answer] for example in rag_dataset.examples]

Compute Correctness, Faithfulness and Relevancy Evaluation¶

In [ ]

已复制！

prometheus_eval_results = await batch_eval_runner(
    prometheus_evaluators, query_engine, questions, reference
)
prometheus_eval_results = await batch_eval_runner( prometheus_evaluators, query_engine, questions, reference )

100%|██████████| 100/100 [01:02<00:00,  1.61it/s]
100%|██████████| 300/300 [04:34<00:00,  1.09it/s]

In [ ]

已复制！

gpt4_eval_results = await batch_eval_runner(
    gpt4_evaluators, query_engine, questions, reference
)
gpt4_eval_results = await batch_eval_runner( gpt4_evaluators, query_engine, questions, reference )

100%|██████████| 100/100 [01:06<00:00,  1.51it/s]
100%|██████████| 300/300 [06:22<00:00,  1.27s/it]

使用 Prometheus 评估器的正确性评估分数分布。¶

In [ ]

已复制！

prometheus_scores = [
    result.score for result in prometheus_eval_results["correctness"]
]
get_scores_distribution(prometheus_scores)
prometheus_scores = [ result.score for result in prometheus_eval_results["correctness"] ] get_scores_distribution(prometheus_scores)

Out[ ]

{3.0: 56.00000000000001, 1.0: 26.0, 5.0: 9.0, 4.0: 8.0, 2.0: 1.0}

使用 GPT-4 评估器的正确性评估分数分布。¶

In [ ]

已复制！

gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]]
get_scores_distribution(gpt4_scores)
gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]] get_scores_distribution(gpt4_scores)

Out[ ]

{4.5: 57.99999999999999,
 1.0: 6.0,
 4.0: 12.0,
 5.0: 10.0,
 2.0: 5.0,
 3.5: 5.0,
 2.5: 3.0,
 3.0: 1.0}

Prometheus 和 GPT-4 的正确性反馈比较。¶

In [ ]

已复制！





query = prometheus_eval_results["correctness"][0].query
response = prometheus_eval_results["correctness"][0].response
reference_answer = reference[0][0]

# prometheus feedback and score
prometheus_feedback = prometheus_eval_results["correctness"][0].feedback
prometheus_score = prometheus_eval_results["correctness"][0].score

# GPT4 feedback and score
gpt4_feedback = gpt4_eval_results["correctness"][0].feedback
gpt4_score = gpt4_eval_results["correctness"][0].score

print(f"Query: {query} \n\n")
print(f"Generated Answer: {response} \n\n")
print(f"Reference Answer: {reference_answer} \n\n")
print(
    f"Prometheus Feedback: {prometheus_feedback} \n\n {prometheus_score} \n\n"
)
print(f"GPT-4 Feedback: {gpt4_feedback} \n\n {gpt4_score}")
query = prometheus_eval_results["correctness"][0].query response = prometheus_eval_results["correctness"][0].response reference_answer = reference[0][0] # prometheus feedback and score prometheus_feedback = prometheus_eval_results["correctness"][0].feedback prometheus_score = prometheus_eval_results["correctness"][0].score # GPT4 feedback and score gpt4_feedback = gpt4_eval_results["correctness"][0].feedback gpt4_score = gpt4_eval_results["correctness"][0].score print(f"查询：{query} \n\n") print(f"生成的答案：{response} \n\n") print(f"参考答案：{reference_answer} \n\n") print( f"Prometheus 评估反馈：{prometheus_feedback} \n\n {prometheus_score} \n\n" ) print(f"GPT-4 评估反馈：{gpt4_feedback} \n\n {gpt4_score}")

Query: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?

Generated Answer: The two primary objectives achieved in this work are the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. The range of parameters for the large language models developed is from 7 billion to 70 billion.

Reference Answer: The two primary objectives achieved in the work described in the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models" are:

1. The development and release of a collection of pretrained and fine-tuned large language models (LLMs) specifically optimized for dialogue use cases.
2. The demonstration that these fine-tuned LLMs, referred to as Llama 2-Chat, outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models, particularly in terms of helpfulness and safety based on human evaluations.

The range of parameters for the large language models developed in this work is from 7 billion to 70 billion parameters.

Prometheus Feedback: The generated response is relevant to the user query and correctly identifies the two primary objectives of the work described in the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models." However, it does not mention the demonstration of the fine-tuned LLMs outperforming open-source chat models on most benchmarks tested, which is a key point in the reference response. The range of parameters for the large language models developed is correctly identified, but the response does not mention the specific models referred to as Llama 2-Chat. So the overall score is 3.

3.0

GPT-4 Feedback: The generated answer is relevant and almost fully correct. It correctly identifies the two primary objectives and the range of parameters for the large language models. However, it misses the detail about Llama 2-Chat outperforming other models on most benchmarks and potentially being a suitable substitute for closed-source models.

4.5

观察：¶

与 GPT-4 相比，Prometheus 的反馈稍微更精确一些，并且会惩罚并给出 3.0 的分数，而 GPT-4 则给出 4.5 的分数。

Prometheus 忠实度和相关性评估分数。¶

In [ ]

已复制！

_ = get_eval_results("faithfulness", prometheus_eval_results)

_ = get_eval_results("relevancy", prometheus_eval_results)
_ = get_eval_results("faithfulness", prometheus_eval_results) _ = get_eval_results("relevancy", prometheus_eval_results)

faithfulness Score: 0.39
relevancy Score: 0.57

GPT-4 忠实度和相关性评估分数。¶

In [ ]

已复制！

_ = get_eval_results("faithfulness", gpt4_eval_results)

_ = get_eval_results("relevancy", gpt4_eval_results)
_ = get_eval_results("faithfulness", gpt4_eval_results) _ = get_eval_results("relevancy", gpt4_eval_results)

faithfulness Score: 0.93
relevancy Score: 0.98

Prometheus 和 GPT-4 之间的 Hamming Distance 比较。¶

In [ ]

已复制！





prometheus_faithfulness_scores = [
    result.score for result in prometheus_eval_results["faithfulness"]
]
prometheus_relevancy_scores = [
    result.score for result in prometheus_eval_results["relevancy"]
]

gpt4_faithfulness_scores = [
    result.score for result in gpt4_eval_results["faithfulness"]
]
gpt4_relevancy_scores = [
    result.score for result in gpt4_eval_results["relevancy"]
]

faithfulness_hamming_distance = hamming_distance(
    prometheus_faithfulness_scores, gpt4_faithfulness_scores
)
relevancy_hamming_distance = hamming_distance(
    prometheus_relevancy_scores, gpt4_relevancy_scores
)

print(f"Faithfulness Hamming Distance: {faithfulness_hamming_distance}")
print(f"Relevancy Hamming Distance: {relevancy_hamming_distance}")
prometheus_faithfulness_scores = [ result.score for result in prometheus_eval_results["faithfulness"] ] prometheus_relevancy_scores = [ result.score for result in prometheus_eval_results["relevancy"] ] gpt4_faithfulness_scores = [ result.score for result in gpt4_eval_results["faithfulness"] ] gpt4_relevancy_scores = [ result.score for result in gpt4_eval_results["relevancy"] ] faithfulness_hamming_distance = hamming_distance( prometheus_faithfulness_scores, gpt4_faithfulness_scores ) relevancy_hamming_distance = hamming_distance( prometheus_relevancy_scores, gpt4_relevancy_scores ) print(f"忠实度 Hamming Distance: {faithfulness_hamming_distance}") print(f"相关性 Hamming Distance: {relevancy_hamming_distance}")

Faithfulness Hamming Distance: 58
Relevancy Hamming Distance: 41

观察：¶

比较结果显示，Prometheus 和 GPT-4 的评估中，在 忠实度 方面约有 44% 的分数相同，在 相关性 方面约有 63% 的分数相同。这表明 Prometheus 和 GPT-4 模型在忠实度和相关性评分方面具有相当程度的相关性。

Prometheus 和 GPT-4 的忠实度和相关性反馈比较¶

In [ ]

已复制！





# Get the query
query = questions[0]

# Get the response/ generated answer for the query
response = prometheus_eval_results["faithfulness"][0].response
# Get the retrieved contexts as they are used for faithfulness and relevancy
contexts = prometheus_eval_results["faithfulness"][0].contexts

# Get the faithfulness and relevancy feedbacks from prometheus model
prometheus_faithfulness_feedback = prometheus_eval_results["faithfulness"][
    0
].feedback
prometheus_relevancy_feedback = prometheus_eval_results["relevancy"][
    0
].feedback

# Get the faithfulness and relevancy feedbacks from gpt4 model
gpt4_faithfulness_feedback = gpt4_eval_results["faithfulness"][0].feedback
gpt4_relevancy_feedback = gpt4_eval_results["relevancy"][0].feedback

# Get the failthfulness and relevancy scores from prometheus model
prometheus_faithfulness_score = prometheus_eval_results["faithfulness"][
    0
].score
prometheus_relevancy_score = prometheus_eval_results["relevancy"][0].score

# Get the faithfulness and relevancy scores from gpt4 model
gpt4_faithfulness_score = gpt4_eval_results["faithfulness"][0].score
gpt4_relevancy_score = gpt4_eval_results["relevancy"][0].score
# Get the query query = questions[0] # Get the response/ generated answer for the query response = prometheus_eval_results["faithfulness"][0].response # Get the retrieved contexts as they are used for faithfulness and relevancy contexts = prometheus_eval_results["faithfulness"][0].contexts # Get the faithfulness and relevancy feedbacks from prometheus model prometheus_faithfulness_feedback = prometheus_eval_results["faithfulness"][ 0 ].feedback prometheus_relevancy_feedback = prometheus_eval_results["relevancy"][ 0 ].feedback # Get the faithfulness and relevancy feedbacks from gpt4 model gpt4_faithfulness_feedback = gpt4_eval_results["faithfulness"][0].feedback gpt4_relevancy_feedback = gpt4_eval_results["relevancy"][0].feedback # Get the failthfulness and relevancy scores from prometheus model prometheus_faithfulness_score = prometheus_eval_results["faithfulness"][ 0 ].score prometheus_relevancy_score = prometheus_eval_results["relevancy"][0].score # Get the faithfulness and relevancy scores from gpt4 model gpt4_faithfulness_score = gpt4_eval_results["faithfulness"][0].score gpt4_relevancy_score = gpt4_eval_results["relevancy"][0].score

In [ ]

已复制！

print(f"Query: {query} \n\n")
print(f"Generated Answer: {response}")
print(f"查询：{query} \n\n") print(f"生成的答案：{response}")

Query: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed? 


Generated Answer: The two primary objectives achieved in this work are the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. The range of parameters for the large language models developed is from 7 billion to 70 billion.

In [ ]

已复制！

print(f"Context-1: {contexts[0]}")
print(f"上下文-1：{contexts[0]}")

Context-1: Llama 2 : Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗Louis Martin†Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
ourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.arXiv:2307.09288v2  [cs.CL]  19 Jul 2023

In [ ]

已复制！

print(f"Context-2: {contexts[1]}")
print(f"上下文-2：{contexts[1]}")

Context-2: (2021)alsoilluminatesthedifficultiestiedtochatbot-oriented
LLMs, with concerns ranging from privacy to misleading expertise claims. Deng et al. (2023) proposes
a taxonomic framework to tackle these issues, and Bergman et al. (2022) delves into the balance between
potential positive and negative impacts from releasing dialogue models.
InvestigationsintoredteamingrevealspecificchallengesintunedLLMs,withstudiesbyGangulietal.(2022)
and Zhuoet al. (2023) showcasing a variety ofsuccessful attack typesand their effects onthe generation of
harmful content. National security agencies and various researchers, such as (Mialon et al., 2023), have also
raisedredflagsaroundadvancedemergentmodelbehaviors,cyberthreats,andpotentialmisuseinareaslike
biological warfare. Lastly, broader societal issues like job displacement due to accelerated AI research and an
over-reliance on LLMs leading to training data degradation are also pertinent considerations (Acemoglu
andRestrepo,2018;AutorandSalomons,2018;Webb,2019;Shumailovetal.,2023). Wearecommittedto
continuing our work engaging with the broader policy, academic, and industry community on these issues.
7 Conclusion
Inthisstudy,wehaveintroduced Llama 2,anewfamilyofpretrainedandfine-tunedmodelswithscales
of7billionto70billionparameters. Thesemodelshavedemonstratedtheircompetitivenesswithexisting
open-source chat models, as well as competency that is equivalent to some proprietary models on evaluation
setsweexamined,althoughtheystilllagbehindothermodelslikeGPT-4. Wemeticulouslyelaboratedonthe
methodsandtechniquesappliedinachievingourmodels,withaheavyemphasisontheiralignmentwiththe
principlesofhelpfulnessandsafety. Tocontributemoresignificantlytosocietyandfosterthepaceofresearch,
wehaveresponsiblyopenedaccessto Llama 2 andLlama 2-Chat . Aspartofourongoingcommitmentto
transparency and safety, we plan to make further improvements to Llama 2-Chat in future work.
36

In [ ]

已复制！





print(
    f"Prometheus Faithfulness Feedback: {prometheus_faithfulness_feedback}\n\n"
)
print(f"Prometheus Faithfulness Score: {prometheus_faithfulness_score}\n\n")
print(f"Prometheus Relevancy Feedback: {prometheus_relevancy_feedback}\n\n")
print(f"Prometheus Relevancy Score: {prometheus_relevancy_score}")
print( f"Prometheus 忠实度评估反馈：{prometheus_faithfulness_feedback}\n\n" ) print(f"Prometheus 忠实度评分：{prometheus_faithfulness_score}\n\n") print(f"Prometheus 相关性评估反馈：{prometheus_relevancy_feedback}\n\n") print(f"Prometheus 相关性评分：{prometheus_relevancy_score}")

Prometheus Faithfulness Feedback: 
        The information provided in the context is not supported by the given information. The context is about the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. However, the information provided in the context does not align with the given information. The context does not mention the range of parameters for the large language models developed, which is the primary objective mentioned in the information. The context only talks about the development and release of Llama 2 and its optimization for dialogue use cases, but it does not provide any information about the range of parameters for the large language models developed. So the overall score is NO. [RESULT] NO


Prometheus Faithfulness Score: 0.0


Prometheus Relevancy Feedback: 
        The response is not in line with the context information provided. The query asked for the two primary objectives achieved in the work and the range of parameters for the large language models developed. However, the response provided the abstract of the paper and mentioned the authors, which is not relevant to the query. The response also did not mention the two primary objectives achieved in the work or the range of parameters for the large language models developed. So the overall score is NO. [RESULT] NO


Prometheus Relevancy Score: 0.0

如果您比较评估反馈和上下文，会发现上下文和答案中提到了参数范围，但评估反馈却说模型找不到这些信息。¶

In [ ]

已复制！

print(f"GPT-4 Faithfulness Feedback: {gpt4_faithfulness_feedback}\n\n")
print(f"GPT-4 Faithfulness Score: {gpt4_faithfulness_score}\n\n")
print(f"GPT-4 Relevancy Feedback: {gpt4_relevancy_feedback}\n\n")
print(f"GPT-4 Relevancy Score: {gpt4_relevancy_score}")
print(f"GPT-4 忠实度评估反馈：{gpt4_faithfulness_feedback}\n\n") print(f"GPT-4 忠实度评分：{gpt4_faithfulness_score}\n\n") print(f"GPT-4 相关性评估反馈：{gpt4_relevancy_feedback}\n\n") print(f"GPT-4 相关性评分：{gpt4_relevancy_score}")

GPT-4 Faithfulness Feedback: The given piece of information is well supported by the context. The context clearly states that Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), was developed and released. It also mentions that these models range in scale from 7 billion to 70 billion parameters. Furthermore, the context confirms that these models are optimized for dialogue use cases. Therefore, the information provided is accurate and is corroborated by the context. [RESULT] YES


GPT-4 Faithfulness Score: 1.0


GPT-4 Relevancy Feedback: The response accurately reflects the context provided. The response correctly identifies the two primary objectives of the work as the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. This is in line with the information provided in the abstract of the context. The response also correctly states the range of parameters for the large language models developed as being from 7 billion to 70 billion, which is also confirmed in the context. Therefore, the response is in line with the context information provided. [RESULT] YES


GPT-4 Relevancy Score: 1.0

与 Prometheus 模型不同，GPT-4 对此评估是正确的。¶

GPT-4 成本分析¶

In [ ]

已复制！

prompt_token_count = token_counter.prompt_llm_token_count
completion_token_count = token_counter.completion_llm_token_count

total_cost_llama2 = (
    prompt_token_count * 0.03 + completion_token_count * 0.06
) / 1000
prompt_token_count = token_counter.prompt_llm_token_count completion_token_count = token_counter.completion_llm_token_count total_cost_llama2 = ( prompt_token_count * 0.03 + completion_token_count * 0.06 ) / 1000

总成本分析¶

Prometheus 模型 - 对于 `144` 个查询（Paul Graham 文章为 `44` 个，Llama2 论文为 `100` 个），成本为 `$2.167`，即每个查询 `$0.015`。¶

GPT-4 模型 - `$22` (total_cost_paul_graham_essay + total_cost_llama2) - 即每个查询 `$0.15`。¶

观察：¶

评估成本（约）：Prometheus 模型为 $2.167，GPT-4 为 $22。
尽管 Prometheus 模型提供了比 GPT-4 更详细的反馈，但偶尔也会提供不正确的反馈，因此需要谨慎使用。
如果生成的答案缺少参考答案中的某些事实，Prometheus 模型会比 GPT-4 对分数施加更严格的惩罚。
与 GPT-4 相比，Prometheus 的忠实度和相关性反馈在反馈中表现出更多的幻觉/错误解释。
Prometheus 和 GPT-4 的忠实度和相关性分数在两个数据集上的共同性不同，因此在生产环境中应谨慎使用。

注意：HF 上的端点部署在 AWS Nvidia A100G · 1x GPU · 80 GB 上，成本为 $6.5/h。此处我们使用Prometheus 模型进行分析。我们还使用量化版本的 Prometheus 模型（原始模型）进行了类似分析，并观察到与原始未量化模型相比，反馈中的幻觉稍多。感谢论文作者和 Tom Jobbins 提供了模型的量化版本。

使用 Prometheus 模型进行评估¶

注意：此处展示的是用于分析的原始 Prometheus 模型。您可以使用模型的量化版本重新运行分析。¶

下载数据集¶

定义托管在 HuggingFace 上的 Prometheus LLM。¶

提示模板。¶

正确性评估提示¶

忠实度评估提示¶

相关性评估提示¶

定义解析器函数¶

定义正确性、忠实度和相关性评估器¶

让我们创建一个函数来为不同的数据集创建 query_engine 和 rag_dataset。¶

在定义的评估器上运行批量评估的函数¶

检查分数分布的函数¶

检查正确性、忠实度和相关性评估分数的函数¶

计算 Hamming Distance 的函数。¶

对 Paul Graham 文章文本进行评估¶

Compute Correctness, Faithfulness and Relevancy Evaluation¶

使用 Prometheus 评估器的正确性评估分数分布。¶

使用 GPT-4 评估器的正确性评估分数分布。¶

Prometheus 和 GPT-4 之间的反馈比较。¶

观察：¶

Prometheus 忠实度和相关性评估分数。¶

GPT-4 忠实度和相关性评估分数。¶

Prometheus 和 GPT-4 之间的 Hamming Distance 比较。¶

观察：¶

GPT-4 成本分析¶

使用 Llama2 论文进行评估¶

Compute Correctness, Faithfulness and Relevancy Evaluation¶

使用 Prometheus 评估器的正确性评估分数分布。¶

使用 GPT-4 评估器的正确性评估分数分布。¶

Prometheus 和 GPT-4 的正确性反馈比较。¶

观察：¶

Prometheus 忠实度和相关性评估分数。¶

GPT-4 忠实度和相关性评估分数。¶

Prometheus 和 GPT-4 之间的 Hamming Distance 比较。¶

观察：¶

Prometheus 和 GPT-4 的忠实度和相关性反馈比较¶

如果您比较评估反馈和上下文，会发现上下文和答案中提到了参数范围，但评估反馈却说模型找不到这些信息。¶

与 Prometheus 模型不同，GPT-4 对此评估是正确的。¶

GPT-4 成本分析¶

总成本分析¶

Prometheus 模型 - 对于 144 个查询（Paul Graham 文章为 44 个，Llama2 论文为 100 个），成本为 $2.167，即每个查询 $0.015。¶

GPT-4 模型 - $22 (total_cost_paul_graham_essay + total_cost_llama2) - 即每个查询 $0.15。¶

观察：¶

让我们创建一个函数来为不同的数据集创建 `query_engine` 和 `rag_dataset`。¶

计算 `Hamming Distance` 的函数。¶

Prometheus 模型 - 对于 `144` 个查询（Paul Graham 文章为 `44` 个，Llama2 论文为 `100` 个），成本为 `$2.167`，即每个查询 `$0.015`。¶

GPT-4 模型 - `$22` (total_cost_paul_graham_essay + total_cost_llama2) - 即每个查询 `$0.15`。¶