Answer Relevancy and Context Relevancy Evaluations
In this notebook, we demonstrate how to use the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes to measure the relevancy of a generated answer and of the retrieved contexts, respectively, to a given user query. Both of these evaluators return a score between 0 and 1, as well as a generated feedback explaining the score. Note that higher scores mean higher relevancy. In particular, we prompt the judge LLM to take a step-by-step approach in providing a relevancy score, asking it to answer the following two questions about the relevancy of the generated answer to the query (for context relevancy, these questions are slightly adjusted):
- Does the provided response match the subject matter of the user's query?
- Does the provided response attempt to address the focus or perspective on the subject matter taken on by the user's query?
Each question is worth 1 point, so a perfect evaluation yields a score of 2/2.
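To make the scoring concrete, here is a minimal sketch of the normalization: the judge's raw tally is divided by the maximum attainable points. (The helper below is illustrative, not the library's internal parsing; the 4-point total for context relevancy is inferred from the feedbacks shown later in this notebook.)
# Hypothetical sketch: map a judge's raw tally onto the [0, 1] score scale.
def normalize_judge_score(raw_points: float, max_points: float) -> float:
    """Divide the judge's raw tally by the maximum attainable points."""
    return raw_points / max_points

# Answer relevancy: two 1-point questions, so a perfect run is 2/2 -> 1.0.
assert normalize_judge_score(2.0, 2.0) == 1.0
# Context relevancy feedbacks below report e.g. "[RESULT] 1.5" out of 4 -> 0.375.
assert normalize_judge_score(1.5, 4.0) == 0.375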
%pip install llama-index-llms-openai
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio
nest_asyncio.apply()
def displayify_df(df):
"""For pretty displaying DataFrame in a notebook."""
display_df = df.style.set_properties(
**{
"inline-size": "300px",
"overflow-wrap": "break-word",
}
)
display(display_df)
Download the dataset (LabelledRagDataset)
For this demonstration, we will use a llama-dataset provided through llama-hub.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
"EvaluatingLlmSurveyPaperDataset", "./data"
)
rag_dataset.to_pandas()[:5]
| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | What are the potential risks of large language models (LLMs)... | [Evaluating Large Language Models: A\nComprehensive... | According to the context information, potential... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 1 | How does the survey categorize the evaluation of LLMs... | [Evaluating Large Language Models: A\nComprehensive... | The survey categorizes the evaluation of LLMs into... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 2 | What different types of reasoning are discussed... | [Contents\n1 Introduction 4\n2 Taxonomy and Roadmap... | The different types of reasoning discussed in the... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 3 | How is toxicity in language models evaluated... | [Contents\n1 Introduction 4\n2 Taxonomy and Roadmap... | Toxicity in language models is evaluated based on... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 4 | In the context of specialized LLMs evaluation... | [5.1.3 Alignment Robustness . . . . . . . . . ... | In the context of specialized LLMs evaluation... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
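Each row of the DataFrame corresponds to an example object carrying the query and its references; the examples can also be accessed programmatically, which is what the evaluation loop further below relies on. A quick peek (attribute names follow the columns shown above):
# Inspect the first example directly rather than through the DataFrame view.
example = rag_dataset.examples[0]
print(example.query)
print(example.reference_answer)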
Next, we build a RAG over the same source documents used to create the rag_dataset.
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
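By default, as_query_engine() uses the index's standard retrieval settings. If you want to probe how retrieval depth affects context relevancy, you can pass similarity_top_k (the value below is purely illustrative): retrieving more chunks can raise recall, but may drag the context relevancy score down if the extra chunks are off-topic.
# Optional variant: retrieve four chunks per query instead of the default.
query_engine_top4 = index.as_query_engine(similarity_top_k=4)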
With our RAG (i.e., query_engine) defined, we can use it to make predictions (i.e., generate responses to the queries) over the rag_dataset.
prediction_dataset = await rag_dataset.amake_predictions_with(
predictor=query_engine, batch_size=100, show_progress=True
)
Batch processing of predictions: 100%|████████████████████| 100/100 [00:08<00:00, 12.12it/s]
Batch processing of predictions: 100%|████████████████████| 100/100 [00:08<00:00, 12.37it/s]
Batch processing of predictions: 100%|██████████████████████| 76/76 [00:06<00:00, 10.93it/s]
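Since generating predictions consumes LLM calls, it can be worth caching them to disk. A sketch using the llama-dataset persistence helpers (save_json/from_json); the exact class name may differ across installed versions, so double-check against yours:
# Cache predictions so that notebook re-runs can skip the generation step.
prediction_dataset.save_json("prediction_dataset.json")

# Later: reload instead of re-predicting (class name per your installed version).
from llama_index.core.llama_dataset import RagPredictionDataset

prediction_dataset = RagPredictionDataset.from_json("prediction_dataset.json")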
Evaluating Answer and Context Relevancy Separately
We first need to define our evaluators (i.e., AnswerRelevancyEvaluator and ContextRelevancyEvaluator):
# instantiate the LLM judges (gpt-3.5-turbo for answers, gpt-4 for contexts)
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
AnswerRelevancyEvaluator,
ContextRelevancyEvaluator,
)
judges = {}
judges["answer_relevancy"] = AnswerRelevancyEvaluator(
llm=OpenAI(temperature=0, model="gpt-3.5-turbo"),
)
judges["context_relevancy"] = ContextRelevancyEvaluator(
llm=OpenAI(temperature=0, model="gpt-4"),
)
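Before launching the full batch, it can help to smoke-test a single judge call; the evaluators also expose a synchronous evaluate() with the same signature as the aevaluate() used below:
# Sanity check: one answer-relevancy judgement on the first pair.
test_result = judges["answer_relevancy"].evaluate(
    query=rag_dataset.examples[0].query,
    response=prediction_dataset.predictions[0].response,
)
print(test_result.score, test_result.feedback)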
Now, let's use our evaluators to make evaluations by looping through all of the <example, prediction> pairs.
eval_tasks = []
for example, prediction in zip(
rag_dataset.examples, prediction_dataset.predictions
):
eval_tasks.append(
judges["answer_relevancy"].aevaluate(
query=example.query,
response=prediction.response,
sleep_time_in_seconds=1.0,
)
)
eval_tasks.append(
judges["context_relevancy"].aevaluate(
query=example.query,
contexts=prediction.contexts,
sleep_time_in_seconds=1.0,
)
)
eval_results1 = await tqdm_asyncio.gather(*eval_tasks[:250])
100%|█████████████████████████████████████████████████████| 250/250 [00:28<00:00, 8.85it/s]
eval_results2 = await tqdm_asyncio.gather(*eval_tasks[250:])
100%|█████████████████████████████████████████████████████| 302/302 [00:31<00:00, 9.62it/s]
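Gathering the tasks in two slices, as above, is a simple way to stay under API rate limits. The same idea generalizes to a small helper (a sketch; the batch size is arbitrary):
# Await coroutines in fixed-size chunks to avoid firing every request at once.
async def gather_in_batches(tasks, batch_size=250):
    results = []
    for i in range(0, len(tasks), batch_size):
        results += await tqdm_asyncio.gather(*tasks[i : i + batch_size])
    return results

# Equivalent to the two manual gathers above:
# eval_results = await gather_in_batches(eval_tasks)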
eval_results = eval_results1 + eval_results2
evals = {
    # tasks were appended in alternating order (answer, context, answer, ...),
    # so de-interleave the gathered results with stride-2 slices
    "answer_relevancy": eval_results[::2],
    "context_relevancy": eval_results[1::2],
}
Taking a Look at the Evaluation Results
Here we use a utility function to convert the list of EvaluationResult objects into something more notebook-friendly. This utility provides two DataFrames: a deep one containing all of the evaluation results, and a mean one that aggregates by computing the mean score per evaluation method.
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
import pandas as pd
deep_dfs = {}
mean_dfs = {}
for metric in evals.keys():
deep_df, mean_df = get_eval_results_df(
names=["baseline"] * len(evals[metric]),
results_arr=evals[metric],
metric=metric,
)
deep_dfs[metric] = deep_df
mean_dfs[metric] = mean_df
mean_scores_df = pd.concat(
[mdf.reset_index() for _, mdf in mean_dfs.items()],
axis=0,
ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
| rag | baseline |
|---|---|
| metrics | |
| mean_answer_relevancy_score | 0.914855 |
| mean_context_relevancy_score | 0.572273 |
The utility function above also provides the mean scores of all of the evaluations in mean_df.
We can get a look at the raw distribution of the scores by invoking value_counts() on a deep_df.
deep_dfs["answer_relevancy"]["scores"].value_counts()
scores
1.0    250
0.0     21
0.5      5
Name: count, dtype: int64
deep_dfs["context_relevancy"]["scores"].value_counts()
scores
1.000    89
0.000    70
0.750    49
0.250    23
0.625    14
0.500    11
0.375    10
0.875     9
Name: count, dtype: int64
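As a sanity check, averaging these raw per-example scores should reproduce the summary table above (roughly 0.915 for answer relevancy and 0.572 for context relevancy):
# Recompute the means directly from the per-example scores.
print(deep_dfs["answer_relevancy"]["scores"].mean())
print(deep_dfs["context_relevancy"]["scores"].mean())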
It appears that, for the most part, the default RAG does a fairly decent job of generating answers that are relevant to the queries. Taking a look at the records of any deep_df provides further details.
displayify_df(deep_dfs["context_relevancy"].head(2))
| | rag | query | answer | contexts | scores | feedbacks |
|---|---|---|---|---|---|---|
| 0 | baseline | What are the potential risks of large language models (LLMs) according to the context information? | None | ['Evaluating Large Language Models: A\nComprehensive Survey\nZishan Guo∗, Renren Jin∗, Chuang Liu∗, Yufei Huang, Dan Shi, Supryadi\nLinhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong†\nTianjin University\n{guozishan, rrjin, liuc_09, yuki_731, shidan, supryadi}@tju.edu.cn\n{linhaoyu, yan_liu, jiaxuanlee, xbj1355, dyxiong}@tju.edu.cn\nAbstract\nLarge language models (LLMs) have demonstrated remarkable capabilities\nacross a broad spectrum of tasks. They have attracted significant attention\nand been deployed in numerous downstream applications. Nevertheless, akin\nto a double-edged sword, LLMs also present potential risks. They could\nsuffer from private data leaks or yield inappropriate, harmful, or misleading\ncontent. Additionally, the rapid progress of LLMs raises concerns about the\npotential emergence of superintelligent systems without adequate safeguards.\nTo effectively capitalize on LLM capacities as well as ensure their safe and\nbeneficial development, it is critical to conduct a rigorous and comprehensive\nevaluation of LLMs.\nThis survey endeavors to offer a panoramic perspective on the evaluation\nof LLMs. We categorize the evaluation of LLMs into three major groups:\nknowledgeandcapabilityevaluation, alignmentevaluationandsafetyevaluation.\nIn addition to the comprehensive review on the evaluation methodologies and\nbenchmarks on these three aspects, we collate a compendium of evaluations\npertaining to LLMs’ performance in specialized domains, and discuss the\nconstruction of comprehensive evaluation platforms that cover LLM evaluations\non capabilities, alignment, safety, and applicability.\nWe hope that this comprehensive overview will stimulate further research\ninterests in the evaluation of LLMs, with the ultimate goal of making evaluation\nserve as a cornerstone in guiding the responsible development of LLMs. We\nenvision that this will channel their evolution into a direction that maximizes\nsocietal benefit while minimizing potential risks. A curated list of related\npapers has been publicly available at a GitHub repository.1\n∗Equal contribution\n†Corresponding author.\n1https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers\n1arXiv:2310.19736v3 [cs.CL] 25 Nov 2023', 'criteria. Multilingual Holistic Bias (Costa-jussà et al., 2023) extends the HolisticBias dataset\nto 50 languages, achieving the largest scale of English template-based text expansion.\nWhether using automatic or manual evaluations, both approaches inevitably carry human\nsubjectivity and cannot establish a comprehensive and fair evaluation standard. Unqover\n(Li et al., 2020) is the first to transform the task of evaluating biases generated by models\ninto a multiple-choice question, covering gender, nationality, race, and religion categories.\nThey provide models with ambiguous and disambiguous contexts and ask them to choose\nbetween options with and without stereotypes, evaluating both PLMs and models fine-tuned\non multiple-choice question answering datasets. BBQ (Parrish et al., 2022) adopts this\napproach but extends the types of biases to nine categories. All sentence templates are\nmanually created, and in addition to the two contrasting group answers, the model is also\nprovided with correct answers like “I don’t know” and “I’m not sure”, and a statistical bias\nscore metric is proposed to evaluate multiple question answering models. CBBQ (Huang\n& Xiong, 2023) extends BBQ to Chinese. Based on Chinese socio-cultural factors, CBBQ\nadds four categories: disease, educational qualification, household registration, and region.\nThey manually rewrite ambiguous text templates and use GPT-4 to generate disambiguous\ntemplates, greatly increasing the dataset’s diversity and extensibility. Additionally, they\nimprove the experimental setup for LLMs and evaluate existing Chinese open-source LLMs,\nfinding that current Chinese LLMs not only have higher bias scores but also exhibit behavioral\ninconsistencies, revealing a significant gap compared to GPT-3.5-Turbo.\nIn addition to these aforementioned evaluation methods, we could also use advanced LLMs for\nscoring bias, such as GPT-4, or employ models that perform best in training bias detection\ntasks to detect the level of bias in answers. Such models can be used not only in the evaluation\nphase but also for identifying biases in data for pre-training LLMs, facilitating debiasing in\ntraining data.\nAs the development of multilingual LLMs and domain-specific LLMs progresses, studies on\nthe fairness of these models become increasingly important. Zhao et al. (2020) create datasets\nto study gender bias in multilingual embeddings and cross-lingual tasks, revealing gender\nbias from both internal and external perspectives. Moreover, FairLex (Chalkidis et al., 2022)\nproposes a multilingual legal dataset as fairness benchmark, covering four judicial jurisdictions\n(European Commission, United States, Swiss Federation, and People’s Republic of China), five\nlanguages (English, German, French, Italian, and Chinese), and various sensitive attributes\n(gender, age, region, etc.). As LLMs have been applied and deployed in the finance and legal\nsectors, these studies deserve high attention.\n4.3 Toxicity\nLLMs are usually trained on a huge amount of online data which may contain toxic behavior\nand unsafe content. These include hate speech, offensive/abusive language, pornographic\ncontent, etc. It is hence very desirable to evaluate how well trained LLMs deal with toxicity.\nConsidering the proficiency of LLMs in understanding and generating sentences, we categorize\nthe evaluation of toxicity into two tasks: toxicity identification and classification evaluation,\nand the evaluation of toxicity in generated sentences.\n29'] | 1.000000 | 1. The retrieved context does match the subject matter of the user's query. It discusses the potential risks associated with large language models (LLMs), including private data leaks, inappropriate, harmful, or misleading content, and the emergence of superintelligent systems without adequate safeguards. It also discusses the possibility of bias in LLMs and the risk of toxicity in the content they generate. Therefore, it is relevant to the user's query about the potential risks of LLMs. (2/2) 2. The retrieved context can be used to provide a full answer to the user's query. It provides a comprehensive overview of the potential risks associated with LLMs, covering data privacy, inappropriate content, superintelligence, bias, and toxicity. It also discusses the importance of, and methods for, evaluating these risks. Therefore, it provides a full answer to the user's query. (2/2) [RESULT] 4/4 |
| 1 | baseline | How does the survey categorize the evaluation of Large Language Models (LLMs), and what are the three major categories mentioned? | None | ['Question \nAnsweringTool \nLearning\nReasoning\nKnowledge \nCompletionEthics \nand \nMorality Bias\nToxicity\nTruthfulnessRobustnessEvaluation\nRisk \nEvaluation\nBiology and \nMedicine\nEducationLegislationComputer \nScienceFinance\nBenchmarks for\nHolistic Evaluation\nBenchmarks \nforKnowledge and Reasoning\nBenchmarks \nforNLU and NLGKnowledge and Capability\nLarge Language \nModel EvaluationAlignment Evaluation\nSafety\nSpecialized LLMs\nEvaluation Organization\n…Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.\nOur survey expands the scope to synthesize findings from both capability and alignment\nevaluations of LLMs. By complementing these previous surveys through an integrated\nperspective and expanded scope, our work provides a comprehensive overview of the current\nstate of LLM evaluation research. The distinctions between our survey and these two related\nworks further highlight the novel contributions of our study to the literature.\n2 Taxonomy and Roadmap\nThe primary objective of this survey is to meticulously categorize the evaluation of LLMs,\nfurnishing readers with a well-structured taxonomy framework. Through this framework,\nreaders can gain a nuanced understanding of LLMs’ performance and the attendant challenges\nacross diverse and pivotal domains.\nNumerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and\nreasoning, serving as the underpinning for their exceptional performance across a myriad of\ntasks. Nonetheless, the effective application of these capabilities necessitates a meticulous\nexamination of alignment concerns to ensure that the model’s outputs remain consistent with\nuser expectations. Moreover, the vulnerability of LLMs to malicious exploits or inadvertent\nmisuse underscores the imperative nature of safety considerations. Once alignment and safety\nconcerns have been addressed, LLMs can be judiciously deployed within specialized domains,\ncatalyzing task automation and facilitating intelligent decision-making. Thus, our overarching\n6', 'This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58'] | 0.375000 | 1. The retrieved context does match the subject matter of the user's query. The query asks about how a survey categorizes the evaluation of Large Language Models (LLMs) and the three major categories mentioned. The provided context discusses the categorization of LLM evaluation in the survey, mentioning aspects such as knowledge and capability evaluation, alignment evaluation, and safety evaluation, as well as potential applications across various domains. 2. However, the context does not provide a full answer to the user's query. While it does discuss the categorization of LLM evaluation, it does not explicitly state the three major categories. The context mentions several aspects of LLM evaluation, but it is unclear which of these are considered the three major categories. [RESULT] 1.5 |
And of course, you can apply whatever filter you like. For example, suppose you want to look at the examples that yielded less-than-perfect results (i.e., a score below 1).
cond = deep_dfs["context_relevancy"]["scores"] < 1
displayify_df(deep_dfs["context_relevancy"][cond].head(5))
| | rag | query | answer | contexts | scores | feedbacks |
|---|---|---|---|---|---|---|
| 1 | baseline | How does the survey categorize the evaluation of Large Language Models (LLMs), and what are the three major categories mentioned? | None | ['Question \nAnsweringTool \nLearning\nReasoning\nKnowledge \nCompletionEthics \nand \nMorality Bias\nToxicity\nTruthfulnessRobustnessEvaluation\nRisk \nEvaluation\nBiology and \nMedicine\nEducationLegislationComputer \nScienceFinance\nBenchmarks for\nHolistic Evaluation\nBenchmarks \nforKnowledge and Reasoning\nBenchmarks \nforNLU and NLGKnowledge and Capability\nLarge Language \nModel EvaluationAlignment Evaluation\nSafety\nSpecialized LLMs\nEvaluation Organization\n…Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.\nOur survey expands the scope to synthesize findings from both capability and alignment\nevaluations of LLMs. By complementing these previous surveys through an integrated\nperspective and expanded scope, our work provides a comprehensive overview of the current\nstate of LLM evaluation research. The distinctions between our survey and these two related\nworks further highlight the novel contributions of our study to the literature.\n2 Taxonomy and Roadmap\nThe primary objective of this survey is to meticulously categorize the evaluation of LLMs,\nfurnishing readers with a well-structured taxonomy framework. Through this framework,\nreaders can gain a nuanced understanding of LLMs’ performance and the attendant challenges\nacross diverse and pivotal domains.\nNumerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and\nreasoning, serving as the underpinning for their exceptional performance across a myriad of\ntasks. Nonetheless, the effective application of these capabilities necessitates a meticulous\nexamination of alignment concerns to ensure that the model’s outputs remain consistent with\nuser expectations. Moreover, the vulnerability of LLMs to malicious exploits or inadvertent\nmisuse underscores the imperative nature of safety considerations. Once alignment and safety\nconcerns have been addressed, LLMs can be judiciously deployed within specialized domains,\ncatalyzing task automation and facilitating intelligent decision-making. Thus, our overarching\n6', 'This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58'] | 0.375000 | 1. The retrieved context does match the subject matter of the user's query. The query asks about how a survey categorizes the evaluation of Large Language Models (LLMs) and the three major categories mentioned. The provided context discusses the categorization of LLM evaluation in the survey, mentioning aspects such as knowledge and capability evaluation, alignment evaluation, and safety evaluation, as well as potential applications across various domains. 2. However, the context does not provide a full answer to the user's query. While it does discuss the categorization of LLM evaluation, it does not explicitly state the three major categories. The context mentions several aspects of LLM evaluation, but it is unclear which of these are considered the three major categories. [RESULT] 1.5 |
| 9 | baseline | How does this survey on LLM evaluation differ from the previous reviews conducted by Chang et al. (2023) and Liu et al. (2023i)? | None | ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58', '(2021)\nBEGIN (Dziri et al., 2022b)\nConsisTest (Lotfi et al., 2022)\nSummarizationXSumFaith (Maynez et al., 2020)\nFactCC (Kryscinski et al., 2020)\nSummEval (Fabbri et al., 2021)\nFRANK (Pagnoni et al., 2021)\nSummaC (Laban et al., 2022)\nWang et al. (2020)\nGoyal & Durrett (2021)\nCao et al. (2022)\nCLIFF (Cao & Wang, 2021)\nAggreFact (Tang et al., 2023a)\nPolyTope (Huang et al., 2020)\nMethodsNLI-based MethodsWelleck et al. (2019)\nLotfi et al. (2022)\nFalke et al. (2019)\nLaban et al. (2022)\nMaynez et al. (2020)\nAharoni et al. (2022)\nUtama et al. (2022)\nRoit et al. (2023)\nQAQG-based MethodsFEQA (Durmus et al., 2020)\nQAGS (Wang et al., 2020)\nQuestEval (Scialom et al., 2021)\nQAFactEval (Fabbri et al., 2022)\nQ2 (Honovich et al., 2021)\nFaithDial (Dziri et al., 2022a)\nDeng et al. (2023b)\nLLMs-based MethodsFIB (Tam et al., 2023)\nFacTool (Chern et al., 2023)\nFActScore (Min et al., 2023)\nSelfCheckGPT (Manakul et al., 2023)\nSAPLMA (Azaria & Mitchell, 2023)\nLin et al. (2022b)\nKadavath et al. (2022)\nFigure 3: Overview of alignment evaluations.\n4 Alignment Evaluation\nAlthough instruction-tuned LLMs exhibit impressive capabilities, these aligned LLMs are\nstill suffering from annotators’ biases, catering to humans, hallucination, etc. To provide a\ncomprehensive view of LLMs’ alignment evaluation, in this section, we discuss those of ethics,\nbias, toxicity, and truthfulness, as illustrated in Figure 3.\n21'] | 0.000000 | 1. The retrieved context does not match the subject matter of the user's query. The query asks for a comparison between the current survey on LLM evaluation and the previous reviews conducted by Chang et al. (2023) and Liu et al. (2023i). However, the context does not mention these previous reviews at all, making any comparison impossible. Therefore, the context does not match the subject matter of the user's query. (0/2) 2. The retrieved context cannot provide a full answer to the user's query on its own. As noted above, the context does not mention the previous reviews by Chang et al. and Liu et al., which are the main focus of the user's query. Therefore, it cannot provide a full answer to the user's query. (0/2) [RESULT] 0.0 |
| 11 | baseline | According to the document, what are the two major concerns that need to be addressed before deploying large language models (LLMs) in specialized domains? | None | ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58', 'objective is to delve into evaluations encompassing these five fundamental domains and their\nrespective subdomains, as illustrated in Figure 1.\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\nCompletion, Reasoning, and Tool Learning. Question-answering and knowledge completion\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\nability of tool learning is spotlighted, showcasing its significance in empowering models to\nadeptly handle and generate domain-specific content.\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\nmance across critical dimensions, encompassing ethical considerations, moral implications,\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\ncan lead to the inadvertent dissemination of false information. As such, an indispensable\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\nsignificance as an essential aspect to evaluate and rectify.\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\nfrom users and the environment, while also shielding against malicious attacks and deception,\nthereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably ad-\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\nprofound security concerns. These include but are not limited to power-seeking behaviors\nand the development of situational awareness, factors that necessitate meticulous evaluation\nto safeguard against unforeseen challenges.\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\nlaw, computer science, and finance. The objective here is to systematically assess their\naptitude and limitations when confronted with domain-specific challenges and intricacies.\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. In light\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\nIn this context, we present an overview of well-established and widely recognized benchmark\n7'] | 0.750000 | The retrieved context matches the subject matter of the user's query. It discusses the concerns that need to be addressed before deploying LLMs in specialized domains. The two major concerns mentioned are alignment evaluation, which includes ethical considerations, moral implications, bias detection, toxicity assessment, and truthfulness evaluation, and safety evaluation, which includes the robustness of LLMs and their evaluation in the context of Artificial General Intelligence (AGI). However, the context does not provide a full answer to the user's query. While it mentions the two major concerns, it does not go into detail about why these concerns need to be addressed before deploying LLMs in specialized domains. The context provides an overview of these concerns but does not explicitly connect them to the deployment of LLMs in specialized domains. [RESULT] 3.0 |
| 12 | baseline | In the “Alignment Evaluation” section, what dimensions are evaluated to mitigate the potential risks associated with large language models (LLMs)? | None | ['This survey systematically elaborates on the core capabilities of LLMs, encompassing critical\naspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and\nsafety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the\nsafe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential\napplications of LLMs across diverse domains, including biology, education, law, computer\nscience, and finance. Most importantly, we provide a range of popular benchmark evaluations\nto assist researchers, developers and practitioners in understanding and evaluating LLMs’\nperformance.\nWe anticipate that this survey would drive the development of LLMs evaluations, offering\nclear guidance to steer the controlled advancement of these models. This will enable LLMs\nto better serve the community and the world, ensuring their applications in various domains\nare safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges\nof LLMs’ development and evaluation.\n58', 'Question \nAnsweringTool \nLearning\nReasoning\nKnowledge \nCompletionEthics \nand \nMorality Bias\nToxicity\nTruthfulnessRobustnessEvaluation\nRisk \nEvaluation\nBiology and \nMedicine\nEducationLegislationComputer \nScienceFinance\nBenchmarks for\nHolistic Evaluation\nBenchmarks \nforKnowledge and Reasoning\nBenchmarks \nforNLU and NLGKnowledge and Capability\nLarge Language \nModel EvaluationAlignment Evaluation\nSafety\nSpecialized LLMs\nEvaluation Organization\n…Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.\nOur survey expands the scope to synthesize findings from both capability and alignment\nevaluations of LLMs. By complementing these previous surveys through an integrated\nperspective and expanded scope, our work provides a comprehensive overview of the current\nstate of LLM evaluation research. The distinctions between our survey and these two related\nworks further highlight the novel contributions of our study to the literature.\n2 Taxonomy and Roadmap\nThe primary objective of this survey is to meticulously categorize the evaluation of LLMs,\nfurnishing readers with a well-structured taxonomy framework. Through this framework,\nreaders can gain a nuanced understanding of LLMs’ performance and the attendant challenges\nacross diverse and pivotal domains.\nNumerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and\nreasoning, serving as the underpinning for their exceptional performance across a myriad of\ntasks. Nonetheless, the effective application of these capabilities necessitates a meticulous\nexamination of alignment concerns to ensure that the model’s outputs remain consistent with\nuser expectations. Moreover, the vulnerability of LLMs to malicious exploits or inadvertent\nmisuse underscores the imperative nature of safety considerations. Once alignment and safety\nconcerns have been addressed, LLMs can be judiciously deployed within specialized domains,\ncatalyzing task automation and facilitating intelligent decision-making. Thus, our overarching\n6'] | 0.750000 | 1. The retrieved context does match the subject matter of the user's query. The query is about the dimensions evaluated in the “Alignment Evaluation” section to mitigate the potential risks associated with large language models (LLMs). The context discusses the evaluation of LLMs, including alignment evaluation and safety evaluation. It mentions aspects such as knowledge and capability, ethical concerns, biases, toxicity, and truthfulness. These are some of the dimensions that could be evaluated in an alignment evaluation to mitigate the potential risks associated with LLMs. Therefore, the context is relevant to the query. (2/2) 2. However, the retrieved context does not provide a full answer to the user's query. While it mentions some of the dimensions that could be evaluated in an alignment evaluation (such as knowledge and capability, ethical concerns, biases, toxicity, and truthfulness), it does not explicitly state that these are the dimensions evaluated to mitigate the potential risks associated with LLMs. The context does not provide a comprehensive list of the dimensions, nor does it explain how these dimensions help mitigate the risks. Therefore, the context alone cannot be used to provide a full answer to the user's query. (1/2) [RESULT] 3.0 |
| 14 | baseline | What is the purpose of evaluating the knowledge and capability of large language models (LLMs)? | None | ['objective is to delve into evaluations encompassing these five fundamental domains and their\nrespective subdomains, as illustrated in Figure 1.\nSection 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive\nassessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This\nsection is meticulously divided into four distinct subsections: Question-Answering, Knowledge\nCompletion, Reasoning, and Tool Learning. Question-answering and knowledge completion\ntasks stand as quintessential assessments for gauging the practical application of knowledge,\nwhile the various reasoning tasks serve as a litmus test for probing the meta-reasoning and\nintricate reasoning competencies of LLMs. Furthermore, the recently emphasized special\nability of tool learning is spotlighted, showcasing its significance in empowering models to\nadeptly handle and generate domain-specific content.\nSection 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ perfor-\nmance across critical dimensions, encompassing ethical considerations, moral implications,\nbias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to\nscrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias,\nand toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content.\nFurthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which\ncan lead to the inadvertent dissemination of false information. As such, an indispensable\nfacet of this evaluation involves the rigorous assessment of truthfulness, underscoring its\nsignificance as an essential aspect to evaluate and rectify.\nSection 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two funda-\nmental dimensions: the robustness of LLMs and their evaluation in the context of Artificial\nGeneral Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their\nrobustness becomes paramount. Robustness equips them to navigate disturbances stemming\nfrom users and the environment, while also shielding against malicious attacks and deception,\nthereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably ad-\nvance toward human-level capabilities, the evaluation expands its purview to encompass more\nprofound security concerns. These include but are not limited to power-seeking behaviors\nand the development of situational awareness, factors that necessitate meticulous evaluation\nto safeguard against unforeseen challenges.\nSection 6, titled “Specialized LLMs Evaluation”, serves as an extension of LLMs evaluation\nparadigm into diverse specialized domains. Within this section, we turn our attention to the\nevaluation of LLMs specifically tailored for application in distinct domains. Our selection\nencompasses currently prominent specialized LLMs spanning fields such as biology, education,\nlaw, computer science, and finance. The objective here is to systematically assess their\naptitude and limitations when confronted with domain-specific challenges and intricacies.\nSection 7, denominated “Evaluation Organization”, serves as a comprehensive introduction\nto the prevalent benchmarks and methodologies employed in the evaluation of LLMs. In light\nof the rapid proliferation of LLMs, users are confronted with the challenge of identifying the\nmost apt models to meet their specific requirements while minimizing the scope of evaluations.\nIn this context, we present an overview of well-established and widely recognized benchmark\n7', 'evaluations. This serves the purpose of aiding users in making judicious and well-informed\ndecisions when selecting an appropriate LLM for their particular needs.\nPleasebeawarethatourtaxonomyframeworkdoesnotpurporttocomprehensivelyencompass\nthe entirety of the evaluation landscape. In essence, our aim is to address the following\nfundamental questions:\n•What are the capabilities of LLMs?\n•What factors must be taken into account when deploying LLMs?\n•In which domains can LLMs find practical applications?\n•How do LLMs perform in these diverse domains?\nWe will now embark on an in-depth exploration of each category within the LLM evaluation\ntaxonomy, sequentially addressing capabilities, concerns, applications, and performance.\n3 Knowledge and Capability Evaluation\nEvaluating the knowledge and capability of LLMs has become an important research area as\nthese models grow in scale and capability. As LLMs are deployed in more applications, it is\ncrucial to rigorously assess their strengths and limitations across a diverse range of tasks and\ndatasets. In this section, we aim to offer a comprehensive overview of the evaluation methods\nand benchmarks pertinent to LLMs, spanning various capabilities such as question answering,\nknowledge completion, reasoning, and tool use. Our objective is to provide an exhaustive\nsynthesis of the current advancements in the systematic evaluation and benchmarking of\nLLMs’ knowledge and capabilities, as illustrated in Figure 2.\n3.1 Question Answering\nQuestionansweringisaveryimportantmeansforLLMsevaluation, andthequestionanswering\nability of LLMs directly determines whether the final output can meet the expectation. At\nthe same time, however, since any form of LLMs evaluation can be regarded as question\nanswering or transfer to question answering form, there are rare datasets and works that\npurely evaluate question answering ability of LLMs. Most of the datasets are curated to\nevaluate other capabilities of LLMs.\nTherefore, we believe that the datasets simply used to evaluate the question answering ability\nof LLMs must be from a wide range of sources, preferably covering all fields rather than\naiming at some fields, and the questions do not need to be very professional but general.\nAccording to the above criteria for datasets focusing on question answering capability, we can\nfind that many datasets are qualified, e.g., SQuAD (Rajpurkar et al., 2016), NarrativeQA\n(Kociský et al., 2018), HotpotQA (Yang et al., 2018), CoQA (Reddy et al., 2019). Although\nthese datasets predate LLMs, they can still be used to evaluate the question answering ability\nof LLMs. Kwiatkowski et al. (2019) present the Natural Questions corpus. The questions\n8'] | 0.750000 | The retrieved context is relevant to the user's query, as it discusses the purpose of evaluating the knowledge and capability of large language models (LLMs). It explains that the evaluation is important for assessing their strengths and limitations across a variety of tasks and datasets. The context also mentions different aspects of evaluating LLMs, such as question answering, knowledge completion, reasoning, and tool use. However, the context does not fully answer the user's query. While it provides a general idea of why LLMs are evaluated, it does not delve into the specific purposes of these evaluations. For example, it does not explain how the evaluations can help improve the performance of LLMs, or how they can be used to identify areas where LLMs may need further development or training. [RESULT] 3.0 |