%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-llms-mistralai
%pip install llama-index-llms-huggingface-api
# NOTE: this notebook makes several API calls to generate text with OpenAI GPT
# models as well as models hosted on HuggingFace. If you prefer not to wait for
# these generations, then the data for this notebook can be obtained with the
# `wget` command provided below.
# !wget "https://www.dropbox.com/scl/fo/m7skpjdbpb0g3p76y6epe/h?rlkey=omh2ysgh9qqqztf81qvjlivu2&dl=1" -O pairwise.zip
import nest_asyncio
nest_asyncio.apply()
import os
# we will be using models on HuggingFace as our LLM answer generators
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")
# we will use GPT-4 and GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
import pandas as pd
# define jupyter display function
def display_eval_df(question, source, answer_a, answer_b, result) -> None:
"""Pretty print question/answer + gpt-4 judgement dataset."""
eval_df = pd.DataFrame(
{
"Question": question,
"Source": source,
"Model A": answer_a["model"],
"Answer A": answer_a["text"],
"Model B": answer_b["model"],
"Answer B": answer_b["text"],
"Score": result.score,
"Judgement": result.feedback,
},
index=[0],
)
eval_df = eval_df.style.set_properties(
**{
"inline-size": "300px",
"overflow-wrap": "break-word",
},
subset=["Answer A", "Answer B"]
)
display(eval_df)
Step 1: Generate the datasets: train_dataset and test_dataset¶
For the dataset against which we will generate questions and have various LLMs answer, we will use the WikipediaReader to read the "History of {city}" pages for several cities. We split the cities into two lists: one to be used for the train_dataset and the other for the test_dataset.
!pip install wikipedia -q
# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader
train_cities = [
"San Francisco",
"Toronto",
"New York City",
"Vancouver",
"Montreal",
"Boston",
]
test_cities = [
"Tokyo",
"Singapore",
"Paris",
]
train_documents = WikipediaReader().load_data(
pages=[f"History of {x}" for x in train_cities]
)
test_documents = WikipediaReader().load_data(
pages=[f"History of {x}" for x in test_cities]
)
Build the train_dataset and test_dataset with a DatasetGenerator¶
Now that we have our train and test sets of Documents, the next step is to generate the questions. For this, we will use the DatasetGenerator, which uses an LLM to generate questions from a given set of documents.
Generate questions¶
QUESTION_GEN_PROMPT = (
"You are a Teacher/ Professor. Your task is to setup "
"a quiz/examination. Using the provided context, formulate "
"a single question that captures an important fact from the "
"context. Restrict the question to the context information provided."
)
With all of that out of the way, let's spring into action and generate questions against the chunks of the Wikipedia documents we loaded above.
# generate questions against chunks
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
# instantiate DatasetGenerator's for train and test
train_dataset_generator = DatasetGenerator.from_documents(
train_documents,
question_gen_query=QUESTION_GEN_PROMPT,
llm=llm,
show_progress=True,
num_questions_per_chunk=25,
)
test_dataset_generator = DatasetGenerator.from_documents(
test_documents,
question_gen_query=QUESTION_GEN_PROMPT,
llm=llm,
show_progress=True,
num_questions_per_chunk=25,
)
# use DatasetGenerator to create questions from nodes
train_questions = train_dataset_generator.generate_questions_from_nodes(
num=200
)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:02<00:00, 36.34it/s]
test_questions = test_dataset_generator.generate_questions_from_nodes(num=150)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:02<00:00, 29.98it/s]
len(train_questions), len(test_questions)
(75, 64)
# let's take a look at a few of these
train_questions[:3]
['What event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?', 'What was the name of the first significant homestead established outside the immediate vicinity of Mission Dolores in San Francisco?', "What event in 1855 led to the establishment of San Francisco's first county hospital and the development of California's system of county hospitals for the poor?"]
test_questions[:3]
['Question: What was the name of the oldest Buddhist temple in Tokyo, founded in 628?', 'What event marked the end of the samurai system and feudal class divisions in Tokyo?', 'Question: What role did the Tokyo Imperial University play in the Meiji Era?']
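Notice that some of the generated test questions carry a stray "Question: " prefix (visible above). If you'd like, you can strip it with an optional one-liner (our addition, requires Python 3.9+; the recorded outputs below were produced with the prefixes left in):
# optional cleanup (our addition): strip the stray "Question: " prefix
cleaned_test_questions = [q.removeprefix("Question: ") for q in test_questions]
print(cleaned_test_questions[0])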
Generate answers to the questions¶
The next step is to generate answers using LLMs. As a reminder, the whole point here is to judge these generated answers, and later on we will use GPT models to do exactly that. For generating the answers themselves, we will use two other LLMs, namely Llama-2 and Mistral. To do this, we first create a vector store over our documents and an associated retriever, which both LLM answer-generators will make use of.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
# Create vector index
train_index = VectorStoreIndex.from_documents(documents=train_documents)
# Create the retriever on this index
train_retriever = VectorIndexRetriever(
index=train_index,
similarity_top_k=2,
)
# Create vector index for test to be used later
test_index = VectorStoreIndex.from_documents(documents=test_documents)
# Create the retriever for test to be used later
test_retriever = VectorIndexRetriever(
index=test_index,
similarity_top_k=2,
)
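As a quick sanity check (illustrative; the query string here is our own), we can confirm that the retriever returns the top-2 chunks along with similarity scores:
# illustrative spot-check of the retriever (not part of the original flow)
nodes = train_retriever.retrieve("What happened in San Francisco in 1906?")
print(len(nodes), nodes[0].score)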
From here, we will build the RetrieverQueryEngines that will take in our queries (i.e. questions) for processing. Note that we use the HuggingFaceInferenceAPI for our LLM answer-generators, and that Llama-2 requires permissions. If you haven't yet been granted access to this model, feel free to swap Llama-2 out for another model of your choosing; an illustrative substitution is sketched below.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
def create_query_engine(
hf_name: str, retriever: VectorIndexRetriever, hf_llm_generators: dict
) -> RetrieverQueryEngine:
"""Create a RetrieverQueryEngine using the HuggingFaceInferenceAPI LLM"""
if hf_name not in hf_llm_generators:
raise KeyError("model not listed in hf_llm_generators")
llm = HuggingFaceInferenceAPI(
model_name=hf_llm_generators[hf_name],
context_window=2048, # to use refine
token=HUGGING_FACE_TOKEN,
)
return RetrieverQueryEngine.from_args(retriever=retriever, llm=llm)
# define our llm-generators (query_engines)
hf_llm_generators = {
"mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.1",
"llama2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
}
train_query_engines = {
mdl: create_query_engine(mdl, train_retriever, hf_llm_generators)
for mdl in hf_llm_generators.keys()
}
test_query_engines = {
mdl: create_query_engine(mdl, test_retriever, hf_llm_generators)
for mdl in hf_llm_generators.keys()
}
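If you don't have access to the gated Llama-2 weights, any other chat model served via the HuggingFace Inference API can be dropped in before creating the query engines above. A hypothetical substitution (the model ID here is just an example):
# hypothetical substitution for the gated Llama-2 entry:
# hf_llm_generators["llama2-7b-chat"] = "HuggingFaceH4/zephyr-7b-beta"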
We're now ready to generate answers from the various LLMs. We'll do this now for the train_dataset and hold off on doing the same for the test_dataset until the time comes that we need it.
NOTE: generating these answers will take some time. If you'd rather not wait, you have the option of loading the train_qa.jsonl file, which contains the Llama-2 and Mistral answers to each question; a minimal loading sketch follows.
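A minimal sketch of that shortcut, assuming train_qa.jsonl (one JSON object per line) sits in the working directory, e.g. extracted from pairwise.zip:
import json

# load the pre-generated Llama-2/Mistral answers instead of generating them
with open("train_qa.jsonl") as f:
    train_dataset = [json.loads(line) for line in f]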
import tqdm
import random
train_dataset = []
for q in tqdm.tqdm(train_questions):
# randomly select two LLMs to generate answers to this q
model_versus = random.sample(list(train_query_engines.items()), 2)
# data for this q
data_entry = {"question": q}
responses = []
source = None
# generate answers
for name, engine in model_versus:
response = engine.query(q)
response_struct = {}
response_struct["model"] = name
response_struct["text"] = str(response)
if source is not None:
assert source == response.source_nodes[0].node.text[:1000] + "..."
else:
source = response.source_nodes[0].node.text[:1000] + "..."
responses.append(response_struct)
data_entry["answers"] = responses
data_entry["source"] = source
train_dataset.append(data_entry)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [07:40<00:00, 6.14s/it]
Get GPT-4 evaluations of the Mistral and Llama-2 answers¶
As mentioned, the point of this guide is to fine-tune an LLM judge against a GPT-4 judge. So, to complete our train_dataset, we now need to instantiate the GPT-4 judge and have it evaluate the answers provided by the other LLMs: Llama-2 and Mistral. For this, we use the PairwiseComparisonEvaluator class. This evaluator compares the two answers and rules on whether Llama-2's answer is better, Mistral's answer is better, or whether it's a tie (roughly speaking, a score of 1.0 favours the first answer presented, 0.0 the second, and 0.5 a tie).
There is some added nuance here, since with pairwise evaluations we have to be mindful of potential "position bias": the judge favouring the answer presented to it first in the prompt/context. To account for this, we have the GPT-4 judge evaluate each sample twice, flipping the order in which the two answers are presented in the second evaluation (first evaluation: Llama-2 then Mistral; second evaluation: Mistral then Llama-2).
Finally, we also use the OpenAIFineTuningHandler, which collects all of the chat histories that we will eventually need for fine-tuning GPT-3.5.
NOTE: generating these judgements will take some time. Again, you have the option of loading train_qa.jsonl as the train_dataset. We have also stored the JSONL files that we passed to OpenAI for fine-tuning GPT-3.5.
# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core import Settings
# NOTE: this finetuning_handler will collect 2x chat_histories for
# each query: one for original, and another for flipped
main_finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([main_finetuning_handler])
Settings.callback_manager = callback_manager
llm_4 = OpenAI(temperature=0, model="gpt-4", callback_manager=callback_manager)
gpt4_judge = PairwiseComparisonEvaluator(llm=llm_4)
for data_entry in tqdm.tqdm(train_dataset):
final_eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] = [judgement]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [48:04<00:00, 38.46s/it]
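Before looking at an individual judgement, here's a quick tally (illustrative, our addition) of how the paired evaluations resolved. A pairwise_source of "original" or "flipped" means the two presentation orders agreed; anything else is inconclusive:
# tally how the GPT-4 judgements resolved across the two presentation orders
from collections import Counter

print(Counter(d["evaluations"][0]["source"] for d in train_dataset))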
Let's see what one of these GPT-4 evaluations looks like.
# let's see the last one
display_eval_df(
question=data_entry["question"],
source=data_entry["source"],
answer_a=data_entry["answers"][0],
answer_b=data_entry["answers"][1],
result=final_eval_result,
)
| | Question | Source | Model A | Answer A | Model B | Answer B | Score | Judgement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What resources are available in Boston for researching and preserving historical artifacts and landmarks? | In 1963, Boston mayor John F. Collins and Edward J. Logue, executive of the Boston Redevelopment Authority (BRA), organized the city's savings banks, cooperative banks, and federal and state savings and loan associations, known as the Boston Banks Urban Renewal Group (B-BURG), which would reverse-redline portions of Dorchester, Roxbury, and Mattapan along Blue Hill Avenue. Despite legislation passed in 1950 by the 156th Massachusetts legislature banning racial discrimination or segregation in housing, and Executive Order 11063 issued by President John F. Kennedy in 1962 requiring all federal agencies to prevent racial discrimination in all federally funded subsidized housing in the United States, the Boston Housing Authority (BHA) board also actively segregated the city's public housing developments during the Collins administration, with bureaucratic resistance to integration from BHA departments lasting until at least 1966, and the board retained co... | llama2-7b-chat | Boston has a variety of resources available for researching and preserving historical artifacts and landmarks. These include: 1. The Boston Historical Society: this organization offers historical artifacts, photographs, and documents related to Boston's history. 2. Boston Mapjunction: this website provides over 200 historical maps of Boston dating back to 1630, along with aerial photographs compared against today's maps. 3. The City of Boston Archaeology Program and Laboratory: the City of Boston has a City Archaeologist who oversees the historical artifacts and significance of any land slated for development and manages archaeological sites located on Boston's public lands. The program also has an archaeology lab and an education and curation center. 4. The Freedom House photo collection: this collection contains over 2,000 images of the people, places, and events of Roxbury between 1950 and 1975. 5. Boston's vital records: this resource provides historical records relating to births, marriages, and deaths in Boston. 6. Reading and everyday life in | mistral-7b-instruct | The City of Boston has a City Archaeologist who oversees the historical artifacts and significance of any land slated for development and manages archaeological sites located on Boston's public lands, together with a City Archaeology Program and an archaeology laboratory, education, and curation center. The Freedom House photo collection contains over 2,000 images of the people, places, and events of Roxbury between 1950 and 1975 (located in the Archives and Special Collections at the Northeastern University Library in Boston, Massachusetts). | 1.000000 | Assistant A provides a more comprehensive answer, listing multiple resources available in Boston for researching and preserving historical artifacts and landmarks. These include the Boston Historical Society, Boston Mapjunction, the City of Boston Archaeology Program and Laboratory, the Freedom House photo collection, and Boston's vital records. This answer is more detailed and gives the user a broader range of resources to explore. Assistant B, on the other hand, mentions only the City of Boston Archaeology Program and Laboratory and the Freedom House photo collection. While these are relevant resources, the answer lacks the depth and variety of Assistant A's response. Therefore, based on the depth, variety, and level of detail of the responses, Assistant A's answer is superior. Final Verdict: [[A]] |
Special care for the fine-tuning JSONL¶
Since there are two evaluations per sample (one for the original order of presentation of the LLM answers and one for the flipped order), we need to be careful to keep only the correct one in our fine-tuning dataset. That means picking out the right events collected by the OpenAIFineTuningHandler and using only those to prepare the JSONL file that we will pass to OpenAI's fine-tuning API.
main_finetuning_handler.save_finetuning_events(
"pairwise_finetuning_events.jsonl"
)
Wrote 150 examples to pairwise_finetuning_events.jsonl
import json
# Get the fine_tuning_examples master dataset
with open("pairwise_finetuning_events.jsonl") as f:
combined_finetuning_events = [json.loads(line) for line in f]
finetuning_events = (
[]
) # for storing events using original order of presentation
flipped_finetuning_events = (
[]
) # for storing events using flipped order of presentation
for ix, event in enumerate(combined_finetuning_events):
if ix % 2 == 0: # we always do original ordering first
finetuning_events += [event]
else: # then we flip order and have GPT-4 make another judgement
flipped_finetuning_events += [event]
assert len(finetuning_events) == len(flipped_finetuning_events)
# we need to pick which of the chat_histories to keep
resolved_finetuning_events = []
for ix, data_entry in enumerate(train_dataset):
if data_entry["evaluations"][0]["source"] == "original":
resolved_finetuning_events += [finetuning_events[ix]]
elif data_entry["evaluations"][0]["source"] == "flipped":
resolved_finetuning_events += [flipped_finetuning_events[ix]]
else:
continue
with open("resolved_pairwise_finetuning_events.jsonl", "w") as outfile:
for entry in resolved_finetuning_events:
print(json.dumps(entry), file=outfile)
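A quick check (illustrative, our addition) that the resolution step kept exactly one event per conclusive judgement:
# each judgement that resolved to "original" or "flipped" contributes one event
print(len(resolved_finetuning_events), "of", len(train_dataset), "events kept")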
Step 2: Perform knowledge distillation¶
Okay, it's now time to distill some knowledge from GPT-4 down to GPT-3.5. To do this, we will use the OpenAIFinetuneEngine class along with the resolved_pairwise_finetuning_events.jsonl file that we just created.
from llama_index.finetuning import OpenAIFinetuneEngine
finetune_engine = OpenAIFinetuneEngine(
"gpt-3.5-turbo",
"resolved_pairwise_finetuning_events.jsonl",
)
finetune_engine.finetune()
Num examples: 72
First example:
{'role': 'system', 'content': "Please act as an impartial judge and evaluate the quality of the responses provided by two AI question-answering assistants to the user question perhaps with added reference which are displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better using the provided context. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: '[[A]]' if assistant A is better, '[[B]]' if assistant B is better, and '[[C]]' for a tie.\n"}
{'role': 'user', 'content': "[User Question]\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\n\n[The Start of Reference]\n=== Reconstruction ===\nAlmost immediately after the quake re-planning and reconstruction plans were hatched to quickly rebuild the city. One of the more famous and ambitious plans, proposed before the fire, came from famed urban planner, Daniel Burnham. His bold plan called for Haussmann style avenues, boulevards, and arterial thoroughfares that radiated across the city, a massive civic center complex with classical structures, what would have been the largest urban park in the world, stretching from Twin Peaks to Lake Merced with a large athenaeum at its peak, and various other proposals. This plan was dismissed by critics (both at the time and now), as impractical and unrealistic to municipal supply and demand. Property owners and the Real Estate industry were against the idea as well due to the amounts of their land the city would have to purchase to realize such proposals. While the original street grid was restored, many of Burnham's proposals eventually saw the light of day such ...\n[The End of Reference]\n\n[The Start of Assistant A’s Answer]\n\n\nThe 1906 earthquake and fire caused significant damage to San Francisco, with over 80% of the city destroyed, including almost all of the downtown core. The earthquake, which had a magnitude of 7.8 on the Richter scale, resulted from the rupture of over 270 miles of the San Andreas Fault, centered immediately offshore of San Francisco. Water mains ruptured throughout the city, and the fires that followed burned out of control for days, trapping many residents and causing widespread destruction. Despite the devastation, the city quickly rebuilt, with many of the buildings constructed after the disaster being more earthquake-resistant than those that had come before. The official death toll from this event was initially reported as 478, but was later revised to over 3,000.\n[The End of Assistant A’s Answer]\n\n[The Start of Assistant B’s Answer]\n1906 earthquake and fire.\n[The End of Assistant B’s Answer]"}
{'role': 'assistant', 'content': "Assistant A provides a detailed and comprehensive response to the user's question, explaining the event that caused significant damage to San Francisco in 1906, which was the earthquake and fire. It also provides additional information about the earthquake's magnitude, the extent of the damage, and the city's quick rebuilding efforts. On the other hand, Assistant B's response is very brief and only mentions the earthquake and fire, without providing any additional information or context. Therefore, Assistant A's response is more helpful, relevant, accurate, and detailed. \n\nFinal Verdict: [[A]]"}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 579, 1198
mean / median: 818.9305555555555, 772.0
p5 / p95: 625.9, 1076.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 66, 248
mean / median: 129.26388888888889, 117.5
p5 / p95: 81.0, 193.9

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~58963 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~176889 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens. This means your total cost for training will be $0.471704 per epoch.
# We can check the status of our current job as follows
# This may take some time ...
finetune_engine.get_current_job()
<FineTuningJob fine_tuning.job id=ftjob-jLxZggQbHz2F98IlhQEI9KIw at 0x2e6b91170> JSON: { "object": "fine_tuning.job", "id": "ftjob-jLxZggQbHz2F98IlhQEI9KIw", "model": "gpt-3.5-turbo-0613", "created_at": 1698817329, "finished_at": 1698817949, "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::8FyRSSOl", "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz", "result_files": [ "file-qLTnxGSZX2rHP0Q7wJIDDNWX" ], "status": "succeeded", "validation_file": null, "training_file": "file-xsAaOBjQ949ti0qk1xHHLOiF", "hyperparameters": { "n_epochs": 3 }, "trained_tokens": 176457, "error": null }
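If you'd rather block until the job finishes instead of re-running the cell, here is a simple polling sketch (our addition; the terminal status strings come from OpenAI's fine-tuning job object, as seen in the output above):
import time

# poll the OpenAI fine-tuning job until it reaches a terminal status
while finetune_engine.get_current_job().status not in (
    "succeeded",
    "failed",
    "cancelled",
):
    time.sleep(60)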
Step 3: Evaluate the fine-tuned GPT-3.5 judge on the test dataset¶
Now that we have our fine-tuned GPT-3.5, let's see how well it performs on a test set. But first, remember how we said we'd hold off on creating the test_dataset until the time came that we needed it? Well, that time is now. So here we will repeat the process used to create the train_dataset, but now for the test_dataset.
NOTE: generating these answers and evaluations will take some time. You have the option of loading test_qa_complete.jsonl, which contains the evaluations from all three LLM judges. You can load it as the test_dataset (a minimal loading sketch follows) and run the code in the Metrics subsection below.
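As with the training data, a minimal loading sketch for that shortcut, assuming test_qa_complete.jsonl (one JSON object per line) is in the working directory:
import json

# load the stored test entries, complete with all three judges' evaluations
with open("test_qa_complete.jsonl") as f:
    test_dataset = [json.loads(line) for line in f]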
import random
# Use Llama-2 and Mistral LLMs to generate the answers to the test queries
test_dataset = []
for q in tqdm.tqdm(test_questions):
# randomly select two LLMs to generate answers to this q
model_versus = random.sample(list(test_query_engines.items()), 2)
# data for this q
data_entry = {"question": q}
responses = []
source = None
# generate answers
for name, engine in model_versus:
response = engine.query(q)
response_struct = {}
response_struct["model"] = name
response_struct["text"] = str(response)
if source is not None:
assert source == response.source_nodes[0].node.text[:1000] + "..."
else:
source = response.source_nodes[0].node.text[:1000] + "..."
responses.append(response_struct)
data_entry["answers"] = responses
data_entry["source"] = source
test_dataset.append(data_entry)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [28:23<00:00, 26.62s/it]
# get the gpt-4 judgements on the Mistral and Llama-2 answers
for data_entry in tqdm.tqdm(test_dataset):
final_eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] = [judgement]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [43:21<00:00, 40.66s/it]
from llama_index.core.evaluation import EvaluationResult
# use our fine-tuned GPT-3.5 to evaluate the answers
ft_llm = finetune_engine.get_finetuned_model()
ft_gpt_3p5_judge = PairwiseComparisonEvaluator(llm=ft_llm)
for data_entry in tqdm.tqdm(test_dataset):
try:
final_eval_result = await ft_gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
    except Exception:
final_eval_result = EvaluationResult(
query=data_entry["question"],
response="",
passing=None,
score=0.5,
feedback="",
pairwise_source="output-cannot-be-parsed",
)
# save final result
judgement = {}
judgement["llm"] = "ft_gpt_3p5"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] += [judgement]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [04:08<00:00, 3.88s/it]
# Similarly, use a non-fine-tuned judge to evaluate the answers
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")
gpt_3p5_judge = PairwiseComparisonEvaluator(llm=gpt_3p5_llm)
for data_entry in tqdm.tqdm(test_dataset):
try:
final_eval_result = await gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
    except Exception:
final_eval_result = EvaluationResult(
query=data_entry["question"],
response="",
passing=None,
score=0.5,
feedback="",
pairwise_source="output-cannot-be-parsed",
)
# save final result
judgement = {}
judgement["llm"] = "gpt_3p5"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] += [judgement]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [09:32<00:00, 8.95s/it]
Metrics¶
Phew! Now that we have generated all of the LLM judges' evaluations of the Llama-2/Mistral answers on the test queries, let's get a quantitative view on how close the fine-tuned GPT-3.5 judge is to GPT-4.
To do this, we report a few metrics:
- Agreement rate with the GPT-4 evaluations
- Correlation to the GPT-4 evaluations
- Jaccard similarity to the GPT-4 evaluations
We also report the "inconclusive" counts, i.e. the number of times an LLM judge reverses its decision after seeing the Llama-2 and Mistral answers in the flipped order of presentation. Higher inconclusive counts indicate that the judge is susceptible to position bias, which is bad!
!pip install scikit-learn -q
import numpy as np
# store the scores and inconclusive booleans for each sample per LLM judge
scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
inconclusives = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
for e in d["evaluations"]:
scores[e["llm"]].append(e["score"])
inconclusives[e["llm"]].append(
e["source"] not in ["original", "flipped"]
)
REPORT_FMT_STR = (
"{model}\n"
"-----------------\n"
"Number of inconclusives: {inconclusive}\n"
"Number of agreements with GPT-4: {agreement} out of {total}\n"
"Agreement rate: {agreement_rate}\n"
"Correlation: {corr}\n"
"Jaccard: {jacc}\n\n"
)
from sklearn.metrics import jaccard_score
# numpy conversion
np_scores_gpt_4 = np.array(scores["gpt_4"])
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])
# can only compare when both judges have non inconclusive results
ft_mask = ~np.array(inconclusives["gpt_4"]) * ~np.array(
inconclusives["ft_gpt_3p5"]
)
no_ft_mask = ~np.array(inconclusives["gpt_4"]) * ~np.array(
inconclusives["gpt_3p5"]
)
# agreement rates
agreement_ft = sum(np_scores_gpt_4[ft_mask] == np_scores_ft_gpt_3p5[ft_mask])
agreement_rate_ft = agreement_ft / sum(ft_mask)
agreement_no_ft = sum(
np_scores_gpt_4[no_ft_mask] == np_scores_gpt_3p5[no_ft_mask]
)
agreement_rate_no_ft = agreement_no_ft / sum(no_ft_mask)
# correlations
corr_ft = np.corrcoef(np_scores_gpt_4[ft_mask], np_scores_ft_gpt_3p5[ft_mask])[
0, 1
]
corr_no_ft = np.corrcoef(
np_scores_gpt_4[no_ft_mask], np_scores_gpt_3p5[no_ft_mask]
)[0, 1]
# jaccard
jaccard_ft = jaccard_score(
np_scores_gpt_4[ft_mask].astype(str),
np_scores_ft_gpt_3p5[ft_mask].astype(str),
average="weighted",
)
jaccard_no_ft = jaccard_score(
np_scores_gpt_4[no_ft_mask].astype(str),
np_scores_gpt_3p5[no_ft_mask].astype(str),
average="weighted",
)
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/ fine-tuning",
inconclusive=sum(inconclusives["ft_gpt_3p5"]),
agreement=agreement_ft,
total=sum(ft_mask),
agreement_rate=agreement_rate_ft,
corr=corr_ft,
jacc=jaccard_ft,
)
)
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/out fine-tuning",
inconclusive=sum(inconclusives["gpt_3p5"]),
agreement=agreement_no_ft,
total=sum(no_ft_mask),
agreement_rate=agreement_rate_no_ft,
corr=corr_no_ft,
jacc=jaccard_no_ft,
)
)
print(
f"GPT-4\n-----------------\nInconclusive Count: {sum(inconclusives['gpt_4'])}"
)
GPT-3.5 w/ fine-tuning
-----------------
Number of inconclusives: 15
Number of agreements with GPT-4: 41 out of 47
Agreement rate: 0.8723404255319149
Correlation: 0.765365523658036
Jaccard: 0.773126734505088

GPT-3.5 w/out fine-tuning
-----------------
Number of inconclusives: 24
Number of agreements with GPT-4: 32 out of 38
Agreement rate: 0.8421052631578947
Correlation: 0.671929323262293
Jaccard: 0.7308712958867757

GPT-4
-----------------
Inconclusive Count: 4
Conclusion¶
From the numbers above, we see that fine-tuning the GPT-3.5 judge yields a higher agreement rate with, correlation to, and Jaccard similarity to the GPT-4 judge than the non-fine-tuned GPT-3.5 judge achieves. What's more, the inconclusive count also drops after fine-tuning. Overall, fine-tuning has given us a GPT-3.5 judge that is closer to the GPT-4 judge (and thus, by proxy, closer to human judgements), while also helping to remedy the position bias that the non-fine-tuned GPT-3.5 judge would otherwise carry.