This Jupyter notebook is about fine-tuning an LLM judge that evaluates the responses of another LLM to a user query. More specifically, we demonstrate how to use the llama_index library to distill knowledge from a GPT-4 judge to a GPT-3.5 judge. To do so, we will take the following steps:

- Generate datasets: `train` and `test`
- Perform knowledge distillation (using `train`)
- Evaluate the distilled model on `test`

More specifically, we will use `CorrectnessEvaluator` as our LLM judge.
%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-huggingface-api
# NOTE: this notebook makes several API calls to generate text with OpenAI GPT
# models as well as models hosted on HuggingFace. If you prefer not to wait for
# these generations, then the data for this notebook can be obtained with the
# `wget` command provided below.
# !wget "https://www.dropbox.com/scl/fo/3kkm8v6qvhxnu449xwp3d/h?rlkey=fxom1yixru1nags9mmao1hkg2&dl=1" -O correctness.zip
import nest_asyncio
nest_asyncio.apply()
import os
# we will be using models on HuggingFace as our LLM answer generators
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")
# we will use GPT-4 and GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
For the dataset from which we will generate questions (and prompt various LLMs to answer them), we will use the WikipediaReader to read the "History of {city}" page for several cities.
!pip install wikipedia -q
# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader
cities = [
"San Francisco",
"Toronto",
"New York",
"Vancouver",
"Montreal",
"Tokyo",
"Singapore",
"Paris",
]
documents = WikipediaReader().load_data(
pages=[f"History of {x}" for x in cities]
)
Now that we have our set of `Document`s for training and testing, the next step is to generate the questions. For this we will use the `DatasetGenerator`, which uses an LLM to generate questions from a given set of documents.
Generate Questions
QUESTION_GEN_PROMPT = (
"You are a Teacher/Professor. Your task is to set up "
"a quiz/examination. Using the provided context, formulate "
"a single question that captures an important fact from the "
"context. Restrict the question to the context information provided."
)
# generate questions against chunks
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI
# set context for llm provider
gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
# instantiate a DatasetGenerator
dataset_generator = DatasetGenerator.from_documents(
documents,
question_gen_query=QUESTION_GEN_PROMPT,
llm=gpt_35_llm,
num_questions_per_chunk=25,
)
qrd = dataset_generator.generate_dataset_from_nodes(num=350)
# If you want to save it for future use
# qrd.save_json("qrd.json")
The next step is to generate answers using an LLM. As a reminder, the point is to judge these generated answers. So later on, we will use GPT models to judge these answers.

For the generation of the answers to the questions, we will use another LLM, namely: Llama-2. In order to do this, we first create a vector store for our documents, along with an associated retriever, which this LLM answer generator will use.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
# Create vector index
the_index = VectorStoreIndex.from_documents(documents=documents)
# Create the retriever on this index
the_retriever = VectorIndexRetriever(
index=the_index,
similarity_top_k=2,
)
From here we will build the RetrieverQueryEngine, which will take in our queries (i.e., the questions) for processing. Note that we use `HuggingFaceInferenceAPI` for our LLM answer generator, and that Llama-2 requires permissions. If you have not yet been granted access to these models, then feel free to swap out Llama-2 with another model of your choosing.

At this point, we will break off the generated question-answer pairs into two sets: one to build `train_dataset`, and the other to build the `test_dataset` in the next section.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
llm = HuggingFaceInferenceAPI(
model_name="meta-llama/Llama-2-7b-chat-hf",
context_window=2048, # to use refine
token=HUGGING_FACE_TOKEN,
)
query_engine = RetrieverQueryEngine.from_args(retriever=the_retriever, llm=llm)
import tqdm
# we will use 65% of the generated questions for training
train_dataset = []
num_train_questions = int(0.65 * len(qrd.qr_pairs))
for q, a in tqdm.tqdm(qrd.qr_pairs[:num_train_questions]):
# data for this q
data_entry = {"question": q, "reference": a}
response = query_engine.query(q)
response_struct = {}
response_struct["model"] = "llama-2"
response_struct["text"] = str(response)
response_struct["context"] = (
response.source_nodes[0].node.text[:1000] + "..."
)
data_entry["response_data"] = response_struct
train_dataset.append(data_entry)
100%|██████████| 79/79 [08:30<00:00,  6.46s/it]
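For reference, each entry appended to `train_dataset` by the loop above takes the following shape. The values here are illustrative stand-ins, not real outputs:

```python
# Illustrative train_dataset entry (values are stand-ins, not real outputs).
sample_entry = {
    "question": "What event in 1906 caused significant damage to San Francisco?",
    "reference": "The great earthquake and fire in 1906.",
    "response_data": {
        "model": "llama-2",
        "text": "1906 earthquake and fire.",
        # first 1000 characters of the top retrieved source node, then "..."
        "context": "San Francisco was devastated by a great earthquake ...",
    },
}

print(sorted(sample_entry))                   # → ['question', 'reference', 'response_data']
print(sorted(sample_entry["response_data"]))  # → ['context', 'model', 'text']
```

The GPT-4 judge will later append an `evaluations` list to each of these entries.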
As mentioned before, the point of this guide is to fine-tune an LLM judge from a GPT-4 judge. So, to complete our `train_dataset`, we now need to instantiate our GPT-4 judge and have it evaluate the answers provided by Llama-2. To do this, we will use the `CorrectnessEvaluator` class. This judge will compare each answer to a reference answer and return a score between 1 and 5 (higher is better) indicating how well the provided answer matches the reference one.

Note also that we use the `OpenAIFineTuningHandler`, which will collect all of the chat histories that we will eventually need to fine-tune GPT-3.5.
# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import CorrectnessEvaluator
finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
gpt_4_llm = OpenAI(
temperature=0, model="gpt-4", callback_manager=callback_manager
)
gpt4_judge = CorrectnessEvaluator(llm=gpt_4_llm)
import tqdm
# for `training`
for data_entry in tqdm.tqdm(train_dataset):
eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["response_data"]["text"],
context=data_entry["response_data"]["context"],
reference=data_entry["reference"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = eval_result.score
judgement["text"] = eval_result.response
data_entry["evaluations"] = [judgement]
100%|██████████| 79/79 [12:31<00:00,  9.51s/it]
finetuning_handler.save_finetuning_events("correction_finetuning_events.jsonl")
Wrote 79 examples to correction_finetuning_events.jsonl
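Each line of `correction_finetuning_events.jsonl` is a standalone JSON object in OpenAI's chat fine-tuning format: a `messages` list with one system, one user, and one assistant turn. Here is a minimal sketch of such a record and the kind of per-line sanity check you could run over the file; the message contents are abbreviated stand-ins:

```python
import json

# Abbreviated, illustrative record in OpenAI's chat fine-tuning format.
record = {
    "messages": [
        {"role": "system", "content": "You are an expert evaluation system ..."},
        {"role": "user", "content": "## User Query\n...\n## Reference Answer\n...\n## Generated Answer\n..."},
        {"role": "assistant", "content": "4.0\nThe generated answer is relevant and correct ..."},
    ]
}
line = json.dumps(record)  # one JSON object per line of the .jsonl file

# Per-line sanity check: parse and verify the expected role sequence.
roles = [m["role"] for m in json.loads(line)["messages"]]
print(roles)  # → ['system', 'user', 'assistant']
```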
Okay, it's now time to distill the knowledge from GPT-4 to GPT-3.5. To do this, we will make use of the `OpenAIFinetuneEngine` class as well as the `correction_finetuning_events.jsonl` file that we just created.
from llama_index.finetuning import OpenAIFinetuneEngine
finetune_engine = OpenAIFinetuneEngine(
"gpt-3.5-turbo",
"correction_finetuning_events.jsonl",
)
# Launch the fine-tuning job
# This may take some time ...
finetune_engine.finetune()
Num examples: 79
First example:
{'role': 'system', 'content': '\nYou are an expert evaluation system for a question answering chatbot.\n\nYou are given the following information:\n- a user query,\n- a reference answer, and\n- a generated answer.\n\nYour job is to judge the relevance and correctness of the generated answer.\nOutput a single score that represents a holistic evaluation.\nYou must return your response in a line with only the score.\nDo not return answers in any other format.\nOn a separate line provide your reasoning for the score as well.\n\nFollow these guidelines for scoring:\n- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.\n- If the generated answer is not relevant to the user query, you should give a score of 1.\n- If the generated answer is relevant but contains mistakes, you should give a score between 2 and 3.\n- If the generated answer is relevant and fully correct, you should give a score between 4 and 5.\n\nExample Response:\n4.0\nThe generated answer has the exact same metrics as the reference answer, but it is not as concise.\n\n'}
{'role': 'user', 'content': '\n## User Query\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\n\n## Reference Answer\nThe great earthquake and fire in 1906 caused significant damage to San Francisco but was followed by a quick rebuild.\n\n## Generated Answer\n1906 earthquake and fire.\n'}
{'role': 'assistant', 'content': '4.0\nThe generated answer is relevant and correct, but it lacks the detail and context provided in the reference answer.'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0
#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0
#### Distribution of num_total_tokens_per_example:
min / max: 315, 782
mean / median: 479.49367088607596, 465.0
p5 / p95: 355.6, 634.6
#### Distribution of num_assistant_tokens_per_example:
min / max: 19, 110
mean / median: 57.63291139240506, 56.0
p5 / p95: 29.6, 83.2
0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~37880 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~113640 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.
This means your total cost for training will be $0.30304000000000003 per epoch.
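The cost figures printed above follow directly from the token count: ~37,880 training tokens, a default of 3 epochs, and the then-current price of $0.008 per 1K tokens. The arithmetic checks out:

```python
train_tokens = 37880   # tokens in the training dataset (from the validation output)
n_epochs = 3           # OpenAI's default for this dataset
price_per_1k = 0.008   # $/1K tokens for gpt-3.5-turbo fine-tuning (Aug 2023)

charged_tokens = train_tokens * n_epochs
cost_per_epoch = train_tokens / 1000 * price_per_1k
total_cost = charged_tokens / 1000 * price_per_1k

print(charged_tokens)            # → 113640
print(round(cost_per_epoch, 5))  # → 0.30304
print(round(total_cost, 5))      # → 0.90912
```

Note that the script's final line labels the per-epoch figure ($0.30304) as the "total cost"; the three-epoch total comes to roughly $0.91.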
finetune_engine.get_current_job()
<FineTuningJob fine_tuning.job id=ftjob-9y8G7rzbCkzPjsKtPMsfwRSu at 0x1778d6a70> JSON: {
"object": "fine_tuning.job",
"id": "ftjob-9y8G7rzbCkzPjsKtPMsfwRSu",
"model": "gpt-3.5-turbo-0613",
"created_at": 1698851177,
"finished_at": 1698851823,
"fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::8G7FovVj",
"organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
"result_files": [
"file-bx2ObrpVPq7Q2pmv743W1eFQ"
],
"status": "succeeded",
"validation_file": null,
"training_file": "file-xAwZ2NSzbck3p8u24kznzySX",
"hyperparameters": {
"n_epochs": 3
},
"trained_tokens": 113166,
"error": null
}
Now that we have our fine-tuned GPT-3.5, let's see how well it performs on a test set. But first, remember that we said we'd hold off on creating the `test_dataset` until the time came that we needed it? Well, that time is now. So here we will repeat the process used to create the `train_dataset`, but this time for the `test_dataset`.

NOTE: generating these answers and evaluations will take some time.
# Use Llama-2 to generate answers to the test questions
test_dataset = []
for q, a in tqdm.tqdm(qrd.qr_pairs[num_train_questions:]):
# data for this q
data_entry = {"question": q, "reference": a}
response = query_engine.query(q)
response_struct = {}
response_struct["model"] = "llama-2"
response_struct["text"] = str(response)
response_struct["context"] = (
response.source_nodes[0].node.text[:1000] + "..."
)
data_entry["response_data"] = response_struct
test_dataset.append(data_entry)
100%|██████████| 44/44 [05:07<00:00,  6.99s/it]
# get the gpt-4 judgements on the Llama-2 answers
for data_entry in tqdm.tqdm(test_dataset):
eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["response_data"]["text"],
context=data_entry["response_data"]["context"],
reference=data_entry["reference"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = eval_result.score
judgement["text"] = eval_result.response
data_entry["evaluations"] = [judgement]
100%|██████████| 44/44 [06:52<00:00,  9.37s/it]
from llama_index.core.evaluation import EvaluationResult
# use our fine-tuned GPT-3.5 to evaluate the answers
ft_llm = finetune_engine.get_finetuned_model()
ft_gpt_3p5_judge = CorrectnessEvaluator(llm=ft_llm)
for data_entry in tqdm.tqdm(test_dataset):
eval_result = await ft_gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["response_data"]["text"],
context=data_entry["response_data"]["context"],
reference=data_entry["reference"],
)
# save final result
judgement = {}
judgement["llm"] = "ft_gpt_3p5"
judgement["score"] = eval_result.score
judgement["text"] = eval_result.response
data_entry["evaluations"] += [judgement]
100%|██████████| 44/44 [00:44<00:00,  1.02s/it]
# Similarly, use a non-fine-tuned judge to evaluate the answers
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")
gpt_3p5_judge = CorrectnessEvaluator(llm=gpt_3p5_llm)
for data_entry in tqdm.tqdm(test_dataset):
eval_result = await gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["response_data"]["text"],
context=data_entry["response_data"]["context"],
reference=data_entry["reference"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_3p5"
judgement["score"] = eval_result.score
judgement["text"] = eval_result.response
data_entry["evaluations"] += [judgement]
100%|██████████| 44/44 [01:36<00:00,  2.19s/it]
Phew! Now that we have generated all of the LLM judges' evaluations of the Llama-2 answers to the test queries, let's quantify how close the fine-tuned GPT-3.5 judge comes to GPT-4.

To do this, we report the correlation between the scores of the fine-tuned (and, for comparison, the non-fine-tuned) GPT-3.5 judge and those of the GPT-4 judge.
REPORT_FMT_STR = (
"{model}\n"
"-----------------\n"
"Number of obs.: {total_obs}\n"
"Correlation with GPT-4: {corr}\n"
)
import numpy as np
scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
for e in d["evaluations"]:
scores[e["llm"]].append(e["score"])
# numpy conversion
np_scores_gpt_4 = np.array(scores["gpt_4"])
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])
# correlations
corr_ft = np.corrcoef(np_scores_gpt_4, np_scores_ft_gpt_3p5)[0, 1]
corr_no_ft = np.corrcoef(np_scores_gpt_4, np_scores_gpt_3p5)[0, 1]
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/ fine-tuning",
total_obs=np_scores_gpt_4.shape[0],
corr=corr_ft,
)
)
print("\n")
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/out fine-tuning",
total_obs=np_scores_gpt_4.shape[0],
corr=corr_no_ft,
)
)
GPT-3.5 w/ fine-tuning
-----------------
Number of obs.: 44
Correlation with GPT-4: 0.9279850303778618

GPT-3.5 w/out fine-tuning
-----------------
Number of obs.: 44
Correlation with GPT-4: 0.8737418723878325
From the numbers above, we see that the fine-tuned GPT-3.5 judge correlates more strongly with the GPT-4 judge than its non-fine-tuned counterpart does. Thus, for this case, we see that fine-tuning has helped us to produce a GPT-3.5 judge that aligns more closely with the GPT-4 judge (and, by proxy, with human judgment).
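Correlation is one way to quantify judge agreement. As an aside (not part of the original analysis), mean absolute error and the exact-agreement rate between the two judges' scores give complementary views. A minimal sketch with synthetic scores; in the notebook you would substitute `scores["gpt_4"]` and `scores["ft_gpt_3p5"]` collected above:

```python
import numpy as np

# Synthetic judge scores, for illustration only.
gpt4_scores = np.array([4.0, 5.0, 3.0, 4.5, 2.0])
ft_scores = np.array([4.0, 4.5, 3.0, 4.0, 2.5])

mae = np.mean(np.abs(gpt4_scores - ft_scores))    # average score gap
exact = np.mean(gpt4_scores == ft_scores)         # fraction of identical scores
corr = np.corrcoef(gpt4_scores, ft_scores)[0, 1]  # Pearson, as reported above

print(float(mae))    # → 0.3
print(float(exact))  # → 0.4
```

MAE tells you how far apart the scores are on average, while correlation only captures whether they move together; reporting both gives a fuller picture of judge alignment.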