%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-llms-mistralai
%pip install llama-index-llms-huggingface-api
# NOTE: this notebook makes several API calls to generate text with OpenAI GPT
# models as well as models hosted on HuggingFace. If you prefer not to wait for
# these generations, then the data for this notebook can be obtained with the
# `wget` command provided below.
# !wget "https://www.dropbox.com/scl/fo/m7skpjdbpb0g3p76y6epe/h?rlkey=omh2ysgh9qqqztf81qvjlivu2&dl=1" -O pairwise.zip
import nest_asyncio
nest_asyncio.apply()
import os
# we will be using models on HuggingFace as our LLM answer generators
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")
# we will use GPT-4 and GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
import pandas as pd
# define jupyter display function
def display_eval_df(question, source, answer_a, answer_b, result) -> None:
"""Pretty print question/answer + gpt-4 judgement dataset."""
eval_df = pd.DataFrame(
{
"Question": question,
"Source": source,
"Model A": answer_a["model"],
"Answer A": answer_a["text"],
"Model B": answer_b["model"],
"Answer B": answer_b["text"],
"Score": result.score,
"Judgement": result.feedback,
},
index=[0],
)
eval_df = eval_df.style.set_properties(
**{
"inline-size": "300px",
"overflow-wrap": "break-word",
},
subset=["Answer A", "Answer B"]
)
display(eval_df)
Step 1: Generate the datasets: train_dataset and test_dataset¶
For the dataset against which we will generate questions and have various LLMs answer, we will use the WikipediaReader to read the "History of {city}" pages for several cities. We split the cities into two lists: one to be used for the train_dataset and the other for the test_dataset.
!pip install wikipedia -q
# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader
train_cities = [
"San Francisco",
"Toronto",
"New York City",
"Vancouver",
"Montreal",
"Boston",
]
test_cities = [
"Tokyo",
"Singapore",
"Paris",
]
train_documents = WikipediaReader().load_data(
pages=[f"History of {x}" for x in train_cities]
)
test_documents = WikipediaReader().load_data(
pages=[f"History of {x}" for x in test_cities]
)
Build the train_dataset and test_dataset with a DatasetGenerator¶
Now that we have our train and test sets of Documents, the next step is to generate the questions. For this, we will use the DatasetGenerator, which uses an LLM to generate questions from a given set of documents.
Generate questions¶
QUESTION_GEN_PROMPT = (
"You are a Teacher/ Professor. Your task is to setup "
"a quiz/examination. Using the provided context, formulate "
"a single question that captures an important fact from the "
"context. Restrict the question to the context information provided."
)
With all of that out of the way, let's spring into action and generate questions against the chunks of the Wikipedia documents we loaded above.
# generate questions against chunks
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
# instantiate DatasetGenerator's for train and test
train_dataset_generator = DatasetGenerator.from_documents(
train_documents,
question_gen_query=QUESTION_GEN_PROMPT,
llm=llm,
show_progress=True,
num_questions_per_chunk=25,
)
test_dataset_generator = DatasetGenerator.from_documents(
test_documents,
question_gen_query=QUESTION_GEN_PROMPT,
llm=llm,
show_progress=True,
num_questions_per_chunk=25,
)
# use DatasetGenerator to create questions from nodes
train_questions = train_dataset_generator.generate_questions_from_nodes(
num=200
)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:02<00:00, 36.34it/s]
test_questions = test_dataset_generator.generate_questions_from_nodes(num=150)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:02<00:00, 29.98it/s]
len(train_questions), len(test_questions)
(75, 64)
# let's take a look at a few of these
train_questions[:3]
['What event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?', 'What was the name of the first significant homestead established outside the immediate vicinity of Mission Dolores in San Francisco?', "What event in 1855 led to the establishment of San Francisco's first county hospital and the development of California's system of county hospitals for the poor?"]
test_questions[:3]
['Question: What was the name of the oldest Buddhist temple in Tokyo, founded in 628?', 'What event marked the end of the samurai system and feudal class divisions in Tokyo?', 'Question: What role did the Tokyo Imperial University play in the Meiji Era?']
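Notice that some of the generated test questions carry a stray "Question: " prefix (visible above). If you'd like, you can strip it with an optional one-liner (our addition, requires Python 3.9+; the recorded outputs below were produced with the prefixes left in):
# optional cleanup (our addition): strip the stray "Question: " prefix
cleaned_test_questions = [q.removeprefix("Question: ") for q in test_questions]
print(cleaned_test_questions[0])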
Generate answers to the questions¶
The next step is to generate answers using LLMs. As a reminder, the whole point here is to judge these generated answers, and later on we will use GPT models to do exactly that. For generating the answers themselves, we will use two other LLMs, namely Llama-2 and Mistral. To do this, we first create a vector store over our documents and an associated retriever, which both LLM answer-generators will make use of.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
# Create vector index
train_index = VectorStoreIndex.from_documents(documents=train_documents)
# Create the retriever on this index
train_retriever = VectorIndexRetriever(
index=train_index,
similarity_top_k=2,
)
# Create vector index for test to be used later
test_index = VectorStoreIndex.from_documents(documents=test_documents)
# Create the retriever for test to be used later
test_retriever = VectorIndexRetriever(
index=test_index,
similarity_top_k=2,
)
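As a quick sanity check (illustrative; the query string here is our own), we can confirm that the retriever returns the top-2 chunks along with similarity scores:
# illustrative spot-check of the retriever (not part of the original flow)
nodes = train_retriever.retrieve("What happened in San Francisco in 1906?")
print(len(nodes), nodes[0].score)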
From here, we will build the RetrieverQueryEngines that will take in our queries (i.e. questions) for processing. Note that we use the HuggingFaceInferenceAPI for our LLM answer-generators, and that Llama-2 requires permissions. If you haven't yet been granted access to this model, feel free to swap Llama-2 out for another model of your choosing; an illustrative substitution is sketched below.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
def create_query_engine(
hf_name: str, retriever: VectorIndexRetriever, hf_llm_generators: dict
) -> RetrieverQueryEngine:
"""Create a RetrieverQueryEngine using the HuggingFaceInferenceAPI LLM"""
if hf_name not in hf_llm_generators:
raise KeyError("model not listed in hf_llm_generators")
llm = HuggingFaceInferenceAPI(
model_name=hf_llm_generators[hf_name],
context_window=2048, # to use refine
token=HUGGING_FACE_TOKEN,
)
return RetrieverQueryEngine.from_args(retriever=retriever, llm=llm)
# define our llm-generators (query_engines)
hf_llm_generators = {
"mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.1",
"llama2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
}
train_query_engines = {
mdl: create_query_engine(mdl, train_retriever, hf_llm_generators)
for mdl in hf_llm_generators.keys()
}
test_query_engines = {
mdl: create_query_engine(mdl, test_retriever, hf_llm_generators)
for mdl in hf_llm_generators.keys()
}
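If you don't have access to the gated Llama-2 weights, any other chat model served via the HuggingFace Inference API can be dropped in before creating the query engines above. A hypothetical substitution (the model ID here is just an example):
# hypothetical substitution for the gated Llama-2 entry:
# hf_llm_generators["llama2-7b-chat"] = "HuggingFaceH4/zephyr-7b-beta"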
We're now ready to generate answers from the various LLMs. We'll do this now for the train_dataset and hold off on doing the same for the test_dataset until the time comes that we need it.
NOTE: generating these answers will take some time. If you'd rather not wait, you have the option of loading the train_qa.jsonl file, which contains the Llama-2 and Mistral answers to each question; a minimal loading sketch follows.
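A minimal sketch of that shortcut, assuming train_qa.jsonl (one JSON object per line) sits in the working directory, e.g. extracted from pairwise.zip:
import json

# load the pre-generated Llama-2/Mistral answers instead of generating them
with open("train_qa.jsonl") as f:
    train_dataset = [json.loads(line) for line in f]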
import tqdm
import random
train_dataset = []
for q in tqdm.tqdm(train_questions):
# randomly select two LLMs to generate answers to this q
model_versus = random.sample(list(train_query_engines.items()), 2)
# data for this q
data_entry = {"question": q}
responses = []
source = None
# generate answers
for name, engine in model_versus:
response = engine.query(q)
response_struct = {}
response_struct["model"] = name
response_struct["text"] = str(response)
if source is not None:
assert source == response.source_nodes[0].node.text[:1000] + "..."
else:
source = response.source_nodes[0].node.text[:1000] + "..."
responses.append(response_struct)
data_entry["answers"] = responses
data_entry["source"] = source
train_dataset.append(data_entry)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [07:40<00:00, 6.14s/it]
Get GPT-4 evaluations of the Mistral and Llama-2 answers¶
As mentioned, the point of this guide is to fine-tune an LLM judge against a GPT-4 judge. So, to complete our train_dataset, we now need to instantiate the GPT-4 judge and have it evaluate the answers provided by the other LLMs: Llama-2 and Mistral. For this, we use the PairwiseComparisonEvaluator class. This evaluator compares the two answers and rules on whether Llama-2's answer is better, Mistral's answer is better, or whether it's a tie (roughly speaking, a score of 1.0 favours the first answer presented, 0.0 the second, and 0.5 a tie).
There is some added nuance here, since with pairwise evaluations we have to be mindful of potential "position bias": the judge favouring the answer presented to it first in the prompt/context. To account for this, we have the GPT-4 judge evaluate each sample twice, flipping the order in which the two answers are presented in the second evaluation (first evaluation: Llama-2 then Mistral; second evaluation: Mistral then Llama-2).
Finally, we also use the OpenAIFineTuningHandler, which collects all of the chat histories that we will eventually need for fine-tuning GPT-3.5.
NOTE: generating these judgements will take some time. Again, you have the option of loading train_qa.jsonl as the train_dataset. We have also stored the JSONL files that we passed to OpenAI for fine-tuning GPT-3.5.
# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core import Settings
# NOTE: this finetuning_handler will collect 2x chat_histories for
# each query: one for original, and another for flipped
main_finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([main_finetuning_handler])
Settings.callback_manager = callback_manager
llm_4 = OpenAI(temperature=0, model="gpt-4", callback_manager=callback_manager)
gpt4_judge = PairwiseComparisonEvaluator(llm=llm_4)
for data_entry in tqdm.tqdm(train_dataset):
final_eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] = [judgement]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [48:04<00:00, 38.46s/it]
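Before looking at an individual judgement, here's a quick tally (illustrative, our addition) of how the paired evaluations resolved. A pairwise_source of "original" or "flipped" means the two presentation orders agreed; anything else is inconclusive:
# tally how the GPT-4 judgements resolved across the two presentation orders
from collections import Counter

print(Counter(d["evaluations"][0]["source"] for d in train_dataset))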
Let's see what one of these GPT-4 evaluations looks like.
# let's see the last one
display_eval_df(
question=data_entry["question"],
source=data_entry["source"],
answer_a=data_entry["answers"][0],
answer_b=data_entry["answers"][1],
result=final_eval_result,
)
| | Question | Source | Model A | Answer A | Model B | Answer B | Score | Judgement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What resources are available in Boston for researching and preserving historical artifacts and landmarks? | In 1963, Boston mayor John F. Collins and Edward J. Logue, executive of the Boston Redevelopment Authority (BRA), organized the city's savings banks, cooperative banks, and federal and state savings and loan associations, known as the Boston Banks Urban Renewal Group (B-BURG), which would reverse-redline portions of Dorchester, Roxbury, and Mattapan along Blue Hill Avenue. Despite legislation passed in 1950 by the 156th Massachusetts legislature banning racial discrimination or segregation in housing, and Executive Order 11063 issued by President John F. Kennedy in 1962 requiring all federal agencies to prevent racial discrimination in all federally funded subsidized housing in the United States, the Boston Housing Authority (BHA) board also actively segregated the city's public housing developments during the Collins administration, with bureaucratic resistance to integration from BHA departments lasting until at least 1966, and the board retained co... | llama2-7b-chat | Boston has a variety of resources available for researching and preserving historical artifacts and landmarks. These include: 1. The Boston Historical Society: this organization offers historical artifacts, photographs, and documents related to Boston's history. 2. Boston Mapjunction: this website provides over 200 historical maps of Boston dating back to 1630, along with aerial photographs compared against today's maps. 3. The City of Boston Archaeology Program and Laboratory: the City of Boston has a City Archaeologist who oversees the historical artifacts and significance of any land slated for development and manages archaeological sites located on Boston's public lands. The program also has an archaeology lab and an education and curation center. 4. The Freedom House photo collection: this collection contains over 2,000 images of the people, places, and events of Roxbury between 1950 and 1975. 5. Boston's vital records: this resource provides historical records relating to births, marriages, and deaths in Boston. 6. Reading and everyday life in | mistral-7b-instruct | The City of Boston has a City Archaeologist who oversees the historical artifacts and significance of any land slated for development and manages archaeological sites located on Boston's public lands, together with a City Archaeology Program and an archaeology laboratory, education, and curation center. The Freedom House photo collection contains over 2,000 images of the people, places, and events of Roxbury between 1950 and 1975 (located in the Archives and Special Collections at the Northeastern University Library in Boston, Massachusetts). | 1.000000 | Assistant A provides a more comprehensive answer, listing multiple resources available in Boston for researching and preserving historical artifacts and landmarks. These include the Boston Historical Society, Boston Mapjunction, the City of Boston Archaeology Program and Laboratory, the Freedom House photo collection, and Boston's vital records. This answer is more detailed and gives the user a broader range of resources to explore. Assistant B, on the other hand, mentions only the City of Boston Archaeology Program and Laboratory and the Freedom House photo collection. While these are relevant resources, the answer lacks the depth and variety of Assistant A's response. Therefore, based on the depth, variety, and level of detail of the responses, Assistant A's answer is superior. Final Verdict: [[A]] |
Special care for the fine-tuning JSONL¶
Since there are two evaluations per sample (one for the original order of presentation of the LLM answers and one for the flipped order), we need to be careful to keep only the correct one in our fine-tuning dataset. That means picking out the right events collected by the OpenAIFineTuningHandler and using only those to prepare the JSONL file that we will pass to OpenAI's fine-tuning API.
main_finetuning_handler.save_finetuning_events(
"pairwise_finetuning_events.jsonl"
)
Wrote 150 examples to pairwise_finetuning_events.jsonl
import json
# Get the fine_tuning_examples master dataset
with open("pairwise_finetuning_events.jsonl") as f:
combined_finetuning_events = [json.loads(line) for line in f]
finetuning_events = (
[]
) # for storing events using original order of presentation
flipped_finetuning_events = (
[]
) # for storing events using flipped order of presentation
for ix, event in enumerate(combined_finetuning_events):
if ix % 2 == 0: # we always do original ordering first
finetuning_events += [event]
else: # then we flip order and have GPT-4 make another judgement
flipped_finetuning_events += [event]
assert len(finetuning_events) == len(flipped_finetuning_events)
# we need to pick which of the chat_histories to keep
resolved_finetuning_events = []
for ix, data_entry in enumerate(train_dataset):
if data_entry["evaluations"][0]["source"] == "original":
resolved_finetuning_events += [finetuning_events[ix]]
elif data_entry["evaluations"][0]["source"] == "flipped":
resolved_finetuning_events += [flipped_finetuning_events[ix]]
else:
continue
with open("resolved_pairwise_finetuning_events.jsonl", "w") as outfile:
for entry in resolved_finetuning_events:
print(json.dumps(entry), file=outfile)
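A quick check (illustrative, our addition) that the resolution step kept exactly one event per conclusive judgement:
# each judgement that resolved to "original" or "flipped" contributes one event
print(len(resolved_finetuning_events), "of", len(train_dataset), "events kept")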
Step 2: Perform knowledge distillation¶
Okay, it's now time to distill some knowledge from GPT-4 down to GPT-3.5. To do this, we will use the OpenAIFinetuneEngine class along with the resolved_pairwise_finetuning_events.jsonl file that we just created.
from llama_index.finetuning import OpenAIFinetuneEngine
finetune_engine = OpenAIFinetuneEngine(
"gpt-3.5-turbo",
"resolved_pairwise_finetuning_events.jsonl",
)
finetune_engine.finetune()
Num examples: 72
First example:
{'role': 'system', 'content': "Please act as an impartial judge and evaluate the quality of the responses provided by two AI question-answering assistants to the user question perhaps with added reference which are displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better using the provided context. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: '[[A]]' if assistant A is better, '[[B]]' if assistant B is better, and '[[C]]' for a tie.\n"}
{'role': 'user', 'content': "[User Question]\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\n\n[The Start of Reference]\n=== Reconstruction ===\nAlmost immediately after the quake re-planning and reconstruction plans were hatched to quickly rebuild the city. One of the more famous and ambitious plans, proposed before the fire, came from famed urban planner, Daniel Burnham. His bold plan called for Haussmann style avenues, boulevards, and arterial thoroughfares that radiated across the city, a massive civic center complex with classical structures, what would have been the largest urban park in the world, stretching from Twin Peaks to Lake Merced with a large athenaeum at its peak, and various other proposals. This plan was dismissed by critics (both at the time and now), as impractical and unrealistic to municipal supply and demand. Property owners and the Real Estate industry were against the idea as well due to the amounts of their land the city would have to purchase to realize such proposals. While the original street grid was restored, many of Burnham's proposals eventually saw the light of day such ...\n[The End of Reference]\n\n[The Start of Assistant A’s Answer]\n\n\nThe 1906 earthquake and fire caused significant damage to San Francisco, with over 80% of the city destroyed, including almost all of the downtown core. The earthquake, which had a magnitude of 7.8 on the Richter scale, resulted from the rupture of over 270 miles of the San Andreas Fault, centered immediately offshore of San Francisco. Water mains ruptured throughout the city, and the fires that followed burned out of control for days, trapping many residents and causing widespread destruction. Despite the devastation, the city quickly rebuilt, with many of the buildings constructed after the disaster being more earthquake-resistant than those that had come before. The official death toll from this event was initially reported as 478, but was later revised to over 3,000.\n[The End of Assistant A’s Answer]\n\n[The Start of Assistant B’s Answer]\n1906 earthquake and fire.\n[The End of Assistant B’s Answer]"}
{'role': 'assistant', 'content': "Assistant A provides a detailed and comprehensive response to the user's question, explaining the event that caused significant damage to San Francisco in 1906, which was the earthquake and fire. It also provides additional information about the earthquake's magnitude, the extent of the damage, and the city's quick rebuilding efforts. On the other hand, Assistant B's response is very brief and only mentions the earthquake and fire, without providing any additional information or context. Therefore, Assistant A's response is more helpful, relevant, accurate, and detailed. \n\nFinal Verdict: [[A]]"}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 579, 1198
mean / median: 818.9305555555555, 772.0
p5 / p95: 625.9, 1076.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 66, 248
mean / median: 129.26388888888889, 117.5
p5 / p95: 81.0, 193.9

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~58963 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~176889 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens. This means your total cost for training will be $0.471704 per epoch.
# We can check the status of our current job as follows
# This may take some time ...
finetune_engine.get_current_job()
<FineTuningJob fine_tuning.job id=ftjob-jLxZggQbHz2F98IlhQEI9KIw at 0x2e6b91170> JSON: { "object": "fine_tuning.job", "id": "ftjob-jLxZggQbHz2F98IlhQEI9KIw", "model": "gpt-3.5-turbo-0613", "created_at": 1698817329, "finished_at": 1698817949, "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::8FyRSSOl", "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz", "result_files": [ "file-qLTnxGSZX2rHP0Q7wJIDDNWX" ], "status": "succeeded", "validation_file": null, "training_file": "file-xsAaOBjQ949ti0qk1xHHLOiF", "hyperparameters": { "n_epochs": 3 }, "trained_tokens": 176457, "error": null }
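If you'd rather block until the job finishes instead of re-running the cell, here is a simple polling sketch (our addition; the terminal status strings come from OpenAI's fine-tuning job object, as seen in the output above):
import time

# poll the OpenAI fine-tuning job until it reaches a terminal status
while finetune_engine.get_current_job().status not in (
    "succeeded",
    "failed",
    "cancelled",
):
    time.sleep(60)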
Step 3: Evaluate the fine-tuned GPT-3.5 judge on the test dataset¶
Now that we have our fine-tuned GPT-3.5, let's see how well it performs on a test set. But first, remember how we said we'd hold off on creating the test_dataset until the time came that we needed it? Well, that time is now. So here we will repeat the process used to create the train_dataset, but now for the test_dataset.
NOTE: generating these answers and evaluations will take some time. You have the option of loading test_qa_complete.jsonl, which contains the evaluations from all three LLM judges. You can load it as the test_dataset (a minimal loading sketch follows) and run the code in the Metrics subsection below.
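As with the training data, a minimal loading sketch for that shortcut, assuming test_qa_complete.jsonl (one JSON object per line) is in the working directory:
import json

# load the stored test entries, complete with all three judges' evaluations
with open("test_qa_complete.jsonl") as f:
    test_dataset = [json.loads(line) for line in f]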
import random
# Use Llama-2 and Mistral LLMs to generate the answers to the test queries
test_dataset = []
for q in tqdm.tqdm(test_questions):
# randomly select two LLMs to generate answers to this q
model_versus = random.sample(list(test_query_engines.items()), 2)
# data for this q
data_entry = {"question": q}
responses = []
source = None
# generate answers
for name, engine in model_versus:
response = engine.query(q)
response_struct = {}
response_struct["model"] = name
response_struct["text"] = str(response)
if source is not None:
assert source == response.source_nodes[0].node.text[:1000] + "..."
else:
source = response.source_nodes[0].node.text[:1000] + "..."
responses.append(response_struct)
data_entry["answers"] = responses
data_entry["source"] = source
test_dataset.append(data_entry)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [28:23<00:00, 26.62s/it]
# get the gpt-4 judgements on the Mistral and Llama-2 answers
for data_entry in tqdm.tqdm(test_dataset):
final_eval_result = await gpt4_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
# save final result
judgement = {}
judgement["llm"] = "gpt_4"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] = [judgement]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [43:21<00:00, 40.66s/it]
from llama_index.core.evaluation import EvaluationResult
# use our fine-tuned GPT-3.5 to evaluate the answers
ft_llm = finetune_engine.get_finetuned_model()
ft_gpt_3p5_judge = PairwiseComparisonEvaluator(llm=ft_llm)
for data_entry in tqdm.tqdm(test_dataset):
try:
final_eval_result = await ft_gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
    except Exception:
final_eval_result = EvaluationResult(
query=data_entry["question"],
response="",
passing=None,
score=0.5,
feedback="",
pairwise_source="output-cannot-be-parsed",
)
# save final result
judgement = {}
judgement["llm"] = "ft_gpt_3p5"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] += [judgement]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [04:08<00:00, 3.88s/it]
# Similarly, use a non-fine-tuned judge to evaluate the answers
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")
gpt_3p5_judge = PairwiseComparisonEvaluator(llm=gpt_3p5_llm)
for data_entry in tqdm.tqdm(test_dataset):
try:
final_eval_result = await gpt_3p5_judge.aevaluate(
query=data_entry["question"],
response=data_entry["answers"][0]["text"],
second_response=data_entry["answers"][1]["text"],
reference=data_entry["source"],
)
    except Exception:
final_eval_result = EvaluationResult(
query=data_entry["question"],
response="",
passing=None,
score=0.5,
feedback="",
pairwise_source="output-cannot-be-parsed",
)
# save final result
judgement = {}
judgement["llm"] = "gpt_3p5"
judgement["score"] = final_eval_result.score
judgement["text"] = final_eval_result.response
judgement["source"] = final_eval_result.pairwise_source
data_entry["evaluations"] += [judgement]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [09:32<00:00, 8.95s/it]
Metrics¶
Phew! Now that we have generated all of the LLM judges' evaluations of the Llama-2/Mistral answers on the test queries, let's get a quantitative view on how close the fine-tuned GPT-3.5 judge is to GPT-4.
To do this, we report a few metrics:
- Agreement rate with the GPT-4 evaluations
- Correlation to the GPT-4 evaluations
- Jaccard similarity to the GPT-4 evaluations
We also report the "inconclusive" counts, i.e. the number of times an LLM judge reverses its decision after seeing the Llama-2 and Mistral answers in the flipped order of presentation. Higher inconclusive counts indicate that the judge is susceptible to position bias, which is bad!
!pip install scikit-learn -q
import numpy as np
# store the scores and inconclusive booleans for each sample per LLM judge
scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
inconclusives = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
for e in d["evaluations"]:
scores[e["llm"]].append(e["score"])
inconclusives[e["llm"]].append(
e["source"] not in ["original", "flipped"]
)
REPORT_FMT_STR = (
"{model}\n"
"-----------------\n"
"Number of inconclusives: {inconclusive}\n"
"Number of agreements with GPT-4: {agreement} out of {total}\n"
"Agreement rate: {agreement_rate}\n"
"Correlation: {corr}\n"
"Jaccard: {jacc}\n\n"
)
from sklearn.metrics import jaccard_score
# numpy conversion
np_scores_gpt_4 = np.array(scores["gpt_4"])
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])
# can only compare when both judges have non inconclusive results
ft_mask = ~np.array(inconclusives["gpt_4"]) * ~np.array(
inconclusives["ft_gpt_3p5"]
)
no_ft_mask = ~np.array(inconclusives["gpt_4"]) * ~np.array(
inconclusives["gpt_3p5"]
)
# agreement rates
agreement_ft = sum(np_scores_gpt_4[ft_mask] == np_scores_ft_gpt_3p5[ft_mask])
agreement_rate_ft = agreement_ft / sum(ft_mask)
agreement_no_ft = sum(
np_scores_gpt_4[no_ft_mask] == np_scores_gpt_3p5[no_ft_mask]
)
agreement_rate_no_ft = agreement_no_ft / sum(no_ft_mask)
# correlations
corr_ft = np.corrcoef(np_scores_gpt_4[ft_mask], np_scores_ft_gpt_3p5[ft_mask])[
0, 1
]
corr_no_ft = np.corrcoef(
np_scores_gpt_4[no_ft_mask], np_scores_gpt_3p5[no_ft_mask]
)[0, 1]
# jaccard
jaccard_ft = jaccard_score(
np_scores_gpt_4[ft_mask].astype(str),
np_scores_ft_gpt_3p5[ft_mask].astype(str),
average="weighted",
)
jaccard_no_ft = jaccard_score(
np_scores_gpt_4[no_ft_mask].astype(str),
np_scores_gpt_3p5[no_ft_mask].astype(str),
average="weighted",
)
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/ fine-tuning",
inconclusive=sum(inconclusives["ft_gpt_3p5"]),
agreement=agreement_ft,
total=sum(ft_mask),
agreement_rate=agreement_rate_ft,
corr=corr_ft,
jacc=jaccard_ft,
)
)
print(
REPORT_FMT_STR.format(
model="GPT-3.5 w/out fine-tuning",
inconclusive=sum(inconclusives["gpt_3p5"]),
agreement=agreement_no_ft,
total=sum(no_ft_mask),
agreement_rate=agreement_rate_no_ft,
corr=corr_no_ft,
jacc=jaccard_no_ft,
)
)
print(
f"GPT-4\n-----------------\nInconclusive Count: {sum(inconclusives['gpt_4'])}"
)
GPT-3.5 w/ fine-tuning
-----------------
Number of inconclusives: 15
Number of agreements with GPT-4: 41 out of 47
Agreement rate: 0.8723404255319149
Correlation: 0.765365523658036
Jaccard: 0.773126734505088

GPT-3.5 w/out fine-tuning
-----------------
Number of inconclusives: 24
Number of agreements with GPT-4: 32 out of 38
Agreement rate: 0.8421052631578947
Correlation: 0.671929323262293
Jaccard: 0.7308712958867757

GPT-4
-----------------
Inconclusive Count: 4
Conclusion¶
From the numbers above, we see that fine-tuning the GPT-3.5 judge yields a higher agreement rate with, correlation to, and Jaccard similarity to the GPT-4 judge than the non-fine-tuned GPT-3.5 judge achieves. What's more, the inconclusive count also drops after fine-tuning. Overall, fine-tuning has given us a GPT-3.5 judge that is closer to the GPT-4 judge (and thus, by proxy, closer to human judgements), while also helping to remedy the position bias that the non-fine-tuned GPT-3.5 judge would otherwise carry.