GPT-3.5-Turbo 微调¶
在本 notebook 中,我们将逐步介绍一个微调 gpt-3.5-turbo 的示例。
具体来说,我们尝试提炼 GPT-4 的知识,通过使用 GPT-4 生成训练数据,然后对 GPT-3.5 进行微调。
所有训练数据均使用索引数据中的两个不同部分生成,从而创建训练集和评估集。
然后我们使用 OpenAIFinetuneEngine
包装器抽象进行微调。
评估使用 ragas
库进行,我们稍后将详细介绍。
%pip install llama-index-finetuning
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-openai
# !pip install llama-index pypdf sentence-transformers ragas
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
数据准备¶
在这里,我们首先下载将用于生成训练数据的 PDF 文件。
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 20.7M 100 20.7M 0 0 397k 0 0:00:53 0:00:53 --:--:-- 417k84k 0 0:00:55 0:00:24 0:00:31 406k 0 395k 0 0:00:53 0:00:48 0:00:05 403k0 396k 0 0:00:53 0:00:53 --:--:-- 406k
下一步是生成训练集和评估集。
我们将对下载的 PDF 文件中不同部分生成 40 个问题。
我们可以对评估问题使用 GPT-3.5 来获得基线性能。
然后,我们将对训练问题使用 GPT-4 来生成训练数据。训练数据将通过 OpenAIFineTuningHandler
收集。
如果您不想花费时间和 token,此步骤是完全可选的——评估和训练问题以及训练数据也在此文件夹中提供!
训练数据生成¶
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import DatasetGenerator
documents = SimpleDirectoryReader(
input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
# Shuffle the documents
import random
random.seed(42)
random.shuffle(documents)
gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
question_gen_query = (
"You are a Teacher/ Professor. Your task is to setup "
"a quiz/examination. Using the provided context, formulate "
"a single question that captures an important fact from the "
"context. Restrict the question to the context information provided."
)
dataset_generator = DatasetGenerator.from_documents(
documents[:50],
question_gen_query=question_gen_query,
llm=gpt_35_llm,
)
# NOTE: this may take some time. Go grab a coffee!
questions = dataset_generator.generate_questions_from_nodes(num=40)
print("Generated ", len(questions), " questions")
Generated 40 questions
with open("train_questions.txt", "w") as f:
for question in questions:
f.write(question + "\n")
评估数据生成¶
现在,让我们在完全不同的文档集上生成问题,以创建我们的评估数据集。
dataset_generator = DatasetGenerator.from_documents(
documents[
50:
], # since we generated ~1 question for 40 documents, we can skip the first 40
question_gen_query=question_gen_query,
llm=gpt_35_llm,
)
# NOTE: this may take some time. Go grab a coffee!
questions = dataset_generator.generate_questions_from_nodes(num=40)
print("Generated ", len(questions), " questions")
Generated 40 questions
with open("eval_questions.txt", "w") as f:
for question in questions:
f.write(question + "\n")
使用 GPT-3.5-Turbo 查询引擎进行初始评估¶
对于本次评估,我们将使用 ragas
评估库。
Ragas 为 RAG 管道提供了大量的评估指标,您可以在这里阅读相关内容。
对于本 notebook,我们将使用以下两个指标
answer_relevancy
- 这衡量了生成的答案与提示的相关程度。如果生成的答案不完整或包含冗余信息,得分将较低。通过计算 LLM 使用生成的答案生成给定问题的几率来量化此指标。取值范围 (0,1),越高越好。faithfulness
- 这衡量了生成的答案与给定上下文的事实一致性。这是通过一个多步骤范式完成的,包括从生成的答案创建陈述,然后根据上下文验证每个陈述。答案被缩放到 (0,1) 范围。越高越好。
questions = []
with open("eval_questions.txt", "r") as f:
for line in f:
questions.append(line.strip())
from llama_index.core import VectorStoreIndex
# limit the context window to 2048 tokens so that refine is used
from llama_index.core import Settings
Settings.context_window = 2048
index = VectorStoreIndex.from_documents(
documents,
)
query_engine = index.as_query_engine(similarity_top_k=2, llm=gpt_35_llm)
contexts = []
answers = []
for question in questions:
response = query_engine.query(question)
contexts.append([x.node.get_content() for x in response.source_nodes])
answers.append(str(response))
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness
ds = Dataset.from_dict(
{
"question": questions,
"answer": answers,
"contexts": contexts,
}
)
result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)
evaluating with [answer_relevancy]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:02<00:00, 20.69s/it]
evaluating with [faithfulness]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:52<00:00, 77.37s/it]
{'ragas_score': 0.8356, 'answer_relevancy': 0.9725, 'faithfulness': 0.7325}
使用 GPT-4 收集训练数据¶
在这里,我们使用 GPT-4 和 OpenAIFineTuningHandler
来收集我们想要用于训练的数据。
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
llm.callback_manager = callback_manager
questions = []
with open("train_questions.txt", "r") as f:
for line in f:
questions.append(line.strip())
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(
documents,
)
query_engine = index.as_query_engine(similarity_top_k=2, llm=llm)
for question in questions:
response = query_engine.query(question)
创建 OpenAIFinetuneEngine
¶
我们创建一个 OpenAIFinetuneEngine
:微调引擎将负责启动微调作业,并返回一个您可以直接插入到 LlamaIndex 其他工作流程中的 LLM 模型。
我们使用默认构造函数,但我们也可以通过 from_finetuning_handler
类方法将我们的 finetuning_handler 直接传递到此引擎中。
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")
from llama_index.finetuning import OpenAIFinetuneEngine
finetune_engine = OpenAIFinetuneEngine(
"gpt-3.5-turbo",
"finetuning_events.jsonl",
# start_job_id="<start-job-id>" # if you have an existing job, can specify id here
)
# finetune_engine = OpenAIFinetuneEngine.from_finetuning_handler(
# finetuning_handler,
# "gpt-3.5-turbo",
# "tmp.jsonl"
# )
finetune_engine.finetune()
Num examples: 61 First example: {'role': 'system', 'content': "You are an expert Q&A system that is trusted around the world.\nAlways answer the query using the provided context information, and not prior knowledge.\nSome rules to follow:\n1. Never directly reference the given context in your answer.\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines."} {'role': 'user', 'content': 'Context information is below.\n---------------------\npage_label: 410\nfile_name: IPCC_AR6_WGII_Chapter03.pdf\n\nIt is challenging to apply this experimental approach to communities or ecosystems (see Figure \nBox\xa03.1.1).To date, most research on community or ecosystem response to climate-induced drivers has been in large-volume (>10,000 l) \nmesocosms (Riebesell and Gattuso, 2014), or at natural analogues such as CO 2 seeps, in which only one driver (ocean acidification) is \naltered (see (4) in Figure Box\xa03.1.1).Only very recently have two drivers been incorporated into climate-change manipulation studies \nexamining responses of primary producers to secondary consumers (see (5) in Figure Box\xa03.1.1a; Nagelkerken et\xa0al., 2020).Therefore, \n‘natural experiments’ from the geological past (Reddin et\xa0al., 2020) provide insights into how food webs and their constituents respond to \ncomplex change involving multiple drivers.Contemporary observations are occasionally long enough (>50\xa0years) to capture community \nresponses to complex climate change.For example, Brun et\xa0al.(2019) reported a shift in zooplankton community structure in the North \nAtlantic (1960–2014), with major biogeochemical ramifications.Conducting sufficiently long manipulation experiments to study the effect of adaptation on organisms is equally difficult (see Figure \nBox\xa03.1.1b), with much research restricted to multi-year studies of the microevolution of fast-growing (more than one division per day) \nphytoplankton species responding to single drivers (Lohbeck et\xa0al., 2012; Schaum et\xa0al., 2016).In a few experimental evolution studies \n(see (7) in Figure Box\xa03.1.1a; Brennan et\xa0al., 2017), multiple drivers have been used, but none have used communities or ecosystems (see \nFigure Box\xa03.1.1b).Nevertheless, the fossil record provides limited evidence of adaptations to less rapid (relative to present day) climate \nchange (Jackson et\xa0al., 2018).Despite the need to explore ecological or biogeochemical responses to projected future ocean conditions, \nlogistical challenges require that assessments of climate-change impacts at scales larger than mesocosms use large-scale, long-term in \nsitu observational studies (as documented in Section\xa03.4).\n\npage_label: 409\nfile_name: IPCC_AR6_WGII_Chapter03.pdf\n\n3\n409Oceans and Coastal Ecosystems and Their Services Chapter 3\nunderlies inhibited thermal adaptation under nitrogen-limited \nconditions (low confidence) (Aranguren-Gassis et\xa0 al., 2019).When \nselection is strong due to unfavourable environmental conditions, \nmicrobial populations can encounter functional and evolutionary \ntrade-offs evidenced by reducing growth rates while increasing \ntolerance and metabolism of reactive oxygen species (Lindberg and \nCollins, 2020).Other trade-offs can be observed in offspring quality \nand number (Lindberg and Collins, 2020).These findings contribute \ntowards a mechanistic framework describing the range of evolutionary \nstrategies in response to multiple drivers (Collins et\xa0al., 2020), but other \nhazards, such as extreme events (e.g., MHWs), still need to be included \nbecause their characteristics may alter the potential for adaptation of \nspecies and populations to climate change (Gruber et\xa0al., 2021).3.3.5 Ecological Response to Multiple Drivers\nAssessing ecological responses to multiple climate-induced drivers \nrequires a combination of approaches, including laboratory- and \nfield-based experiments, field observations (e.g., natural gradients, \nclimate analogues), study of paleo-analogues and the development \nof mechanistic and empirical models (Clapham, 2019; Gissi et\xa0 al., \n2021).Experimental studies of food-web responses are often limited \nto an individual driver, although recent manipulations have used a \nmatrix of >1000-l mesocosms to explore ecological responses to both \nwarming and acidification (see Box\xa0 3.1; Nagelkerken et\xa0 al., 2020).Hence, complementary approaches are needed to indirectly explore \nthe mechanisms underlying ecosystem responses to global climate \nchange (Parmesan et\xa0al., 2013).Observations from time series longer \nthan modes of natural variability (i.e., decades) are essential for \nrevealing and attributing ecological responses to climate change (e.g., \nSection\xa03.4; Barton et\xa0al., 2015b; Brun et\xa0al., 2019).Also, paleorecords \nprovide insights into the influence of multiple drivers on marine \nbiota (Cross-Chapter Box\xa0 PALEO in Chapter\xa0 1; Reddin et\xa0 al., 2020).Specifically, associations between vulnerabilities and traits of marine \nectotherms in laboratory experiments correspond with organismal \nresponses to ancient hyperthermal events (medium confidence) \n(Reddin et\xa0 al., 2020).This corroboration suggests that responses to \nmultiple drivers inferred from the fossil record can help provide insights \ninto the future status of functional groups, and hence food webs, under \nrapid climate change.Multi-species and integrated end-to-end ecosystem models are \npowerful tools to explore and project outcomes to the often-interacting \ncumulative effects of climate change and other anthropogenic drivers \n(Section\xa03.1; Kaplan and Marshall, 2016; Koenigstein et\xa0al., 2016; Peck \nand Pinnegar, 2018; Tittensor et\xa0 al., 2018; Gissi et\xa0 al., 2021).These \nmodels can integrate some aspects of the knowledge accrued from \nmanipulation experiments, paleo- and contemporary observations, help \ntest the relative importance of specific drivers and driver combinations, \nand identify synergistic or antagonistic responses (Koenigstein et\xa0al., \n2016; Payne et\xa0al., 2016; Skogen et\xa0al., 2018; Tittensor et\xa0al., 2018).As these models are associated with wide-ranging uncertainties \n(SM3.2.2; Payne et\xa0 al., 2016; Trolle et\xa0 al., 2019; Heneghan et\xa0 al., \n2021), they cannot be expected to accurately project the trajectories \nof complex marine ecosystems under climate change; hence, they are \nmost useful for assessing overall trends and in particular for providing a plausible envelope of trajectories across a range of assumptions \n(Fulton et\xa0al., 2018; Peck et\xa0al., 2018; Tittensor et\xa0al., 2018).\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: What are some approaches used to assess ecological responses to multiple climate-induced drivers in the context of climate change and the oceans?\nAnswer: '} {'role': 'assistant', 'content': 'Several approaches are used to assess ecological responses to multiple climate-induced drivers. These include laboratory- and field-based experiments, field observations such as natural gradients and climate analogues, the study of paleo-analogues, and the development of mechanistic and empirical models. Experimental studies often focus on individual drivers, but recent manipulations have used large-volume mesocosms to explore ecological responses to both warming and acidification. Observations from time series longer than modes of natural variability are essential for revealing and attributing ecological responses to climate change. Paleorecords also provide insights into the influence of multiple drivers on marine biota. Multi-species and integrated end-to-end ecosystem models are powerful tools to explore and project outcomes to the often-interacting cumulative effects of climate change and other anthropogenic drivers. These models can integrate some aspects of the knowledge accrued from manipulation experiments, paleo- and contemporary observations, help test the relative importance of specific drivers and driver combinations, and identify synergistic or antagonistic responses.'} No errors found Num examples missing system message: 21 Num examples missing user message: 0 #### Distribution of num_messages_per_example: min / max: 2, 3 mean / median: 2.6557377049180326, 3.0 p5 / p95: 2.0, 3.0 #### Distribution of num_total_tokens_per_example: min / max: 229, 2011 mean / median: 1274.27868852459, 1385.0 p5 / p95: 533.0, 1848.0 #### Distribution of num_assistant_tokens_per_example: min / max: 11, 334 mean / median: 72.36065573770492, 37.0 p5 / p95: 23.0, 193.0 0 examples may be over the 4096 token limit, they will be truncated during fine-tuning Dataset has ~77731 tokens that will be charged for during training By default, you'll train for 3 epochs on this dataset By default, you'll be charged for ~233193 tokens As of Augest 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens. This means your total cost for training will be $0.621848 per epoch. Waiting for file to be ready...
finetune_engine.get_current_job()
<FineTuningJob fine_tuning.job id=ftjob-u9T7BF5zRxVX4n5b9Jtbb5cR at 0x2c641fe20> JSON: { "object": "fine_tuning.job", "id": "ftjob-u9T7BF5zRxVX4n5b9Jtbb5cR", "model": "gpt-3.5-turbo-0613", "created_at": 1693254044, "finished_at": null, "fine_tuned_model": null, "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz", "result_files": [], "status": "running", "validation_file": null, "training_file": "file-j1fwmqIAoqZXWZQ8EqwHucXs", "hyperparameters": { "n_epochs": 3 }, "trained_tokens": null }
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
# Option 1: pass in ft_llm directly into Settings
from llama_index.core import Settings
Settings.llm = ft_llm
Settings.context_window = (
2048 # limit the context window artifically to test refine process
)
# # Option 2: you can also specify the model name manually
# ft_model_name = "ft:gpt-3.5-turbo-0613:..."
# Settings.llm = OpenAI(model=ft_model_name, temperature=0.3)
questions = []
with open("eval_questions.txt", "r") as f:
for line in f:
questions.append(line.strip())
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2, llm=ft_llm)
contexts = []
answers = []
for question in questions:
response = query_engine.query(question)
contexts.append([x.node.get_content() for x in response.source_nodes])
answers.append(str(response))
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness
ds = Dataset.from_dict(
{
"question": questions,
"answer": answers,
"contexts": contexts,
}
)
result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)
evaluating with [answer_relevancy]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:49<00:00, 16.34s/it]
evaluating with [faithfulness]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [04:04<00:00, 81.44s/it]
{'ragas_score': 0.8680, 'answer_relevancy': 0.9607, 'faithfulness': 0.7917}
探索差异¶
让我们快速比较一下响应的差异,以证明微调确实改变了一些东西。
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
questions = []
with open("eval_questions.txt", "r") as f:
for line in f:
questions.append(line.strip())
print(questions[12])
What is a key barrier globally for ocean health, governance, and adaptation to climate change, according to the report?
原始¶
from llama_index.core.response.notebook_utils import display_response
from llama_index.llms.openai import OpenAI
gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
query_engine = index.as_query_engine(llm=gpt_35_llm)
response = query_engine.query(questions[12])
display_response(response)
最终响应:
根据报告,全球海洋健康、治理和适应气候变化的一个关键障碍是技术、知识和财政支持的可获得性,以及现有的治理结构。
微调后¶
query_engine = index.as_query_engine(llm=ft_llm)
response = query_engine.query(questions[12])
display_response(response)
最终响应:
报告指出了生态系统和人类系统适应气候变化的广泛障碍和限制。这些障碍包括技术、知识和财政支持的可获得性,以及现有的治理结构。现有的海洋治理结构已经由于气候变化而面临多维度、尺度相关的挑战。
正如我们所见,微调后的模型提供了更全面的响应!这与 ragas 的 faithfulness 分数提高相符,因为答案更能代表检索到的上下文。
结论¶
因此,总之,仅用约 61 个问题进行微调确实帮助提高了我们的评估分数!
answer_relevancy:0.9725 -> 0.9607
答案相关性略有下降,但幅度非常小。
faithfulness:0.7325 -> 0.7917
Faithfulness 似乎得到了改善!这意味着给出的答案更好地满足了原始问题。