Finetune Embeddings¶
In this notebook, we show users how to finetune their own embedding models.
We go through three main sections:
- Preparing the data (our generate_qa_embedding_pairs function makes this easy)
- Finetuning the model (using our SentenceTransformersFinetuneEngine)
- Evaluating the model on a validation knowledge corpus
Generate Corpus¶
First, we create the corpus of text chunks by leveraging LlamaIndex to load some financial PDFs and parse/chunk them into plain-text chunks.
In [ ]
%pip install datasets
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
%pip install llama-index-readers-file
%pip install llama-index-embeddings-huggingface
%pip install "transformers[torch]"
In [ ]
import json
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode
Download Data
In [ ]
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
In [ ]
TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
VAL_FILES = ["./data/10k/uber_2021.pdf"]
TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"
In [ ]
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)
    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
We do a very naive train/val split by using the Lyft corpus as the train dataset and the Uber corpus as the val dataset.
In [ ]
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)
Loading files ['./data/10k/lyft_2021.pdf']
Loaded 238 docs
Parsing nodes:   0%|          | 0/238 [00:00<?, ?it/s]
Parsed 344 nodes
Loading files ['./data/10k/uber_2021.pdf']
Loaded 307 docs
Parsing nodes:   0%|          | 0/307 [00:00<?, ?it/s]
Parsed 410 nodes
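As an optional sanity check (not part of the original notebook), you can print one parsed chunk to confirm the splitting looks reasonable; this sketch reuses the MetadataMode import from above and the standard TextNode content accessor.
# Optional sketch: inspect the first training chunk (assumes the standard
# TextNode.get_content API and the MetadataMode import from earlier).
print(train_nodes[0].get_content(metadata_mode=MetadataMode.NONE)[:500])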
Generate Synthetic Queries¶
Now, we use an LLM (gpt-3.5-turbo) to generate questions, using each text chunk in the corpus as context.
Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (for training or evaluation).
In [ ]
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
In [ ]
import os
OPENAI_API_KEY = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
In [ ]
from llama_index.llms.openai import OpenAI
train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=train_nodes,
    output_path="train_dataset.json",
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=val_nodes,
    output_path="val_dataset.json",
)
100%|██████████| 344/344 [12:51<00:00,  2.24s/it]
100%|██████████| 410/410 [16:07<00:00,  2.36s/it]
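If you want to eyeball what the LLM produced, a small sketch like the following (not in the original notebook) prints one generated question and a snippet of its source chunk; it relies only on the queries/corpus/relevant_docs attributes that the eval code below also uses.
# Sketch: peek at one (generated question, source chunk) pair.
# Uses the same dataset attributes (queries, corpus, relevant_docs)
# as the evaluate() function defined later in this notebook.
sample_query_id, sample_query = next(iter(train_dataset.queries.items()))
sample_doc_id = train_dataset.relevant_docs[sample_query_id][0]
print("Question:", sample_query)
print("Context snippet:", train_dataset.corpus[sample_doc_id][:200])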
In [ ]
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")
Run Embedding Finetuning¶
In [ ]
from llama_index.finetuning import SentenceTransformersFinetuneEngine
In [ ]
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)
(Output: progress bars for downloading the BAAI/bge-small-en model files from the Hugging Face Hub: config, tokenizer, and model weights.)
In [ ]
finetune_engine.finetune()
Epoch: 0%| | 0/2 [00:00<?, ?it/s]
Iteration: 0%| | 0/69 [00:00<?, ?it/s]
Iteration: 0%| | 0/69 [00:00<?, ?it/s]
In [ ]
embed_model = finetune_engine.get_finetuned_model()
In [ ]
embed_model
Out [ ]
HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x2cc3d5cd0>, tokenizer_name='test_model', max_length=512, pooling=<Pooling.CLS: 'cls'>, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)
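As a quick smoke test (a sketch, not from the original notebook), you can embed a short query with the finetuned model and check the vector dimensionality; this assumes the standard get_text_embedding method on LlamaIndex embedding models.
# Sketch: embed a sample query with the finetuned model and inspect
# the embedding dimensionality (bge-small-en produces 384-dim vectors).
sample_embedding = embed_model.get_text_embedding("What was Lyft's revenue in 2021?")
print(len(sample_embedding), sample_embedding[:5])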
Evaluate Finetuned Model¶
In this section, we evaluate 3 different embedding models:
- the proprietary OpenAI embedding,
- the open-source BAAI/bge-small-en model, and
- our finetuned embedding model.
We consider 2 evaluation approaches:
- a simple custom hit-rate metric
- using the InformationRetrievalEvaluator from sentence_transformers
We show that finetuning on a synthetic (LLM-generated) dataset significantly improves upon the open-source embedding model.
In [ ]
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd
Define Eval Function¶
Option 1: We use a simple hit-rate metric for evaluation:
- for each (query, relevant_doc) pair,
- we retrieve the top-k documents with the query, and
- it's a hit if the results contain the relevant_doc.
This approach is very simple and intuitive, and we can apply it to the proprietary OpenAI embedding as well as to our open-source and finetuned embedding models; a tiny worked example of the metric follows.
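To make the metric concrete, here is a small illustrative computation (toy IDs, not from the notebook): the hit rate is simply the fraction of queries whose relevant document appears among the retrieved top-k results.
# Illustrative only: hit rate over three toy queries with made-up doc IDs.
retrieved = {"q1": ["d1", "d7"], "q2": ["d3", "d4"], "q3": ["d9", "d2"]}
expected = {"q1": "d7", "q2": "d5", "q3": "d2"}
hits = [expected[q] in docs for q, docs in retrieved.items()]
print(sum(hits) / len(hits))  # 2 hits out of 3 -> ~0.67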
In [ ]
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results
Option 2: We use the InformationRetrievalEvaluator from sentence_transformers.
This provides a more comprehensive suite of metrics, but we can only run it against models that are compatible with sentence_transformers (the open-source model and our finetuned model, not the OpenAI embedding model).
In [ ]
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path
def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)
Run Evals¶
OpenAI¶
Note: this may take a few minutes to run, since we need to embed the corpus and the queries.
In [ ]
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)
In [ ]
df_ada = pd.DataFrame(ada_val_results)
In [ ]
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada
Out [ ]
0.8779904306220095
BAAI/bge-small-en¶
In [ ]
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)
(Output: progress bars for downloading the BAAI/bge-small-en model files from the Hugging Face Hub, followed by progress bars for generating embeddings over 418 validation nodes and evaluating 836 validation queries.)
In [ ]
df_bge = pd.DataFrame(bge_val_results)
In [ ]
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge
Out [ ]
0.7930622009569378
In [ ]
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[59], line 1
----> 1 evaluate_st(val_dataset, "BAAI/bge-small-en", name='bge')

Cell In[49], line 15, in evaluate_st(dataset, model_id, name)
     13 evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
     14 model = SentenceTransformer(model_id)
---> 15 return evaluator(model, output_path='results/')

File ~/Programming/gpt_index/.venv/lib/python3.10/site-packages/sentence_transformers/evaluation/InformationRetrievalEvaluator.py:104, in InformationRetrievalEvaluator.__call__(self, model, output_path, epoch, steps, *args, **kwargs)
    102 csv_path = os.path.join(output_path, self.csv_file)
    103 if not os.path.isfile(csv_path):
--> 104     fOut = open(csv_path, mode="w", encoding="utf-8")
    105     fOut.write(",".join(self.csv_headers))
    106     fOut.write("\n")

FileNotFoundError: [Errno 2] No such file or directory: 'results/Information-Retrieval_evaluation_bge_results.csv'
Note: this error occurred because the results/ directory did not exist when the evaluator tried to write its CSV; the evaluate_st function defined above now creates it with Path(output_path).mkdir(exist_ok=True, parents=True), which avoids the error.
Finetuned Model¶
In [ ]
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)
In [ ]
df_finetuned = pd.DataFrame(val_results_finetuned)
In [ ]
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned
In [ ]
evaluate_st(val_dataset, "test_model", name="finetuned")
Summary of Results¶
Hit Rate¶
In [ ]
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"
We can see that finetuning our small open-source embedding model drastically improves its retrieval quality (even approaching the quality of the proprietary OpenAI embedding)!
In [ ]
df_all = pd.concat([df_ada, df_bge, df_finetuned])
df_all.groupby("model").mean("is_hit")
InformationRetrievalEvaluator¶
In [ ]
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)
We can see that embedding finetuning improves performance consistently across the suite of eval metrics.
In [ ]
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all
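If you only care about the cosine-similarity metrics, a pandas filter such as the following can narrow the comparison (a sketch; the exact column names depend on the sentence_transformers version that wrote the CSV):
# Sketch: keep only the cosine-similarity columns and transpose for easier
# side-by-side reading (column names vary by sentence_transformers version).
df_st_all.filter(like="cos_sim").T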