Fine-tuning an Adapter on Top of Any Black-Box Embedding Model¶
LlamaIndex supports fine-tuning an adapter on top of embeddings produced by any model (sentence_transformers, OpenAI, and more).
This allows you to transform your embedding representations into a new latent space that is optimized for retrieval over your specific data and queries. The result can be a modest lift in retrieval performance, which in turn translates into a better-performing RAG system.
We do this through our `EmbeddingAdapterFinetuneEngine` abstraction. We fine-tune three kinds of adapters:
- Linear
- Two-layer neural network (NN)
- Custom neural network
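Conceptually, an adapter is a small transform applied to the query embedding at retrieval time, while document embeddings stay fixed (so you don't need to re-index). A minimal sketch of the linear case (not the library's implementation):

```python
import torch

dim = 384  # e.g. bge-small-en's embedding dimension
W = torch.eye(dim)  # learned matrix; identity init makes it a no-op at first
query_emb = torch.randn(dim)  # stand-in for a base-model query embedding

adapted_query_emb = W @ query_emb  # documents are embedded without the adapter
```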
Generate Corpus¶
We use our helper abstraction, `generate_qa_embedding_pairs`, to generate our training and evaluation datasets. This function takes in any set of text nodes (chunks) and generates a structured dataset of (question, context) pairs.
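Concretely, the serialized dataset is a set of parallel dictionaries, roughly shaped like this (an illustrative sketch; the field names follow `EmbeddingQAFinetuneDataset`, the values are made up):

```python
# Illustrative shape of the generated dataset (values are invented):
example = {
    "queries": {"<query_id>": "What percentage of revenue ...?"},
    "corpus": {"<node_id>": "chunk text used as context ..."},
    "relevant_docs": {"<query_id>": ["<node_id>"]},
}
```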
%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-adapter
%pip install llama-index-finetuning
import json
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode
Download Data¶
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
VAL_FILES = ["./data/10k/uber_2021.pdf"]
TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)
    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
We do a very naive train/val split, with the Lyft corpus as the training dataset and the Uber corpus as the validation dataset.
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)
Loading files ['../../../examples/data/10k/lyft_2021.pdf']
Loaded 238 docs
Parsing documents into nodes:   0%|          | 0/238 [00:00<?, ?it/s]
Parsed 349 nodes
Loading files ['../../../examples/data/10k/uber_2021.pdf']
Loaded 307 docs
Parsing documents into nodes:   0%|          | 0/307 [00:00<?, ?it/s]
Parsed 418 nodes
Generate Synthetic Queries¶
Now we use an LLM (gpt-3.5-turbo) to generate questions, using each text chunk in the corpus as context.
Each pair of (generated question, text chunk used as context) becomes a data point in the fine-tuning dataset (either for training or evaluation).
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)
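The calls above rely on default settings. Depending on your `llama-index-finetuning` version, you may be able to pass the generating LLM and the number of questions per chunk explicitly; a hedged sketch (verify the parameter names against your installed version):

```python
# Hedged sketch -- `llm` and `num_questions_per_chunk` may differ across versions.
from llama_index.llms.openai import OpenAI

train_dataset = generate_qa_embedding_pairs(
    train_nodes,
    llm=OpenAI(model="gpt-3.5-turbo"),
    num_questions_per_chunk=2,
)
```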
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")
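To sanity-check the generated data, you can peek at a single (question, context) pair; a small sketch using the dataset's `queries`, `corpus`, and `relevant_docs` attributes:

```python
# Inspect one generated (question, context) pair.
query_id, query = next(iter(train_dataset.queries.items()))
node_id = train_dataset.relevant_docs[query_id][0]
print(query)
print(train_dataset.corpus[node_id][:200])
```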
Run Embedding Finetuning¶
We then fine-tune a linear adapter on top of an existing embedding model. We import our new `EmbeddingAdapterFinetuneEngine` abstraction, which takes in an existing embedding model and a set of training parameters.
Fine-tuning bge-small-en (default)¶
from llama_index.finetuning import EmbeddingAdapterFinetuneEngine
from llama_index.core.embeddings import resolve_embed_model
import torch
base_embed_model = resolve_embed_model("local:BAAI/bge-small-en")
finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    base_embed_model,
    model_output_path="model_output_test",
    # bias=True,
    epochs=4,
    verbose=True,
    # optimizer_class=torch.optim.SGD,
    # optimizer_params={"lr": 0.01}
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
# alternatively import model
from llama_index.core.embeddings import LinearAdapterEmbeddingModel
# embed_model = LinearAdapterEmbeddingModel(base_embed_model, "model_output_test")
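The fine-tuned model is a standard LlamaIndex embedding model, so it can be dropped in wherever an `embed_model` is accepted. A small usage sketch:

```python
# Usage sketch: embed a query with the adapted model.
query_emb = embed_model.get_query_embedding("What was Lyft's revenue in 2021?")
print(len(query_emb))  # 384 for bge-small-en
```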
Evaluate Fine-tuned Model¶
We compare the fine-tuned model against the base model, as well as against text-embedding-ada-002.
We evaluate using two ranking metrics:
- Hit-rate metric: for each (query, context) pair, we retrieve the top-k documents with the query. It's a hit if the retrieved results contain the ground-truth context.
- Mean Reciprocal Rank (MRR): a slightly more granular ranking metric that looks at the "reciprocal rank" of the ground-truth context within the top-k retrieved results. The reciprocal rank is defined as 1/rank; naturally, if the results do not contain the context, the reciprocal rank is 0. (A worked sketch follows below.)
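Both metrics reduce to a simple per-query computation. A minimal sketch, assuming `retrieved_ids` is the ranked top-k result list and `expected_id` is the ground-truth context node (both names are illustrative):

```python
def hit_and_reciprocal_rank(retrieved_ids, expected_id):
    # Hit: the top-k results contain the ground-truth context.
    if expected_id not in retrieved_ids:
        return False, 0.0
    rank = retrieved_ids.index(expected_id) + 1  # 1-based rank
    return True, 1.0 / rank  # reciprocal rank


print(hit_and_reciprocal_rank(["n3", "n7", "n1"], "n7"))  # (True, 0.5)
```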
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd
from eval_utils import evaluate, display_results
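`eval_utils` is a small helper module that sits alongside this notebook in the repo. If you don't have it handy, a rough reconstruction of `evaluate`: index the corpus with the embedding model under test, retrieve the top-k nodes per query, and compute the per-query metrics. This is a sketch under those assumptions (the name `evaluate_sketch` is hypothetical), not the exact helper:

```python
# Hedged reconstruction of eval_utils.evaluate (assumed behavior, top_k=2).
def evaluate_sketch(dataset, embed_model, top_k=2):
    nodes = [TextNode(id_=id_, text=text) for id_, text in dataset.corpus.items()]
    index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=True)
    retriever = index.as_retriever(similarity_top_k=top_k)

    results = []
    for query_id, query in tqdm(dataset.queries.items()):
        retrieved_ids = [r.node.node_id for r in retriever.retrieve(query)]
        expected_id = dataset.relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids
        mrr = 1.0 / (retrieved_ids.index(expected_id) + 1) if is_hit else 0.0
        results.append({"is_hit": is_hit, "mrr": mrr})
    return pd.DataFrame(results)
```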
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████| 790/790 [03:03<00:00, 4.30it/s]
display_results(["ada"], [ada_val_results])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ada | 0.870886 | 0.72884 |
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████| 790/790 [00:23<00:00, 33.76it/s]
display_results(["bge"], [bge_val_results])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | bge | 0.787342 | 0.643038 |
ft_val_results = evaluate(val_dataset, embed_model)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 790/790 [00:21<00:00, 36.95it/s]
display_results(["ft"], [ft_val_results])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ft | 0.798734 | 0.662152 |
Here we display a concatenation of all the results. The fine-tuned adapter improves on the base bge model for both hit rate and MRR, though ada still scores highest.
display_results(
    ["ada", "bge", "ft"], [ada_val_results, bge_val_results, ft_val_results]
)
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ada | 0.870886 | 0.730105 |
| 1 | bge | 0.787342 | 0.643038 |
| 2 | ft | 0.798734 | 0.662152 |
Fine-tuning a Two-Layer Adapter¶
Let's try fine-tuning a two-layer NN as well!
It's a simple two-layer NN with a ReLU activation and a residual layer at the end.
We train for 25 epochs - longer than the linear adapter - and keep a checkpoint every 100 steps.
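Mirroring the forward pass of the adapter code shown in the custom-model section below, the two-layer transformation is roughly:

$$\text{out} = w_{\text{res}} \cdot \big(W_2\,\mathrm{ReLU}(W_1 e + b_1) + b_2\big) + e$$

where $e$ is the base query embedding and the residual weight $w_{\text{res}}$ is initialized to zero, so training starts from an identity mapping.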
# requires torch dependency
from llama_index.core.embeddings.adapter_utils import TwoLayerNN
from llama_index.finetuning import EmbeddingAdapterFinetuneEngine
from llama_index.core.embeddings import resolve_embed_model
from llama_index.embeddings.adapter import AdapterEmbeddingModel
base_embed_model = resolve_embed_model("local:BAAI/bge-small-en")
adapter_model = TwoLayerNN(
    384,  # input dimension
    1024,  # hidden dimension
    384,  # output dimension
    bias=True,
    add_residual=True,
)
finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    base_embed_model,
    model_output_path="model5_output_test",
    model_checkpoint_path="model5_ck",
    adapter_model=adapter_model,
    epochs=25,
    verbose=True,
)
finetune_engine.finetune()
embed_model_2layer = finetune_engine.get_finetuned_model(
    adapter_cls=TwoLayerNN
)
Evaluate Results¶
Run the same evaluation script as in the previous section to measure hit rate/MRR for the two-layer model.
# alternatively, load the fine-tuned model from disk
embed_model_2layer = AdapterEmbeddingModel(
    base_embed_model,
    "model5_output_test",
    TwoLayerNN,
)
from eval_utils import evaluate, display_results
ft_val_results_2layer = evaluate(val_dataset, embed_model_2layer)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████| 790/790 [00:21<00:00, 36.93it/s]
# comment out if you haven't run ada/bge yet
display_results(
    ["ada", "bge", "ft_2layer"],
    [ada_val_results, bge_val_results, ft_val_results_2layer],
)
# uncomment if you just want to display the fine-tuned model's results
# display_results(["ft_2layer"], [ft_val_results_2layer])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ada | 0.870886 | 0.728840 |
| 1 | bge | 0.787342 | 0.643038 |
| 2 | ft_2layer | 0.798734 | 0.662848 |
# load model from a checkpoint in the middle of training
embed_model_2layer_s900 = AdapterEmbeddingModel(
    base_embed_model,
    "model5_ck/step_900",
    TwoLayerNN,
)
ft_val_results_2layer_s900 = evaluate(val_dataset, embed_model_2layer_s900)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████| 790/790 [00:19<00:00, 40.57it/s]
# comment out if you haven't run ada/bge yet
display_results(
    ["ada", "bge", "ft_2layer_s900"],
    [ada_val_results, bge_val_results, ft_val_results_2layer_s900],
)
# uncomment if you just want to display the fine-tuned model's results
# display_results(["ft_2layer_s900"], [ft_val_results_2layer_s900])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ada | 0.870886 | 0.728840 |
| 1 | bge | 0.787342 | 0.643038 |
| 2 | ft_2layer_s900 | 0.803797 | 0.667426 |
Try Your Own Custom Model¶
You can define your own custom adapter here! Simply subclass the `BaseAdapter` class, which is a light wrapper around the `nn.Module` class.
You only need to override the `forward` and `get_config_dict` methods.
Just make sure you're comfortable writing PyTorch code :)
from llama_index.core.embeddings.adapter_utils import BaseAdapter
import torch
import torch.nn.functional as F
from torch import nn, Tensor
from typing import Dict
class CustomNN(BaseAdapter):
    """Custom NN transformation.

    Is a copy of our TwoLayerNN, showing it here for notebook purposes.

    Args:
        in_features (int): Input dimension.
        hidden_features (int): Hidden dimension.
        out_features (int): Output dimension.
        bias (bool): Whether to use bias. Defaults to False.
        add_residual (bool): Whether to add a residual connection. Defaults to False.

    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        out_features: int,
        bias: bool = False,
        add_residual: bool = False,
    ) -> None:
        super(CustomNN, self).__init__()
        self.in_features = in_features
        self.hidden_features = hidden_features
        self.out_features = out_features
        self.bias = bias

        self.linear1 = nn.Linear(in_features, hidden_features, bias=True)
        self.linear2 = nn.Linear(hidden_features, out_features, bias=True)
        self._add_residual = add_residual
        # if add_residual, then add residual_weight (init to 0)
        self.residual_weight = nn.Parameter(torch.zeros(1))

    def forward(self, embed: Tensor) -> Tensor:
        """Forward pass (Wv).

        Args:
            embed (Tensor): Input tensor.

        """
        output1 = self.linear1(embed)
        output1 = F.relu(output1)
        output2 = self.linear2(output1)

        if self._add_residual:
            output2 = self.residual_weight * output2 + embed

        return output2

    def get_config_dict(self) -> Dict:
        """Get config dict."""
        return {
            "in_features": self.in_features,
            "hidden_features": self.hidden_features,
            "out_features": self.out_features,
            "bias": self.bias,
            "add_residual": self._add_residual,
        }
custom_adapter = CustomNN(
    384,  # input dimension
    1024,  # hidden dimension
    384,  # output dimension
    bias=True,
    add_residual=True,
)
finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    base_embed_model,
    model_output_path="custom_model_output",
    model_checkpoint_path="custom_model_ck",
    adapter_model=custom_adapter,
    epochs=25,
    verbose=True,
)
finetune_engine.finetune()
embed_model_custom = finetune_engine.get_finetuned_model(
    adapter_cls=CustomNN
)
Evaluate Results¶
Run the same evaluation script as in the previous section to measure hit rate/MRR.
# [optional] load model manually
# embed_model_custom = AdapterEmbeddingModel(
#     base_embed_model,
#     "custom_model_ck/step_300",
#     CustomNN,
# )
from eval_utils import evaluate, display_results
ft_val_results_custom = evaluate(val_dataset, embed_model_custom)
Generating embeddings: 0%| | 0/395 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████| 790/790 [00:20<00:00, 37.77it/s]
display_results(["ft_custom"], [ft_val_results_custom])
|  | Retriever | Hit Rate | MRR |
|---|---|---|---|
| 0 | ft_custom | 0.789873 | 0.645127 |