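BEIR Out of Domain Benchmark

About BEIR

BEIR is a heterogeneous benchmark containing diverse information retrieval (IR) tasks. It also provides a common, simple framework for evaluating retrieval methods within the benchmark.

See the BEIR repository for the full list of supported datasets.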

Here, we test the BAAI/bge-small-en-v1.5 embedding model, loaded through HuggingFaceEmbedding, with the retriever's top_k value set to 30. We evaluate on the nfcorpus dataset.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-embeddings-huggingface

!pip install llama-index

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.evaluation.benchmarks import BeirEvaluator
from llama_index.core import VectorStoreIndex


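def create_retriever(documents):
    # Embed the dataset's documents with a local Hugging Face model.
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    index = VectorStoreIndex.from_documents(
        documents, embed_model=embed_model, show_progress=True
    )
    # Return the 30 most similar nodes per query, matching the largest
    # cutoff in metrics_k_values below.
    return index.as_retriever(similarity_top_k=30)


# Download nfcorpus, build the retriever, run the dataset's queries against
# its qrels, and print the metrics at each cutoff k.
BeirEvaluator().run(
    create_retriever, datasets=["nfcorpus"], metrics_k_values=[3, 10, 30]
)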

/home/jonch/.pyenv/versions/3.10.6/lib/python3.10/site-packages/beir/datasets/data_loader.py:2: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from tqdm.autonotebook import tqdm

Dataset: nfcorpus downloaded at: /home/jonch/.cache/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------

100%|███████████████████████████████████| 3633/3633 [00:00<00:00, 141316.79it/s]
Parsing documents into nodes: 100%|████████| 3633/3633 [00:06<00:00, 569.35it/s]
Generating embeddings: 100%|████████████████| 3649/3649 [04:22<00:00, 13.92it/s]

Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels

100%|█████████████████████████████████████████| 323/323 [01:26<00:00,  3.74it/s]

Results for: nfcorpus
{'NDCG@3': 0.35476, 'MAP@3': 0.07489, 'Recall@3': 0.08583, 'precision@3': 0.33746}
{'NDCG@10': 0.31403, 'MAP@10': 0.11003, 'Recall@10': 0.15885, 'precision@10': 0.23994}
{'NDCG@30': 0.28636, 'MAP@30': 0.12794, 'Recall@30': 0.21653, 'precision@30': 0.14716}
-------------------------------------
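
To compare the three cutoffs side by side, the result dictionaries printed above can be arranged into a small table. This is only a sketch: the scores are copied verbatim from the run above, while the `results` layout and the `format_table` helper are hypothetical names introduced here and are not part of LlamaIndex or BEIR.

```python
# Scores copied verbatim from the nfcorpus run above, keyed by cutoff k.
# The dictionary layout is an assumption made for this illustration.
results = {
    3: {"NDCG": 0.35476, "MAP": 0.07489, "Recall": 0.08583, "Precision": 0.33746},
    10: {"NDCG": 0.31403, "MAP": 0.11003, "Recall": 0.15885, "Precision": 0.23994},
    30: {"NDCG": 0.28636, "MAP": 0.12794, "Recall": 0.21653, "Precision": 0.14716},
}


def format_table(scores_by_k):
    """Render {k: {metric: score}} as a fixed-width text table (hypothetical helper)."""
    metrics = list(next(iter(scores_by_k.values())))
    lines = [f"{'k':>3} " + " ".join(f"{name:>10}" for name in metrics)]
    for k in sorted(scores_by_k):
        row = " ".join(f"{scores_by_k[k][name]:>10.5f}" for name in metrics)
        lines.append(f"{k:>3} {row}")
    return "\n".join(lines)


table = format_table(results)
print(table)
```

In this run, Recall and MAP rise as k grows, while Precision and NDCG fall, so scores are only comparable at the same cutoff.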

Higher is better for all the evaluation metrics.

This towardsdatascience article covers NDCG, MAP and MRR in more depth.
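
For a concrete feel of what the reported numbers measure, here is a minimal sketch of the textbook definitions of Precision@k, Recall@k, AP@k and NDCG@k for a single query. It assumes linear gains, a log2 rank discount, and AP normalised by the total number of relevant documents; the evaluator's own implementation may differ in details. The document IDs, relevance grades and function names are all made up for illustration.

```python
import math


def precision_at_k(ranked_ids, relevant, k):
    """Share of the top-k cutoff that is relevant (always divides by k)."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant) / k


def recall_at_k(ranked_ids, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked_ids[:k] if doc in relevant) / len(relevant)


def average_precision_at_k(ranked_ids, relevant, k):
    """Sum of precision values at each relevant hit in the top k,
    divided by the total number of relevant documents (one common convention)."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of an ideal ranking.
    Uses linear gain and a log2(rank + 1) discount."""
    dcg = sum(
        relevance.get(doc, 0) / math.log2(i + 2)
        for i, doc in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: graded judgements (qrels) and a retriever's ranked output.
relevance = {"d1": 2, "d2": 1, "d9": 1}  # hypothetical relevance grades
relevant = {doc for doc, gain in relevance.items() if gain > 0}
ranked = ["d3", "d1", "d7", "d2", "d5"]  # hypothetical ranking, best first

for k in (3, 5):
    print(
        k,
        round(precision_at_k(ranked, relevant, k), 5),
        round(recall_at_k(ranked, relevant, k), 5),
        round(average_precision_at_k(ranked, relevant, k), 5),
        round(ndcg_at_k(ranked, relevance, k), 5),
    )
```

MAP is the mean of the per-query AP values, and the other scores are likewise averaged over all queries in the dataset. Because "d9" is never retrieved in this toy ranking, recall stays below 1 at every cutoff.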