HotpotQA Distractor Demo
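This notebook walks through evaluating a query engine on the HotpotQA dataset. In this task, the LLM must answer a question given a pre-configured context. Answers are usually expected to be concise, and accuracy is measured by token overlap (F1) and exact match (EM).
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.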
%pip install llama-index-llms-openai
!pip install llama-index
from llama_index.core.evaluation.benchmarks import HotpotQAEvaluator
from llama_index.core import VectorStoreIndex
from llama_index.core import Document
from llama_index.llms.openai import OpenAI
from llama_index.core.embeddings import resolve_embed_model
llm = OpenAI(model="gpt-3.5-turbo")
embed_model = resolve_embed_model(
    "local:sentence-transformers/all-MiniLM-L6-v2"
)
index = VectorStoreIndex.from_documents(
    [Document.example()], embed_model=embed_model, show_progress=True
)
Parsing documents into nodes: 100%|██████████| 1/1 [00:00<00:00, 129.13it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 36.62it/s]
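First, we try a very simple engine. In this particular benchmark, the retriever, and hence the index, is actually ignored, because the documents retrieved for each query are provided in the dataset. This is known as the "distractor" setting in HotpotQA.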
engine = index.as_query_engine(llm=llm)
HotpotQAEvaluator().run(engine, queries=5, show_result=True)
Dataset: hotpot_dev_distractor downloaded at: /Users/loganmarkewich/Library/Caches/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.00068)
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Response: No.
Correct answer: yes
EM: 0 F1: 0
-------------------------------------
Question: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: Unknown
Correct answer: Chief of Protocol
EM: 0 F1: 0
-------------------------------------
Question: What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?
Response: Animorphs
Correct answer: Animorphs
EM: 1 F1: 1.0
-------------------------------------
Question: Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?
Response: Yes.
Correct answer: no
EM: 0 F1: 0
-------------------------------------
Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?
Response: Greenwich Village
Correct answer: Greenwich Village, New York City
EM: 0 F1: 0.5714285714285715
-------------------------------------
Scores: {'exact_match': 0.2, 'f1': 0.31428571428571433}
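To picture why the index is ignored in the distractor setting, here is a minimal sketch of a lookup that returns the paragraphs shipped with each question instead of searching an index. It assumes records shaped like the HotpotQA distractor JSON (a `question`, an `answer`, and a `context` list of `[title, [sentences]]` pairs). The sample record and the names `SAMPLE_RECORDS`, `build_context_lookup`, and `retrieve` are hypothetical, and this is not the code that `HotpotQAEvaluator` actually runs.

```python
from typing import Dict, List

# Hypothetical record, shaped like an entry in the HotpotQA distractor file:
# "context" is a list of [title, [sentence, ...]] pairs, mixing relevant
# paragraphs with distractor paragraphs.
SAMPLE_RECORDS = [
    {
        "question": "Which city hosts the example museum?",
        "answer": "Exampleville",
        "context": [
            ["Example Museum", ["The Example Museum is in Exampleville.", " It opened in 1990."]],
            ["Unrelated Town", ["Unrelated Town is a distractor paragraph."]],
        ],
    }
]


def build_context_lookup(records: List[dict]) -> Dict[str, List[str]]:
    """Map each question to the paragraphs provided with it in the dataset."""
    lookup: Dict[str, List[str]] = {}
    for record in records:
        paragraphs = [
            f"{title}: {''.join(sentences)}"
            for title, sentences in record["context"]
        ]
        lookup[record["question"]] = paragraphs
    return lookup


def retrieve(lookup: Dict[str, List[str]], question: str) -> List[str]:
    """Return the dataset-provided paragraphs; no index search is involved."""
    return lookup.get(question, [])


lookup = build_context_lookup(SAMPLE_RECORDS)
for paragraph in retrieve(lookup, "Which city hosts the example museum?"):
    print(paragraph)
```

Because every question carries its own candidate paragraphs, the contents of the vector index built earlier have no influence on what the LLM sees in this benchmark.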
Now we try a sentence transformer reranker, which selects 3 of the 10 nodes proposed by the retriever.
from llama_index.core.postprocessor import SentenceTransformerRerank
rerank = SentenceTransformerRerank(top_n=3)
engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[rerank],
)
HotpotQAEvaluator().run(engine, queries=5, show_result=True)
Dataset: hotpot_dev_distractor downloaded at: /Users/loganmarkewich/Library/Caches/llama_index/datasets/HotpotQA
Evaluating on dataset: hotpot_dev_distractor
-------------------------------------
Loading 5 queries out of 7405 (fraction: 0.00068)
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Response: No.
Correct answer: yes
EM: 0 F1: 0
-------------------------------------
Question: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
Response: No government position.
Correct answer: Chief of Protocol
EM: 0 F1: 0
-------------------------------------
Question: What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?
Response: Animorphs
Correct answer: Animorphs
EM: 1 F1: 1.0
-------------------------------------
Question: Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?
Response: No.
Correct answer: no
EM: 1 F1: 1.0
-------------------------------------
Question: The director of the romantic comedy "Big Stone Gap" is based in what New York city?
Response: New York City.
Correct answer: Greenwich Village, New York City
EM: 0 F1: 0.7499999999999999
-------------------------------------
Scores: {'exact_match': 0.4, 'f1': 0.55}
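The F1 and exact match scores appear to improve slightly: exact match rises from 0.2 to 0.4 and F1 from about 0.31 to 0.55 on these five queries.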
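To make the reranking step concrete, the following is a minimal sketch of top-n selection: score every candidate passage against the query, sort by score, and keep the best few. `SentenceTransformerRerank` scores pairs with a cross-encoder model; the word-overlap scorer here is a deliberately simple stand-in so the example runs without downloading a model. The names `overlap_score` and `rerank` are hypothetical and not part of LlamaIndex.

```python
from typing import Callable, List


def overlap_score(query: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query words found in the passage.

    This is NOT the cross-encoder used by the real reranker; it only
    illustrates the shape of a (query, passage) -> score function.
    """
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    passages: List[str],
    top_n: int = 3,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Keep the top_n passages with the highest score for this query."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]


candidates = [
    "the cat sat",
    "dogs bark loudly",
    "a cat and a dog",
    "unrelated text",
]
print(rerank("cat dog", candidates, top_n=2))
```

Passing fewer but more relevant passages to the LLM is one plausible reason the answers change between the two runs, although five queries are too few to attribute the difference with confidence.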
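Note that this benchmark rewards short, factual answers without explanations, even though chain-of-thought (CoT) prompting is known to sometimes improve output quality.
The scores used are also not a perfect measure of correctness, but they offer a quick way to see how changes to the query engine affect its output.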
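For readers who want to see how such scores can be computed, here is a minimal sketch of exact match and token-level F1 in the style of SQuAD-like evaluation scripts. It assumes answers are normalized by lowercasing, stripping punctuation, and dropping English articles; the evaluator's actual implementation may differ in details such as special handling of yes/no answers. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Greenwich Village, New York City"
print(exact_match("No.", "no"))
print(f1_score("Greenwich Village", gold))
print(f1_score("New York City.", gold))
```

Under these assumptions, "Greenwich Village" shares 2 of the 5 gold tokens, giving precision 1.0 and recall 0.4, while "New York City." shares 3 of 5, giving recall 0.6. This shows why a partially correct answer earns partial F1 credit but no exact match, and why neither metric captures whether an answer is semantically right.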