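Relative Score Fusion and Distribution-Based Score Fusion
In this example, we demonstrate how to use QueryFusionRetriever with two methods that aim to improve on Reciprocal Rank Fusion:
- Relative Score Fusion (Weaviate)
- Distribution-Based Score Fusion (Mazzeschi: blog post)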
%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25
import os
import openai
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
Setup
If you are opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
Next, we will set up a vector index over the documents.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=256)
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 7.55it/s]
Generating embeddings: 100%|██████████| 504/504 [00:03<00:00, 128.32it/s]
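First, we create the retrievers. The vector retriever returns the 5 most similar nodes, and the BM25 retriever returns the 10 highest-scoring nodes.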
from llama_index.retrievers.bm25 import BM25Retriever
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=10
)
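Next, we can create the fusion retriever, which returns the 10 highest-scoring nodes from the up to 15 nodes returned by the two retrievers (5 from the vector retriever, 10 from BM25).
Note that the vector and BM25 retrievers may return some of the same nodes, only in different orders; for those shared nodes, fusion effectively acts as a re-ranker.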
from llama_index.core.retrievers import QueryFusionRetriever
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=True,
    verbose=True,
)
# apply nested async to run in a notebook
import nest_asyncio
nest_asyncio.apply()
nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)
for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")
Score: 0.60 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.59 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.40 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.36 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.25 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.25 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.21 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.11 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----
Score: 0.11 - The next year, from the summer of 1998 to the summer of 1999, must have been the least productive of...
-----
Score: 0.07 - The point is that it was really cheap, less than half market price. [8] Most software you can launc...
-----
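To make the scores above easier to interpret, here is a minimal pure-Python sketch of the idea behind relative score fusion: min-max normalize each retriever's scores to [0, 1], scale them by the retriever weight, and sum the contributions per node. The function names, node IDs, and raw scores are hypothetical, and this is an illustration of the technique rather than the LlamaIndex implementation (for example, how ties and equal scores are handled here is an assumption).

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Scale a {node_id: raw_score} dict into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all scores are equal, treat every node as a top hit.
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - lo) / (hi - lo) for node_id, s in scores.items()}


def relative_score_fusion(results, weights, top_k):
    """Fuse several {node_id: raw_score} dicts into one ranked list."""
    fused = defaultdict(float)
    for scores, weight in zip(results, weights):
        for node_id, s in min_max_normalize(scores).items():
            # A node found by several retrievers accumulates weighted scores.
            fused[node_id] += weight * s
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical raw scores: cosine similarities and BM25 scores use different scales.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.75}
bm25_scores = {"b": 7.1, "d": 5.0, "a": 2.9}

fused = relative_score_fusion([vector_scores, bm25_scores], [0.6, 0.4], top_k=3)
for node_id, score in fused:
    print(f"{node_id}: {score:.2f}")
```

Under this scheme, a node ranked first by the vector retriever but contributing nothing from BM25 ends up at exactly the vector weight, which would be consistent with the 0.60 top score in the output above.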
from llama_index.core.retrievers import QueryFusionRetriever
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="dist_based_score",
    use_async=True,
    verbose=True,
)
nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)
for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")
Score: 0.42 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.41 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.32 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.30 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.27 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.24 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.24 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.20 - Now we felt like we were really onto something. I had visions of a whole new generation of software ...
-----
Score: 0.20 - Users wouldn't need anything more than a browser. This kind of software, known as a web app, is com...
-----
Score: 0.18 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----
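The distribution-based variant differs mainly in how each retriever's scores are normalized. As a rough sketch, assuming the bounds are taken as the mean plus or minus three standard deviations (using the population standard deviation here is an assumption), it could look like the following. All names and raw scores are hypothetical, and this is not the library's own code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def dist_based_normalize(scores):
    """Scale {node_id: raw_score} using mean ± 3 standard deviations as bounds."""
    if not scores:
        return {}
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    if hi == lo:
        # Assumption: with zero spread, treat every node as a top hit.
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - lo) / (hi - lo) for node_id, s in scores.items()}


def dist_based_score_fusion(results, weights, top_k):
    """Fuse several {node_id: raw_score} dicts using distribution-based scaling."""
    fused = defaultdict(float)
    for scores, weight in zip(results, weights):
        for node_id, s in dist_based_normalize(scores).items():
            fused[node_id] += weight * s
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical raw scores on two different scales.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.75}
bm25_scores = {"b": 7.1, "d": 5.0, "a": 2.9}

fused_dist = dist_based_score_fusion(
    [vector_scores, bm25_scores], [0.6, 0.4], top_k=3
)
for node_id, score in fused_dist:
    print(f"{node_id}: {score:.2f}")
```

Because the mean ± 3σ bounds are usually wider than the observed minimum and maximum, normalized scores rarely reach 0 or 1. That may help explain why the scores in the output above are more compressed (0.42 at the top) than in the relative score run (0.60).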
Use in a Query Engine!
Now, we can plug our retriever into a query engine to synthesize natural language responses.
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What happened at Interleafe and Viaweb?")
from llama_index.core.response.notebook_utils import display_response
display_response(response)
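Final Response:
At Interleaf, there was a group called Release Engineering that was as large as the group writing the software. They had to deal with versions, ports, and other complexities. In contrast, at Viaweb, the software could be updated directly on the server, which simplified the process. Viaweb received $10,000 in seed funding, and its software allowed building a whole store through the browser, without the need for client software or command-line input on the server. The company aimed to be easy to use and inexpensive, offering a low monthly price for its service.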