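Relative Score Fusion and Distribution-Based Score Fusion

In this example, we demonstrate how to use QueryFusionRetriever with two methods that aim to improve on Reciprocal Rank Fusion:

Relative Score Fusion (Weaviate)
Distribution-Based Score Fusion (Mazzeschi: blog post)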
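
For context, Reciprocal Rank Fusion combines result lists using only rank positions and ignores the raw scores, which is the information the two methods above try to make use of. The following is a minimal sketch, not the library's implementation; the function name and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of node ids; k=60 is an assumed, commonly used constant."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; each list contributes 1 / (k + rank) for a node.
        for rank, node_id in enumerate(ranking, start=1):
            fused[node_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical result lists, best match first.
rrf = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
print([node_id for node_id, _ in rrf])
```

Because only positions matter here, a node that wins by a wide score margin is treated the same as one that wins narrowly.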

%pip install llama-index-llms-openai
%pip install llama-index-retrievers-bm25

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

Next, we will set up a vector index over the documents.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256)

index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00,  7.55it/s]
Generating embeddings: 100%|██████████| 504/504 [00:03<00:00, 128.32it/s]

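Create a Hybrid Fusion Retriever using Relative Score Fusion

In this step, we fuse our index with a BM25-based retriever. This lets us capture both semantic relations and keywords in our input queries.

Since both of these retrievers calculate a score, we can use the QueryFusionRetriever to re-sort our nodes without using an additional model or excessive computation.

The following example uses the Relative Score Fusion algorithm from Weaviate, which applies a MinMax scaler to each result set and then takes a weighted sum. Here, we give the vector retriever slightly more weight than BM25 (0.6 vs. 0.4).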
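
The scaling and weighted sum described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the QueryFusionRetriever source: the function names, the toy scores, and the choice to map a constant result set to 1.0 are all hypothetical.

```python
from collections import defaultdict


def min_max_scale(scores):
    """Scale raw scores to [0, 1]; a constant result set maps to 1.0 (assumed)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - lo) / (hi - lo) for node_id, s in scores.items()}


def relative_score_fusion(result_sets, weights, top_k):
    """Fuse {node_id: raw_score} dicts by a weighted sum of min-max scaled scores."""
    fused = defaultdict(float)
    for scores, weight in zip(result_sets, weights):
        for node_id, scaled in min_max_scale(scores).items():
            # A node missing from a result set contributes nothing for it.
            fused[node_id] += weight * scaled
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical raw scores: cosine similarity for vectors, BM25 for keywords.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.75}
bm25_scores = {"b": 7.0, "d": 5.0, "a": 3.0}
fused = relative_score_fusion([vector_scores, bm25_scores], [0.6, 0.4], top_k=3)
for node_id, score in fused:
    print(f"{node_id}: {score:.2f}")
```

In this sketch, a node that ranks first for the vector retriever but last (or not at all) for BM25 cannot exceed the vector weight of 0.6, while a node that ranks well in both lists can score above it.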

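First, we create our retrievers. The vector retriever returns the 5 most similar nodes, and the BM25 retriever returns the 10 highest-scoring nodes.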

from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=10
)

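Next, we can create our fusion retriever, which will return the 10 highest-scoring nodes from the up to 15 nodes returned by the two retrievers.

Note that the vector and BM25 retrievers may return overlapping nodes, just in a different order. If every vector result also appears among the BM25 results, the fusion step simply acts as a re-ranker.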

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=True,
    verbose=True,
)

# apply nested async to run in a notebook
import nest_asyncio

nest_asyncio.apply()

nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)

for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")

Score: 0.60 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.59 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.40 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.36 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.25 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.25 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.21 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.11 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----
Score: 0.11 - The next year, from the summer of 1998 to the summer of 1999, must have been the least productive of...
-----
Score: 0.07 - The point is that it was really cheap, less than half market price.

[8] Most software you can launc...
-----

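Distribution-Based Score Fusion

A variant of Relative Score Fusion, Distribution-Based Score Fusion scales the scores slightly differently: it uses the mean and standard deviation of the scores in each result set.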
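
To make the difference concrete, here is a minimal sketch of distribution-based scaling. It assumes the three-sigma variant, where the scaling bounds are the mean minus and plus three standard deviations; the function name, the use of the population standard deviation, and the toy scores are assumptions for illustration rather than the library's exact code.

```python
from statistics import mean, pstdev


def dist_based_scale(scores):
    """Scale scores using mean - 3*std and mean + 3*std as bounds (assumed variant)."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    if hi == lo:
        # All scores identical: fall back to a constant (hypothetical choice).
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - lo) / (hi - lo) for node_id, s in scores.items()}


# Hypothetical BM25 scores for three nodes.
scaled = dist_based_scale({"a": 1.0, "b": 2.0, "c": 3.0})
for node_id, score in scaled.items():
    print(f"{node_id}: {score:.2f}")
```

Unlike min-max scaling, the best and worst results in a set are not forced to exactly 1 and 0 here. Scores are placed relative to the spread of the whole result set, so the fused scores tend to fall in a narrower range. The weighted sum across retrievers is then applied in the same way as for Relative Score Fusion.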

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    retriever_weights=[0.6, 0.4],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="dist_based_score",
    use_async=True,
    verbose=True,
)

nodes_with_scores = retriever.retrieve(
    "What happened at Interleafe and Viaweb?"
)

for node in nodes_with_scores:
    print(f"Score: {node.score:.2f} - {node.text[:100]}...\n-----")

Score: 0.42 - You wouldn't need versions, or ports, or any of that crap. At Interleaf there had been a whole group...
-----
Score: 0.41 - The UI was horrible, but it proved you could build a whole store through the browser, without any cl...
-----
Score: 0.32 - We were determined to be the Microsoft Word, not the Interleaf. Which meant being easy to use and in...
-----
Score: 0.30 - In its time, the editor was one of the best general-purpose site builders. I kept the code tight and...
-----
Score: 0.27 - To find out, we decided to try making a version of our store builder that you could control through ...
-----
Score: 0.24 - I kept the code tight and didn't have to integrate with any other software except Robert's and Trevo...
-----
Score: 0.24 - If all I'd had to do was work on this software, the next 3 years would have been the easiest of my l...
-----
Score: 0.20 - Now we felt like we were really onto something. I had visions of a whole new generation of software ...
-----
Score: 0.20 - Users wouldn't need anything more than a browser.

This kind of software, known as a web app, is com...
-----
Score: 0.18 - But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that th...
-----

Use in a Query Engine!

Now, we can plug our retriever into a query engine to synthesize natural language responses.

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever)

response = query_engine.query("What happened at Interleafe and Viaweb?")

from llama_index.core.response.notebook_utils import display_response

display_response(response)

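Final Response: At Interleaf, there was a group called Release Engineering that was as large as the group writing the software. They had to deal with versions, ports, and other complexities. In contrast, at Viaweb, the software could be updated directly on the server, simplifying the process. Viaweb was founded with $10,000 in seed funding, and its software allowed building a whole store through the browser without the need for client software or command line inputs on the server. The company aimed to be easy to use and inexpensive, offering low monthly prices for its services.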