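Milvus Vector Store with Hybrid Search

Hybrid search combines the strengths of semantic retrieval and keyword matching to deliver more accurate and contextually relevant results. It is especially effective in complex information retrieval tasks, where neither approach is sufficient on its own.

This notebook demonstrates how to use Milvus for hybrid search in a LlamaIndex RAG pipeline. We start with the recommended default hybrid search (semantic + BM25), then explore alternative sparse embedding methods and customization of the hybrid reranker.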

Prerequisites

Install dependencies

Before getting started, make sure the following dependencies are installed:

! pip install llama-index-vector-stores-milvus
! pip install llama-index-embeddings-openai
! pip install llama-index-llms-openai

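If you are using Google Colab, you may need to restart the runtime (open the "Runtime" menu at the top of the interface and select "Restart session" from the dropdown).

Set up accounts

This tutorial uses OpenAI for text embeddings and answer generation. You need to prepare an OpenAI API key.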

import openai

openai.api_key = "sk-"

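To use the Milvus vector store, specify your Milvus server URI (and optionally a TOKEN). To start a Milvus server, you can follow the Milvus installation guide, or simply try Zilliz Cloud for free.

Full-text search is currently supported in Milvus Standalone, Milvus Distributed, and Zilliz Cloud, but not yet in Milvus Lite (support is planned). For more information, contact [email protected].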

URI = "http://localhost:19530"
# TOKEN = ""
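
As an alternative to hardcoding connection details, they can be read from environment variables. The sketch below assumes two hypothetical variable names, MILVUS_URI and MILVUS_TOKEN (they are not defined by Milvus or LlamaIndex), and falls back to the local default used above.

```python
import os

# MILVUS_URI and MILVUS_TOKEN are hypothetical names chosen for this sketch.
URI = os.environ.get("MILVUS_URI", "http://localhost:19530")
TOKEN = os.environ.get("MILVUS_TOKEN", "")

# Only include the token when one was actually provided.
connection_kwargs = {"uri": URI}
if TOKEN:
    connection_kwargs["token"] = TOKEN
```

The resulting dictionary could then be unpacked into the constructor calls used later in this notebook, for example MilvusVectorStore(**connection_kwargs, dim=1536, enable_sparse=True).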

Load example data

Run the following commands to download the sample document into the "data/paul_graham" directory:

! mkdir -p 'data/paul_graham/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Then use SimpleDirectoryReader to load Paul Graham's essay "What I Worked On":

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Let's take a look at the first document
print("Example document:\n", documents[0])

Example document:
 Doc ID: f9cece8c-9022-46d8-9d0e-f29d70e1dbbe
Text: What I Worked On  February 2021  Before college the two main
things I worked on, outside of school, were writing and programming. I
didn't write essays. I wrote what beginning writers were supposed to
write then, and probably still are: short stories. My stories were
awful. They had hardly any plot, just characters with strong feelings,
which I ...

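Hybrid Search with BM25

This section shows how to perform hybrid search using BM25. First, we initialize the MilvusVectorStore and create an index over the example documents. The default configuration uses:

- Dense embeddings from the default embedding model (OpenAI's text-embedding-ada-002)
- BM25 for full-text search if enable_sparse is True
- RRFRanker with k=60 to combine results if hybrid search is enabled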

# Create an index over the documents
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import StorageContext, VectorStoreIndex


vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    dim=1536,  # vector dimension depends on the embedding model
    enable_sparse=True,  # enable the default full-text search using BM25
    overwrite=True,  # drop the collection if it already exists
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

2025-04-17 03:38:16,645 [DEBUG][_create_connection]: Created new connection using: cf0f4df74b18418bb89ec512063c1244 (async_milvus_client.py:547)
Sparse embedding function is not provided, using default.
Default sparse embedding function: BM25BuiltInFunction(input_field_names='text', output_field_names='sparse_embedding').
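
The default sparse function shown in the log relies on BM25 scoring, which Milvus computes internally. For intuition only, the following is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not the Milvus implementation; the function name bm25_scores, the whitespace tokenization, and the parameter values k1=1.2 and b=0.75 are assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace (an assumption of this sketch).
docs = [
    "viaweb was an online store builder".split(),
    "i wrote short stories before college".split(),
    "we started viaweb to build online stores".split(),
]
scores = bm25_scores("viaweb store".split(), docs)
```

In this toy corpus the first document matches both query terms, the third matches only "viaweb", and the second matches neither, so it receives a score of zero.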

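Below is more information on the parameters for configuring the dense and sparse fields in MilvusVectorStore.

Dense field

- enable_dense (bool): A boolean flag to enable or disable dense embeddings. Defaults to True.
- dim (int, optional): The dimension of the embedding vectors for the collection.
- embedding_field (str, optional): The name of the dense embedding field for the collection. Defaults to DEFAULT_EMBEDDING_KEY.
- index_config (dict, optional): The configuration used for building the dense embedding index. Defaults to None.
- search_config (dict, optional): The configuration used for searching the Milvus dense index. Note that this must be compatible with the index type specified by index_config. Defaults to None.
- similarity_metric (str, optional): The similarity metric to use for dense embeddings. IP, COSINE, and L2 are currently supported.

Sparse field

- enable_sparse (bool): A boolean flag to enable or disable sparse embeddings. Defaults to False.
- sparse_embedding_field (str): The name of the sparse embedding field. Defaults to DEFAULT_SPARSE_EMBEDDING_KEY.
- sparse_embedding_function (Union[BaseSparseEmbeddingFunction, BaseMilvusBuiltInFunction], optional): If enable_sparse is True, this object should be provided to convert text into sparse embeddings. If None, the default sparse embedding function (BM25BuiltInFunction) is used, or BGEM3SparseEmbeddingFunction for an existing collection that has no built-in function.
- sparse_index_config (dict, optional): The configuration used for building the sparse embedding index. Defaults to None.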

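To enable hybrid search at query time, set vector_store_query_mode to "hybrid". This combines and reranks the results from semantic search and full-text search. Let's test it with a sample query: "What did the author learn at Viaweb?"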

import textwrap

query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", similarity_top_k=5
)
response = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(response), 100))

The author learned about retail, the importance of user feedback, and the significance of growth
rate as the ultimate test of a startup at Viaweb.
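
To build intuition for how the dense and full-text result lists are merged by the default RRFRanker with k=60, the following is a minimal sketch of reciprocal rank fusion in plain Python. It is not Milvus code: the function name, the document IDs, and the choice of 1-based ranks are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1 here (an assumption of this sketch).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return IDs ordered from highest to lowest fused score.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical result lists from the dense and sparse searches.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(list(fused))
```

Because only ranks are used, RRF does not require the dense and sparse scores to be on comparable scales. A larger k flattens the difference between top and lower ranks.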

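Customizing the text analyzer

Analyzers play a vital role in full-text search by breaking sentences into tokens and performing lexical processing such as stemming and stop-word removal. They are usually language-specific. See the Milvus analyzer guide for more details.

Milvus supports two types of analyzers: **built-in analyzers** and **custom analyzers**. By default, if enable_sparse is set to True, MilvusVectorStore uses BM25BuiltInFunction with its default configuration, which applies the standard built-in analyzer and tokenizes text based on punctuation.

To use a different analyzer or customize the existing one, pass a value for the analyzer_params argument when building the BM25BuiltInFunction. Then set this function as the sparse_embedding_function in MilvusVectorStore.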

from llama_index.vector_stores.milvus.utils import BM25BuiltInFunction

bm25_function = BM25BuiltInFunction(
    analyzer_params={
        "tokenizer": "standard",
        "filter": [
            "lowercase",  # Built-in filter
            {"type": "length", "max": 40},  # Custom cap size of a single token
            {"type": "stop", "stop_words": ["of", "to"]},  # Custom stopwords
        ],
    },
    enable_match=True,
)

vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    dim=1536,
    enable_sparse=True,
    sparse_embedding_function=bm25_function,  # BM25 with custom analyzer
    overwrite=True,
)

2025-04-17 03:38:48,085 [DEBUG][_create_connection]: Created new connection using: 61afd81600cb46ee89f887f16bcbfe55 (async_milvus_client.py:547)
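
The analyzer itself runs inside Milvus, so its effect is not directly visible from the notebook. As a rough approximation of what the configuration above asks for, the sketch below mimics the same pipeline in plain Python. The regex tokenizer is only a stand-in for the "standard" tokenizer, token length is measured in characters, and the function name analyze is hypothetical; actual Milvus behavior may differ in details.

```python
import re
from typing import Iterable, List


def analyze(
    text: str, max_len: int = 40, stop_words: Iterable[str] = ("of", "to")
) -> List[str]:
    """Approximate tokenizer + lowercase + length + stop-word filters."""
    stop = set(stop_words)
    # Rough stand-in for the "standard" tokenizer: split on non-word characters.
    tokens = re.findall(r"\w+", text)
    # "lowercase" filter.
    tokens = [t.lower() for t in tokens]
    # "length" filter: drop tokens longer than max_len characters.
    tokens = [t for t in tokens if len(t) <= max_len]
    # "stop" filter: drop the configured stop words.
    return [t for t in tokens if t not in stop]


tokens = analyze("The Future of Software: moving TO the Web!")
print(tokens)
```

Note that "the" survives because the custom stop list contains only "of" and "to"; the stop filter removes exactly the words it is given.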

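Hybrid Search with Other Sparse Embeddings

In addition to combining semantic search with BM25, Milvus also supports hybrid search using a sparse embedding function such as BGE-M3. The following example uses the built-in BGEM3SparseEmbeddingFunction to generate sparse embeddings.

First, install the FlagEmbedding package: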

! pip install -q FlagEmbedding

Then build the vector store and index, using the default OpenAI model for dense embeddings and the built-in BGE-M3 for sparse embeddings:

from llama_index.vector_stores.milvus.utils import BGEM3SparseEmbeddingFunction

vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    dim=1536,
    enable_sparse=True,
    sparse_embedding_function=BGEM3SparseEmbeddingFunction(),
    overwrite=True,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 68871.99it/s]
2025-04-17 03:39:02,074 [DEBUG][_create_connection]: Created new connection using: ff4886e2f8da44e08304b748d9ac9b51 (async_milvus_client.py:547)
Chunks: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]

Now let's run a hybrid search query with a sample question:

query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", similarity_top_k=5
)
response = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(response), 100))

Chunks: 100%|██████████| 1/1 [00:00<00:00, 17.29it/s]

The author learned about retail, the importance of user feedback, the value of growth rate in a
startup, the significance of pricing strategy, the benefits of working on things that weren't
prestigious, and the challenges and rewards of running a startup.

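Customizing the sparse embedding function

You can also customize the sparse embedding function, as long as it inherits from BaseSparseEmbeddingFunction and implements the following methods:

- encode_queries: converts texts into a list of sparse embeddings for queries.
- encode_documents: converts texts into a list of sparse embeddings for documents.

The output of each method should follow the sparse embedding format, which is a list of dictionaries. Each dictionary maps a dimension (an integer key) to the magnitude of the embedding in that dimension (a float value), for example {1: 0.5, 2: 0.3}.

For example, here is a custom sparse embedding function implementation using BGE-M3: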

from FlagEmbedding import BGEM3FlagModel
from typing import List
from llama_index.vector_stores.milvus.utils import BaseSparseEmbeddingFunction


class ExampleEmbeddingFunction(BaseSparseEmbeddingFunction):
    def __init__(self):
        self.model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)

    def encode_queries(self, queries: List[str]):
        outputs = self.model.encode(
            queries,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def encode_documents(self, documents: List[str]):
        outputs = self.model.encode(
            documents,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def _to_standard_dict(self, raw_output):
        result = {}
        for k in raw_output:
            result[int(k)] = raw_output[k]
        return result
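
The class above requires downloading the BGE-M3 model. To experiment with the expected output format without any model, the following is a toy sketch that hashes tokens into integer dimensions and uses normalized term frequencies as values. Everything in it (ToyHashingSparseEncoder, is_valid_sparse_batch, the CRC32 hashing scheme) is hypothetical and not part of LlamaIndex or Milvus; it deliberately does not subclass BaseSparseEmbeddingFunction so that it runs standalone.

```python
import re
import zlib
from collections import Counter
from typing import Dict, List


class ToyHashingSparseEncoder:
    """Toy encoder exposing the same two methods and output format."""

    def __init__(self, num_dims: int = 2**16):
        self.num_dims = num_dims

    def _encode(self, text: str) -> Dict[int, float]:
        counts = Counter(re.findall(r"\w+", text.lower()))
        total = sum(counts.values()) or 1
        vector: Dict[int, float] = {}
        for token, count in counts.items():
            # CRC32 gives a hash that is stable across Python processes.
            dim = zlib.crc32(token.encode("utf-8")) % self.num_dims
            # Colliding tokens share a dimension, so their weights add up.
            vector[dim] = vector.get(dim, 0.0) + count / total
        return vector

    def encode_queries(self, queries: List[str]) -> List[Dict[int, float]]:
        return [self._encode(q) for q in queries]

    def encode_documents(self, documents: List[str]) -> List[Dict[int, float]]:
        return [self._encode(d) for d in documents]


def is_valid_sparse_batch(batch) -> bool:
    """Check the format: a list of {int dimension: float magnitude} dicts."""
    return isinstance(batch, list) and all(
        isinstance(vec, dict)
        and all(isinstance(k, int) and isinstance(v, float) for k, v in vec.items())
        for vec in batch
    )


encoder = ToyHashingSparseEncoder()
doc_vectors = encoder.encode_documents(["Viaweb was a startup", "Viaweb Viaweb"])
print(is_valid_sparse_batch(doc_vectors))
```

To plug such an encoder into MilvusVectorStore, it would need to inherit from BaseSparseEmbeddingFunction as the BGE-M3 example does. A hashing encoder like this carries no semantic information, so it is useful only for checking the data format.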

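Customizing the hybrid reranker

Milvus supports two reranking strategies: Reciprocal Rank Fusion (RRF) and weighted scoring. The default reranker in MilvusVectorStore hybrid search is RRF with k=60. To customize the hybrid reranker, modify the following parameters:

- hybrid_ranker (str): Specifies the type of reranker used in hybrid search queries. Currently only ["RRFRanker", "WeightedRanker"] are supported. Defaults to "RRFRanker".
- hybrid_ranker_params (dict, optional): Configuration parameters for the hybrid reranker. The structure of this dictionary depends on the reranker being used:
  - For "RRFRanker", it should include:
    - "k" (int): The parameter used in Reciprocal Rank Fusion (RRF). This value is used when computing the rank scores in the RRF algorithm, which combines multiple ranking strategies into a single score to improve search relevance. If not specified, the default value is 60.
  - For "WeightedRanker", it expects:
    - "weights" (list of float): A list of exactly two weights:
      1. The weight for the dense embedding component.
      2. The weight for the sparse embedding component.

      These weights balance the importance of the dense and sparse embedding components in the hybrid retrieval process. If not specified, the default weights are [1.0, 1.0].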

vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    dim=1536,
    overwrite=False,  # Use the existing collection created in the previous example
    enable_sparse=True,
    hybrid_ranker="WeightedRanker",
    hybrid_ranker_params={"weights": [1.0, 0.5]},
)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", similarity_top_k=5
)
response = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(response), 100))

2025-04-17 03:44:00,419 [DEBUG][_create_connection]: Created new connection using: 09c051fb18c04f97a80f07958856587b (async_milvus_client.py:547)
Sparse embedding function is not provided, using default.
No built-in function detected, using BGEM3SparseEmbeddingFunction().
Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 136622.28it/s]
Chunks: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]

The author learned several valuable lessons at Viaweb, including the importance of understanding
growth rate as the ultimate test of a startup, the significance of user feedback in shaping the
software, and the realization that web applications were the future of software development.
Additionally, the experience at Viaweb taught the author about the challenges and rewards of running
a startup, the value of simplicity in software design, and the impact of pricing strategies on
attracting customers.
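
To illustrate what the weights [1.0, 0.5] used above express, the following is a minimal sketch of weighted score fusion in plain Python. It assumes the dense and sparse scores have already been normalized to a comparable range such as [0, 1]; Milvus handles score normalization internally, and that step is not reproduced here. The function name, document IDs, and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_fusion(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    weights: Tuple[float, float] = (1.0, 1.0),
) -> List[Tuple[str, float]]:
    """Combine two score maps as w_dense * dense + w_sparse * sparse."""
    w_dense, w_sparse = weights
    ids = set(dense) | set(sparse)
    # A document missing from one result list contributes 0 for that component.
    combined = {
        i: w_dense * dense.get(i, 0.0) + w_sparse * sparse.get(i, 0.0) for i in ids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical, already-normalized scores from each search.
dense_scores = {"doc_a": 0.9, "doc_b": 0.6}
sparse_scores = {"doc_b": 0.8, "doc_c": 0.7}
ranked = weighted_fusion(dense_scores, sparse_scores, weights=(1.0, 0.5))
print([doc_id for doc_id, _ in ranked])
```

With these weights the sparse component counts half as much as the dense one, so a document found only by the sparse search (doc_c here) ranks below documents with strong dense scores.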