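Milvus Vector Store with Full-Text Search

Full-text search uses exact keyword matching, often leveraging algorithms such as BM25 to rank documents by relevance. In Retrieval-Augmented Generation (RAG) systems, this method retrieves relevant text to ground AI-generated responses.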
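
To make the ranking idea concrete, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration only, not the implementation Milvus uses: the whitespace tokenizer, the helper name `bm25_scores`, the sample documents, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Length normalization: longer documents are penalized via `b`.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "viaweb was an online store builder",
    "painting and writing essays",
    "the store builder ran on the server",
]
scores = bm25_scores("store builder", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Documents that contain none of the query terms score zero, and among matching documents the shorter one ranks higher because of length normalization.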

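Semantic search, by contrast, interprets contextual meaning to return broader results. Combining the two approaches produces a hybrid search that improves information retrieval, especially where either method alone falls short.

With the Sparse-BM25 approach in Milvus 2.5, raw text is automatically converted into sparse vectors. This removes the need to generate sparse embeddings manually and enables a hybrid search strategy that balances semantic understanding with keyword relevance.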
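
As a rough picture of what "converting text into a sparse vector" means, the sketch below maps each distinct term to an index and stores only the non-zero counts. This is a simplified illustration under our own assumptions (whitespace tokenization, a hypothetical `to_sparse_vector` helper, raw term counts as weights); the representation Milvus builds internally may differ.

```python
from collections import Counter


def to_sparse_vector(text, vocabulary):
    """Map text to {term_index: term_count}, growing `vocabulary` as needed."""
    counts = Counter(text.lower().split())
    sparse = {}
    for term, count in counts.items():
        # Assign the next free index to terms we have not seen before.
        index = vocabulary.setdefault(term, len(vocabulary))
        sparse[index] = float(count)
    return sparse


vocabulary = {}
vec = to_sparse_vector("growth rate is the test the startup", vocabulary)
print(vec)
```

Only terms that actually occur in the text get an entry, which is why the vector stays small even when the vocabulary is large.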

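In this tutorial, you will learn how to use LlamaIndex and Milvus to build a RAG system that uses full-text search and hybrid search. We start by implementing full-text search on its own and then enhance it by integrating semantic search for more comprehensive results.

Before proceeding, make sure you are familiar with full-text search and with the basics of using Milvus in LlamaIndex.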

Prerequisites

Install dependencies

Before getting started, make sure the following dependencies are installed:

%pip install llama-index-vector-stores-milvus
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-openai

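If you are using Google Colab, you may need to restart the runtime (open the "Runtime" menu at the top of the interface and select "Restart session" from the dropdown).

Set up accounts

This tutorial uses OpenAI for text embeddings and answer generation. You need to prepare an OpenAI API key.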

import openai

openai.api_key = "sk-"

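To use the Milvus vector store, specify your Milvus server URI (and optionally a TOKEN). To start a Milvus server, follow the Milvus installation guide, or try Zilliz Cloud for free.

Full-text search is currently supported in Milvus Standalone, Milvus Distributed, and Zilliz Cloud, but not yet in Milvus Lite (planned for a future release). Contact [email protected] for more information.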

URI = "https://:19530"
# TOKEN = ""
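
If you would rather not hard-code connection details in the notebook, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names `MILVUS_URI` and `MILVUS_TOKEN`, the helper `load_milvus_config`, and the fallback address `http://localhost:19530` (a typical local Milvus Standalone endpoint) are assumptions made for this example, not part of the original tutorial.

```python
import os


def load_milvus_config(env=os.environ):
    """Return (uri, token) from environment-style settings.

    MILVUS_URI and MILVUS_TOKEN are hypothetical names used in this sketch.
    """
    # Fall back to an assumed local Milvus Standalone address.
    uri = env.get("MILVUS_URI", "http://localhost:19530")
    # Treat a missing or empty token as "no token".
    token = env.get("MILVUS_TOKEN") or None
    return uri, token


URI, TOKEN = load_milvus_config()
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.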

Download example data

Run the following commands to download the sample document into the "data/paul_graham" directory:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2025-03-27 07:49:01--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.07s   

2025-03-27 07:49:01 (1.01 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

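RAG with Full-Text Search

Integrating full-text search into a RAG system balances semantic search with precise, predictable keyword-based retrieval. You can also use full-text search on its own, but combining it with semantic search is recommended for better results. For demonstration purposes, this tutorial shows both full-text search alone and hybrid search.

To get started, use SimpleDirectoryReader to load the essay "What I Worked On" by Paul Graham: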

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Let's take a look at the first document
print("Example document:\n", documents[0])

Example document:
 Doc ID: 16b7942f-bf1a-4197-85e1-f31d51ea25a9
Text: What I Worked On  February 2021  Before college the two main
things I worked on, outside of school, were writing and programming. I
didn't write essays. I wrote what beginning writers were supposed to
write then, and probably still are: short stories. My stories were
awful. They had hardly any plot, just characters with strong feelings,
which I ...

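Full-Text Search with BM25

LlamaIndex's MilvusVectorStore supports full-text search, enabling efficient keyword-based retrieval. When a built-in function is passed as the sparse_embedding_function, it applies BM25 scoring to rank search results.

In this section, we demonstrate how to implement a RAG system that uses BM25 for full-text search.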

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.vector_stores.milvus.utils import BM25BuiltInFunction
from llama_index.core import Settings

# Skip dense embedding model
Settings.embed_model = None

# Build Milvus vector store creating a new collection
vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    enable_dense=False,
    enable_sparse=True,  # Only enable sparse to demo full text search
    sparse_embedding_function=BM25BuiltInFunction(),
    overwrite=True,
)

# Store documents in Milvus
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Embeddings have been explicitly disabled. Using MockEmbedding.

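The code above inserts the example documents into Milvus and builds an index that enables BM25 ranking for full-text search. It disables dense embeddings and uses BM25BuiltInFunction with its default parameters.

You can specify the input and output fields through the BM25BuiltInFunction parameters:

input_field_names (str): The input text field (default: "text"). It indicates which text field the BM25 algorithm is applied to. Change this if you use your own collection with a different text field name.
output_field_names (str): The field in which the output of this BM25 function is stored (default: "sparse_embedding").

Once the vector store is set up, you can run full-text search queries against Milvus with the query mode set to "sparse" or "text_search":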

import textwrap

query_engine = index.as_query_engine(
    vector_store_query_mode="sparse", similarity_top_k=5
)
answer = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(answer), 100))

The author learned several important lessons at Viaweb. They learned about the importance of growth
rate as the ultimate test of a startup, the value of building stores for users to understand retail
and software usability, and the significance of being the "entry level" option in a market.
Additionally, they discovered the accidental success of making Viaweb inexpensive, the challenges of
hiring too many people, and the relief felt when the company was acquired by Yahoo.

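Customize text analyzer

Analyzers play a vital role in full-text search by breaking sentences into tokens and performing lexical processing such as stemming and stop-word removal. They are typically language-specific. For more details, see the Milvus analyzer guide.

Milvus supports two types of analyzers: built-in analyzers and custom analyzers. By default, BM25BuiltInFunction uses the standard built-in analyzer, which tokenizes text on whitespace and punctuation.

To use a different analyzer or customize an existing one, pass a value to the analyzer_params argument: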

bm25_function = BM25BuiltInFunction(
    analyzer_params={
        "tokenizer": "standard",
        "filter": [
            "lowercase",  # Built-in filter
            {"type": "length", "max": 40},  # Custom cap size of a single token
            {"type": "stop", "stop_words": ["of", "to"]},  # Custom stopwords
        ],
    },
    enable_match=True,
)
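
To get a feel for what this filter chain does to a piece of text, the following pure-Python sketch imitates the same steps. It is only an approximation for illustration and does not call Milvus: the `analyze` helper and the regular expression used as a stand-in for the standard tokenizer are our own assumptions, and the real analyzer may split some inputs differently.

```python
import re


def analyze(text, max_len=40, stop_words=("of", "to")):
    """Rough stand-in for: standard tokenizer -> lowercase -> length -> stop."""
    # Approximate the tokenizer by extracting runs of word characters.
    tokens = re.findall(r"\w+", text)
    # "lowercase" filter
    tokens = [t.lower() for t in tokens]
    # "length" filter: drop tokens longer than max_len characters
    tokens = [t for t in tokens if len(t) <= max_len]
    # "stop" filter: drop the configured stop words
    return [t for t in tokens if t not in stop_words]


tokens = analyze("The History of Viaweb: How to Build Stores")
print(tokens)
```

Note that "the" survives here because the custom stop list contains only "of" and "to".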

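Hybrid Search with Reranker

A hybrid search system combines semantic search and full-text search to improve retrieval performance in a RAG system.

The following example uses OpenAI embeddings for semantic search and BM25 for full-text search: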

# Create index over the documents
vector_store = MilvusVectorStore(
    uri=URI,
    # token=TOKEN,
    # enable_dense=True,  # enable_dense defaults to True
    dim=1536,
    enable_sparse=True,
    sparse_embedding_function=BM25BuiltInFunction(),
    overwrite=True,
    # hybrid_ranker="RRFRanker",  # hybrid_ranker defaults to "RRFRanker"
    # hybrid_ranker_params={},  # hybrid_ranker_params defaults to {}
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model="default",  # "default" will use OpenAI embedding
)

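How it works

This approach stores documents in a Milvus collection with two vector fields:

embedding: Dense embeddings generated by the OpenAI embedding model, used for semantic search.
sparse_embedding: Sparse embeddings computed by BM25BuiltInFunction, used for full-text search.

In addition, a reranking strategy is applied using "RRFRanker" with its default parameters. To customize the reranker, configure hybrid_ranker and hybrid_ranker_params following the Milvus reranking guide.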
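
The idea behind reciprocal rank fusion (RRF) can be sketched in a few lines of plain Python. This is a conceptual illustration, not the code Milvus runs: the helper `rrf_fuse`, the sample result lists, and the smoothing constant `k=60` (a commonly used value) are assumptions made for this example.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["a", "b", "c"]   # hypothetical semantic-search results
sparse_hits = ["b", "d", "a"]  # hypothetical full-text-search results
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it can combine dense similarity scores and BM25 scores without needing them to be on the same scale.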

Now, let's test the RAG system with a sample query:

# Query
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", similarity_top_k=5
)
answer = query_engine.query("What did the author learn at Viaweb?")
print(textwrap.fill(str(answer), 100))

The author learned several important lessons at Viaweb. These included the importance of
understanding growth rate as the ultimate test of a startup, the impact of hiring too many people,
the challenges of being at the mercy of investors, and the relief experienced when Yahoo bought the
company. Additionally, the author learned about the significance of user feedback, the value of
building stores for users, and the realization that growth rate is crucial for the long-term success
of a startup.

By drawing on both semantic and keyword-based retrieval, this hybrid approach helps the RAG system produce more accurate, context-aware responses.