If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]
%pip install llama-index-vector-stores-milvus
In [ ]
%pip install llama-index
This notebook will use Milvus Lite, which requires a more recent version of pymilvus:
In [ ]
%pip install "pymilvus>=2.4.2"
If you are using Google Colab, to enable the dependencies just installed, you may need to restart the runtime (click the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).
Setup OpenAI
First, add your OpenAI API key. This will allow us to access ChatGPT.
In [ ]
import openai
openai.api_key = "sk-***********"
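If you prefer not to hard-code the key, llama-index's OpenAI integrations also read it from the environment. A minimal alternative sketch, assuming the standard OPENAI_API_KEY variable:

import os

# Read by the OpenAI client (and llama-index's OpenAI wrappers) when no
# key is passed explicitly.
os.environ["OPENAI_API_KEY"] = "sk-***********"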
Prepare data
You can download the sample data with the following commands:
In [ ]
! mkdir -p "data/"
! wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "data/paul_graham_essay.txt"
! wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf" -O "data/uber_2021.pdf"
In [ ]
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader(
input_files=["./data/paul_graham_essay.txt"]
).load_data()
print("Document ID:", documents[0].doc_id)
Document ID: 95f25e4d-f270-4650-87ce-006d69d82033
Create an index over the data
Now that we have a document, we can create an index and insert the document. For the index we will use a MilvusVectorStore. MilvusVectorStore takes a number of arguments, grouped below; a short hybrid-search sketch follows the list.
Basic Args

- uri (str, optional): The URI to connect to, in the form of "https://address:port" for a Milvus or Zilliz Cloud service, or "path/to/local/milvus.db" for the local lite Milvus. Defaults to "./milvus_llamaindex.db".
- token (str, optional): The token for log in. Empty if not using rbac; if using rbac, it will most likely be "username:password".
- collection_name (str, optional): The name of the collection where data will be stored. Defaults to "llamalection".
- overwrite (bool, optional): Whether to overwrite an existing collection with the same name. Defaults to False.

Scalar fields including doc id & text

- doc_id_field (str, optional): The name of the doc_id field for the collection. Defaults to DEFAULT_DOC_ID_KEY.
- text_key (str, optional): The key under which text is stored in the passed-in collection. Used when bringing your own collection. Defaults to DEFAULT_TEXT_KEY.
- scalar_field_names (list, optional): The names of the extra scalar fields to be included in the collection schema.
- scalar_field_types (list, optional): The types of the extra scalar fields.

Dense field

- enable_dense (bool): A boolean flag to enable or disable dense embedding. Defaults to True.
- dim (int, optional): The dimension of the embedding vectors for the collection. Required when creating a new collection with enable_sparse set to False.
- embedding_field (str, optional): The name of the dense embedding field for the collection. Defaults to DEFAULT_EMBEDDING_KEY.
- index_config (dict, optional): The configuration used for building the dense embedding index. Defaults to None.
- search_config (dict, optional): The configuration used for searching the Milvus dense index. Note that this must be compatible with the index type specified by index_config. Defaults to None.
- similarity_metric (str, optional): The similarity metric to use for dense embedding; currently IP, COSINE and L2 are supported.

Sparse field

- enable_sparse (bool): A boolean flag to enable or disable sparse embedding. Defaults to False.
- sparse_embedding_field (str): The name of the sparse embedding field. Defaults to DEFAULT_SPARSE_EMBEDDING_KEY.
- sparse_embedding_function (Union[BaseSparseEmbeddingFunction, BaseMilvusBuiltInFunction], optional): If enable_sparse is True, this object should be provided to convert text to a sparse embedding. If None, the default sparse embedding function (BM25BuiltInFunction) will be used.
- sparse_index_config (dict, optional): The configuration used to build the sparse embedding index. Defaults to None.

Hybrid ranker

- hybrid_ranker (str): Specifies the type of ranker used in hybrid search queries. Currently only ["RRFRanker", "WeightedRanker"] are supported. Defaults to "RRFRanker".
- hybrid_ranker_params (dict, optional): Configuration parameters for the hybrid ranker. The structure of this dictionary depends on the specific ranker in use:
  - For "RRFRanker", it should include:
    - "k" (int): A parameter used in Reciprocal Rank Fusion (RRF). This value is used to calculate the rank scores in the RRF algorithm, which combines multiple ranking strategies into a single score to improve search relevance.
  - For "WeightedRanker", it expects:
    - "weights" (list of float): A list of exactly two weights: the weight for the dense embedding component, and the weight for the sparse embedding component. These weights adjust the relative importance of the dense and sparse components of the embeddings in the hybrid retrieval process.
  Defaults to an empty dict, meaning the ranker operates with its predefined default settings.

Others

- collection_properties (dict, optional): The collection properties, such as TTL (Time-To-Live) and MMAP (memory mapping). Defaults to None. It can include:
  - "collection.ttl.seconds" (int): Once this property is set, data in the current collection expires after the specified time. Expired data in the collection is cleaned up and no longer participates in searches or queries.
  - "mmap.enabled" (bool): Whether to enable memory-mapped storage at the collection level.
- index_management (IndexManagement): Specifies which index management strategy to use. Defaults to "create_if_not_exists".
- batch_size (int): Configures the number of documents processed per batch when inserting data into Milvus. Defaults to DEFAULT_BATCH_SIZE.
- consistency_level (str, optional): Which consistency level to use for a newly created collection. Defaults to "Session".
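As a hedged illustration of the hybrid-search arguments above, a minimal sketch (not executed in this notebook; the file name and the RRF constant are placeholder choices):

from llama_index.vector_stores.milvus import MilvusVectorStore

# Sketch: a new collection with both a dense and a sparse field. With
# enable_sparse=True and no sparse_embedding_function given, the default
# BM25BuiltInFunction is used; RRFRanker fuses the two result lists.
hybrid_store = MilvusVectorStore(
    uri="./milvus_hybrid.db",  # placeholder Milvus Lite file
    dim=1536,  # dense embedding dimension
    enable_sparse=True,
    hybrid_ranker="RRFRanker",  # or "WeightedRanker"
    hybrid_ranker_params={"k": 60},  # assumed RRF constant
    overwrite=True,
)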
In [ ]
# Create an index over the documents
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
vector_store = MilvusVectorStore(
uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
About the arguments of MilvusVectorStore:

- Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on docker or kubernetes. In this setup, please use the server uri, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.
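Spelled out as code, the three deployment options above look like this (a sketch; the endpoint and key values are placeholders):

from llama_index.vector_stores.milvus import MilvusVectorStore

# 1. Milvus Lite: everything lives in a local file.
vector_store = MilvusVectorStore(uri="./milvus.db", dim=1536)

# 2. Self-hosted Milvus server on docker or kubernetes.
vector_store = MilvusVectorStore(uri="http://localhost:19530", dim=1536)

# 3. Zilliz Cloud: Public Endpoint plus API key.
vector_store = MilvusVectorStore(
    uri="https://<public-endpoint>",  # placeholder
    token="<api-key>",  # placeholder
    dim=1536,
)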
Query the data
Now that our document is stored in the index, we can ask questions against the index. The index will use the data stored in itself as the knowledge base for ChatGPT.
In [ ]
query_engine = index.as_query_engine()
res = query_engine.query("What did the author learn?")
print(res)
The author learned that philosophy courses in college were boring to him, leading him to switch his focus to studying AI.
In [ ]
res = query_engine.query(
"What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in her losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.
This next test shows that overwriting removes the previous data.
In [ ]
from llama_index.core import Document
vector_store = MilvusVectorStore(
uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
[Document(text="The number that is being searched for is ten.")],
storage_context,
)
query_engine = index.as_query_engine()
res = query_engine.query("Who is the author?")
print(res)
The author is the individual who created the context information.
The next test shows adding additional data to an already existing index.
In [ ]
del index, vector_store, storage_context, query_engine
vector_store = MilvusVectorStore(uri="./milvus_demo.db", overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
query_engine = index.as_query_engine()
res = query_engine.query("What is the number?")
print(res)
The number is ten.
In [ ]
res = query_engine.query("Who is the author?")
print(res)
Paul Graham
Metadata filtering
We can generate results by filtering specific sources. The following example illustrates loading all the documents from the directory and then filtering them based on metadata.
In [ ]
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# Load all the two documents loaded before
documents_all = SimpleDirectoryReader("./data/").load_data()
vector_store = MilvusVectorStore(
uri="./milvus_demo.db", dim=1536, overwrite=True
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents_all, storage_context)
We only want to retrieve documents from the file uber_2021.pdf.
In [ ]
filters = MetadataFilters(
filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query(
"What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges related to the adverse impact on the business and operations, including reduced demand for Mobility offerings globally, affecting travel behavior and demand. Additionally, the pandemic led to driver supply constraints, impacted by concerns regarding COVID-19, with uncertainties about when supply levels would return to normal. The rise of the Omicron variant further affected travel, resulting in advisories and restrictions that could adversely impact both driver supply and consumer demand for Mobility offerings.
We get a different result this time when retrieving from the file paul_graham_essay.txt.
In [ ]
filters = MetadataFilters(
filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]
)
query_engine = index.as_query_engine(filters=filters)
res = query_engine.query(
"What challenges did the disease pose for the author?"
)
print(res)
The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in his mother losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.
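MetadataFilters is not limited to a single condition. A minimal sketch of combining two filters, assuming the FilterCondition enum exported by llama_index.core.vector_stores:

from llama_index.core.vector_stores import (
    ExactMatchFilter,
    FilterCondition,
    MetadataFilters,
)

# Match chunks coming from either file (OR semantics).
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="file_name", value="uber_2021.pdf"),
        ExactMatchFilter(key="file_name", value="paul_graham_essay.txt"),
    ],
    condition=FilterCondition.OR,
)
query_engine = index.as_query_engine(filters=filters)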