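Baidu VectorDB¶

Baidu VectorDB is an enterprise-level distributed database service developed and fully managed by Baidu Intelligent Cloud. It is built to store, retrieve, and analyze multi-dimensional vector data. At its core is Baidu's proprietary "Mochow" vector database kernel, which provides high performance, availability, and security, along with scalability and ease of use.

The service supports multiple index types and similarity calculation methods to cover a range of use cases. A standout feature of VectorDB is its ability to manage up to 10 billion vectors while sustaining millions of queries per second (QPS) at millisecond-level query latency.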
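The passage above does not name the similarity methods, so the following is only a minimal sketch of three metrics commonly offered by vector databases (inner product, L2 distance, cosine similarity). The function names are hypothetical and are not part of the Baidu VectorDB or `pymochow` API.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of the normalized vectors; ignores vector length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
doc_a = [2.0, 0.0]  # same direction as the query, different length
doc_b = [0.0, 1.0]  # orthogonal to the query

print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
print(l2_distance(query, doc_a), inner_product(query, doc_a))
```

Cosine similarity treats `doc_a` as a perfect match because it only compares direction, while L2 distance and inner product are also sensitive to vector length.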

This notebook shows the basic usage of Baidu VectorDB as a vector store in LlamaIndex.

To run it, you need a database instance.

Setup¶

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-vector-stores-baiduvectordb

!pip install llama-index

!pip install pymochow

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.vector_stores.baiduvectordb import (
    BaiduVectorDB,
    TableParams,
    TableField,
)
import pymochow

Please provide OpenAI access key¶

To use the embeddings provided by OpenAI, you need to supply an OpenAI API key:

import getpass

import openai

# Prompt for the key so it is never stored in the notebook itself
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
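When the notebook runs unattended, an interactive prompt is inconvenient. The following is a small sketch of one alternative, assuming the key may already be exported as an environment variable; `load_api_key` is a hypothetical helper, not part of the `openai` or LlamaIndex packages.

```python
import getpass
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive, non-echoing prompt
        key = getpass.getpass(f"{env_var}: ")
    return key
```

It could then be used as `openai.api_key = load_api_key()`.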

Download Data¶

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Creating and populating the Vector Store¶

You will now load an essay by Paul Graham from a local file and store it in Baidu VectorDB.

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Initialize the Baidu VectorDB¶

Creating the vector store also creates the underlying database collection if it does not exist yet:

vector_store = BaiduVectorDB(
    endpoint="http://192.168.X.X",
    api_key="*******",
    # drop_exists=True drops any existing table first, so the demo starts clean
    table_params=TableParams(dimension=1536, drop_exists=True),
)

Now wrap this store in a LlamaIndex index abstraction for later querying:

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

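Note that the from_documents call above does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes an embedding vector for each node, and stores them all in Baidu VectorDB.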
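The three steps listed above can be pictured with a toy pipeline. This is only an illustrative sketch, not how LlamaIndex implements them: the fixed-size character splitter, the `toy_embed` stand-in for a real embedding model, and the in-memory list standing in for the database are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A chunk of text plus its embedding vector."""

    text: str
    embedding: List[float] = field(default_factory=list)


def split_into_nodes(text: str, chunk_size: int = 40) -> List[Node]:
    """Step 1: cut the document into fixed-size character chunks."""
    return [
        Node(text[start : start + chunk_size])
        for start in range(0, len(text), chunk_size)
    ]


def toy_embed(text: str, dimension: int = 8) -> List[float]:
    """Step 2: stand-in embedding that counts characters per bucket."""
    vector = [0.0] * dimension
    for char in text:
        vector[ord(char) % dimension] += 1.0
    return vector


def ingest(text: str, store: List[Node]) -> int:
    """Run split -> embed -> store and return the number of nodes written."""
    nodes = split_into_nodes(text)
    for node in nodes:
        node.embedding = toy_embed(node.text)
    store.extend(nodes)  # Step 3: persist (here, just an in-memory list)
    return len(nodes)


store: List[Node] = []
count = ingest("a" * 100, store)
print(count, [len(node.text) for node in store])
```

In the real call, the splitter is a sentence-aware node parser, the embedding comes from the configured embedding model, and the store is the Baidu VectorDB table.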

Querying the store¶

Basic querying¶

query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)

MMR-based queries¶

The MMR (maximal marginal relevance) method fetches text chunks from the store that are both relevant to the query and as different as possible from each other, with the goal of providing broader context for building the final answer:

query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
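The trade-off between relevance and diversity described above can be sketched as a greedy selection loop. This is a generic illustration of maximal marginal relevance on toy 2-D vectors, not the code path used by LlamaIndex or Baidu VectorDB; `mmr_select` and its `lambda_mult` parameter are names chosen for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    top_k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick candidate indices, balancing relevance and novelty.

    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    candidates that resemble something already selected.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_k:
        best_idx, best_score = remaining[0], -math.inf
        for idx in remaining:
            relevance = cosine(query, candidates[idx])
            # Similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[idx], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

print(mmr_select(query, candidates, top_k=2, lambda_mult=1.0))
print(mmr_select(query, candidates, top_k=2, lambda_mult=0.3))
```

With `lambda_mult=1.0` the two nearest vectors win even though they are near-duplicates; with `lambda_mult=0.3` the second pick shifts to the less similar third candidate.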

Connecting to an existing store¶

Since this store is backed by Baidu VectorDB, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:

new_vector_store = BaiduVectorDB(
    endpoint="http://192.168.X.X",
    api_key="*******",
    # drop_exists=False keeps the previously stored vectors
    table_params=TableParams(dimension=1536, drop_exists=False),
)

# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_vector_store
)

# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response)

Metadata filtering¶

The Baidu VectorDB vector store supports metadata filtering at query time, in the form of exact-match key=value pairs. The following cells, which work on a brand new collection, demonstrate this feature.

In this demo, for the sake of brevity, a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.

# Declare the metadata fields that queries may filter on
filter_fields = [
    TableField(name="source_type"),
]

md_storage_context = StorageContext.from_defaults(
    vector_store=BaiduVectorDB(
        endpoint="http://192.168.X.X",
        api_key="*******",
        table_params=TableParams(
            dimension=1536, drop_exists=True, filter_fields=filter_fields
        ),
    )
)


def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[MetadataFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "How long it took the author to write his thesis?"
)
print(md_response.response)
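The exact-match key=value filtering used in this section can be pictured as a pre-filter applied to candidate nodes before similarity ranking. The following is a pure-Python sketch under that assumption; it does not show how Baidu VectorDB evaluates filters internally, and `matches` and the sample nodes are hypothetical.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """True when every filter key is present with exactly the given value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in filters.items()
    )


# Toy nodes carrying the same kind of metadata as my_file_metadata returns
nodes: List[Dict[str, Any]] = [
    {"text": "What I worked on", "metadata": {"source_type": "essay"}},
    {"text": "Stegosaurus facts", "metadata": {"source_type": "dinos"}},
    {"text": "Release notes", "metadata": {"source_type": "other"}},
]

essay_nodes = [node for node in nodes if matches(node["metadata"], {"source_type": "essay"})]
print([node["text"] for node in essay_nodes])
```

Only nodes whose `source_type` equals `"essay"` survive, which mirrors the effect of the `MetadataFilter(key="source_type", value="essay")` condition in the query above.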