Tencent Cloud VectorDB¶
Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service designed for storing, retrieving, and analyzing multi-dimensional vector data. The database supports multiple index types and similarity calculation methods. A single index can scale to 1 billion vectors, with support for millions of QPS and millisecond-level query latency. Beyond serving as an external knowledge base that improves the accuracy of large language model responses, Tencent Cloud VectorDB is widely applicable in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.
This notebook shows the basic usage of TencentVectorDB as a vector store in LlamaIndex.
To run it, you need a running database instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-tencentvectordb
!pip install llama-index
!pip install tcvectordb
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
StorageContext,
)
from llama_index.vector_stores.tencentvectordb import TencentVectorDB
from llama_index.core.vector_stores.tencentvectordb import (
CollectionParams,
FilterField,
)
import tcvectordb
tcvectordb.debug.DebugEnable = False
Please provide OpenAI access key¶
In order to use the embeddings from OpenAI, you need to supply an OpenAI API key.
import getpass

import openai

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
OpenAI API Key: ········
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Create and populate the vector store¶
You will now load some essays by Paul Graham from a local file and store them into the Tencent Cloud VectorDB.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 5b7489b6-0cca-4088-8f30-6de32d540fdf
First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
First document, text (75019 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ...
Initialize the Tencent Cloud VectorDB¶
Creation of the vector store entails creation of the underlying database collection if it does not exist yet:
vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=True),
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each chunk, and stores them all into the Tencent Cloud VectorDB.
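The chunking step can be sketched in plain Python. This is a deliberately naive, character-based splitter shown purely for illustration (LlamaIndex's actual default splitter is sentence-aware, and the chunk size and overlap values here are only assumptions for the sketch):

```python
def split_into_chunks(text: str, chunk_size: int = 1024, overlap: int = 200):
    """Naive character-based splitter: emit fixed-size windows that
    overlap by `overlap` characters so content at a boundary is not lost."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += step
    return chunks

# 3000 characters -> 4 overlapping chunks of at most 1024 characters each
chunks = split_into_chunks("x" * 3000)
print(len(chunks), [len(c) for c in chunks])
# → 4 [1024, 1024, 1024, 528]
```

Each such chunk would then be embedded and written to the collection, which is exactly what from_documents automates for you.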
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because of his fascination with the novel The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also drawn to the idea that AI could be used to explore the ultimate truths that other fields could not.
MMR-based queries¶
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at once relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because he was impressed and envious of his friend who had built a computer kit and was able to type programs into it. He was also inspired by a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also disappointed with philosophy courses in college, which he found to be boring, and he wanted to work on something that seemed more powerful.
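Under the hood, MMR greedily scores each candidate by trading query relevance off against redundancy with the chunks already selected. Here is a minimal pure-Python sketch of that selection rule over toy 2-D vectors (the lambda weight and the vectors are made up for illustration; the actual computation inside the vector store may differ):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def mmr_select(query, docs, k, lam=0.4):
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy,
    where redundancy is the max similarity to any already-selected doc."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
# Pure similarity would rank docs 0 and 1 first; MMR picks the
# diverse doc 2 as the second result instead.
print(mmr_select(query, docs, k=2))
# → [0, 2]
```

This is why MMR answers often draw on a wider spread of the source text than plain top-k similarity.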
Connecting to an existing store¶
Since this store is backed by Tencent Cloud VectorDB, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=False),
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
vector_store=new_vector_store
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What did the author study prior to working on AI?"
)
print(response)
The author studied philosophy and painting, worked on spam filters, and wrote essays prior to working on AI.
Removing documents from the index¶
First get an explicit list of pieces of a document, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
vector_store_query_mode="mmr",
similarity_top_k=3,
vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
print(f" [{idx}] score = {node_with_score.score}")
print(f" id = {node_with_score.node.node_id}")
print(f" text = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
 [0] score = 0.42589144520149874
 id = 05f53f06-9905-461a-bc6d-fa4817e5a776
 text = What I Worked On February 2021 Before college the two main things I worked on, outside o ...
 [1] score = -0.0012061281453193962
 id = 2f9f843e-6495-4646-a03d-4b844ff7c1ab
 text = been explored. But all I wanted was to get out of grad school, and my rapidly written diss ...
 [2] score = 0.025454533089838027
 id = 28ad32da-25f9-4aaa-8487-88390ec13348
 text = showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress ...
But wait! When using the vector store, you should consider the document as the sensible unit to delete, and not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
Now say you need to remove the text file you uploaded:
new_vector_store.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results now. You should see no results found:
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Tencent Cloud VectorDB vector store supports metadata filtering in the form of exact-match key=value pairs at query time. The following cells, which work on a brand new collection, demonstrate this feature.
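Conceptually, exact-match filtering simply keeps the nodes whose metadata matches every key=value pair in the filter, and only those survivors take part in the similarity search. A pure-Python sketch of that semantics (the node dicts and field names below are made up for illustration only):

```python
# Hypothetical stored nodes, each carrying a metadata dict.
nodes = [
    {"text": "chunk A", "metadata": {"source_type": "essay"}},
    {"text": "chunk B", "metadata": {"source_type": "dinos"}},
    {"text": "chunk C", "metadata": {"source_type": "essay"}},
]


def exact_match_filter(nodes, filters):
    """Keep only nodes whose metadata satisfies every key=value pair."""
    return [
        n for n in nodes
        if all(n["metadata"].get(k) == v for k, v in filters.items())
    ]


print([n["text"] for n in exact_match_filter(nodes, {"source_type": "essay"})])
# → ['chunk A', 'chunk C']
```

In the real store this filtering happens server-side, over the filter fields declared when the collection is created.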
In this demo, for the sake of brevity, a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
filter_fields = [
FilterField(name="source_type"),
]
md_storage_context = StorageContext.from_defaults(
vector_store=TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(
dimension=1536, drop_exists=True, filter_fields=filter_fields
),
)
)
def my_file_metadata(file_name: str):
"""Depending on the input file name, associate a different metadata."""
if "essay" in file_name:
source_type = "essay"
elif "dinosaur" in file_name:
# this (unfortunately) will not happen in this demo
source_type = "dinos"
else:
source_type = "other"
return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
"./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[ExactMatchFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"How long it took the author to write his thesis?"
)
print(md_response.response)
It took the author five weeks to write his thesis.
To test that the filtering is at play, try to change it to use only "dinos" documents... there will be no answer this time :)