Tencent Cloud VectorDB¶
Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service designed for storing, retrieving, and analyzing multi-dimensional vector data. The database supports multiple index types and similarity calculation methods. A single index can scale to 1 billion vectors, with support for millions of QPS and millisecond-level query latency. Beyond serving as an external knowledge base that improves the accuracy of large language model responses, Tencent Cloud VectorDB is widely applicable in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.
This notebook shows the basic usage of TencentVectorDB as a vector store in LlamaIndex.
To run it, you need a running database instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-tencentvectordb
!pip install llama-index
!pip install tcvectordb
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
StorageContext,
)
from llama_index.vector_stores.tencentvectordb import TencentVectorDB
from llama_index.core.vector_stores.tencentvectordb import (
CollectionParams,
FilterField,
)
import tcvectordb
tcvectordb.debug.DebugEnable = False
Please provide OpenAI access key¶
In order to use the embeddings from OpenAI, you need to supply an OpenAI API key.
import getpass

import openai

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
OpenAI API Key: ········
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Create and populate the vector store¶
You will now load some essays by Paul Graham from a local file and store them into the Tencent Cloud VectorDB.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 5b7489b6-0cca-4088-8f30-6de32d540fdf
First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
First document, text (75019 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ...
Initialize the Tencent Cloud VectorDB¶
Creation of the vector store entails creation of the underlying database collection if it does not exist yet:
vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=True),
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each chunk, and stores them all into the Tencent Cloud VectorDB.
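The chunking step can be sketched in plain Python. This is a deliberately naive, character-based splitter shown purely for illustration (LlamaIndex's actual default splitter is sentence-aware, and the chunk size and overlap values here are only assumptions for the sketch):

```python
def split_into_chunks(text: str, chunk_size: int = 1024, overlap: int = 200):
    """Naive character-based splitter: emit fixed-size windows that
    overlap by `overlap` characters so content at a boundary is not lost."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += step
    return chunks

# 3000 characters -> 4 overlapping chunks of at most 1024 characters each
chunks = split_into_chunks("x" * 3000)
print(len(chunks), [len(c) for c in chunks])
# → 4 [1024, 1024, 1024, 528]
```

Each such chunk would then be embedded and written to the collection, which is exactly what from_documents automates for you.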
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because of his fascination with the novel The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also drawn to the idea that AI could be used to explore the ultimate truths that other fields could not.
MMR-based queries¶
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at once relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because he was impressed and envious of his friend who had built a computer kit and was able to type programs into it. He was also inspired by a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also disappointed with philosophy courses in college, which he found to be boring, and he wanted to work on something that seemed more powerful.
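Under the hood, MMR greedily scores each candidate by trading query relevance off against redundancy with the chunks already selected. Here is a minimal pure-Python sketch of that selection rule over toy 2-D vectors (the lambda weight and the vectors are made up for illustration; the actual computation inside the vector store may differ):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def mmr_select(query, docs, k, lam=0.4):
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy,
    where redundancy is the max similarity to any already-selected doc."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
# Pure similarity would rank docs 0 and 1 first; MMR picks the
# diverse doc 2 as the second result instead.
print(mmr_select(query, docs, k=2))
# → [0, 2]
```

This is why MMR answers often draw on a wider spread of the source text than plain top-k similarity.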
Connecting to an existing store¶
Since this store is backed by Tencent Cloud VectorDB, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=False),
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
vector_store=new_vector_store
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What did the author study prior to working on AI?"
)
print(response)
The author studied philosophy and painting, worked on spam filters, and wrote essays prior to working on AI.
Removing documents from the index¶
First get an explicit list of pieces of a document, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
vector_store_query_mode="mmr",
similarity_top_k=3,
vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
print(f" [{idx}] score = {node_with_score.score}")
print(f" id = {node_with_score.node.node_id}")
print(f" text = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
 [0] score = 0.42589144520149874
 id = 05f53f06-9905-461a-bc6d-fa4817e5a776
 text = What I Worked On February 2021 Before college the two main things I worked on, outside o ...
 [1] score = -0.0012061281453193962
 id = 2f9f843e-6495-4646-a03d-4b844ff7c1ab
 text = been explored. But all I wanted was to get out of grad school, and my rapidly written diss ...
 [2] score = 0.025454533089838027
 id = 28ad32da-25f9-4aaa-8487-88390ec13348
 text = showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress ...
But wait! When using the vector store, you should consider the document as the sensible unit to delete, and not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
Now say you need to remove the text file you uploaded:
new_vector_store.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results now. You should see no results found:
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Tencent Cloud VectorDB vector store supports metadata filtering in the form of exact-match key=value pairs at query time. The following cells, which work on a brand new collection, demonstrate this feature.
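Conceptually, exact-match filtering simply keeps the nodes whose metadata matches every key=value pair in the filter, and only those survivors take part in the similarity search. A pure-Python sketch of that semantics (the node dicts and field names below are made up for illustration only):

```python
# Hypothetical stored nodes, each carrying a metadata dict.
nodes = [
    {"text": "chunk A", "metadata": {"source_type": "essay"}},
    {"text": "chunk B", "metadata": {"source_type": "dinos"}},
    {"text": "chunk C", "metadata": {"source_type": "essay"}},
]


def exact_match_filter(nodes, filters):
    """Keep only nodes whose metadata satisfies every key=value pair."""
    return [
        n for n in nodes
        if all(n["metadata"].get(k) == v for k, v in filters.items())
    ]


print([n["text"] for n in exact_match_filter(nodes, {"source_type": "essay"})])
# → ['chunk A', 'chunk C']
```

In the real store this filtering happens server-side, over the filter fields declared when the collection is created.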
In this demo, for the sake of brevity, a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
filter_fields = [
FilterField(name="source_type"),
]
md_storage_context = StorageContext.from_defaults(
vector_store=TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(
dimension=1536, drop_exists=True, filter_fields=filter_fields
),
)
)
def my_file_metadata(file_name: str):
"""Depending on the input file name, associate a different metadata."""
if "essay" in file_name:
source_type = "essay"
elif "dinosaur" in file_name:
# this (unfortunately) will not happen in this demo
source_type = "dinos"
else:
source_type = "other"
return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
"./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[ExactMatchFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"How long it took the author to write his thesis?"
)
print(md_response.response)
It took the author five weeks to write his thesis.
To test that the filtering is at play, try to change it to use only "dinos" documents... there will be no answer this time :)