Cassandra Vector Store¶
Setup¶
%pip install llama-index-vector-stores-cassandra
!pip install --quiet "astrapy>=0.5.8"
import os
from getpass import getpass
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.vector_stores.cassandra import CassandraVectorStore
The next step is to initialize CassIO with a global database connection: this is the only step that is done slightly differently for a Cassandra cluster and for Astra DB.
Initialization (Cassandra cluster)¶
In this case, you first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. network settings and authentication), but it will look roughly like this:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Initialization (Astra DB through CQL)¶
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a "Database Administrator" token)
- an optional Keyspace name (if omitted, the database's default keyspace will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_TOKEN = getpass("ASTRA_DB_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
    ASTRA_DB_KEYSPACE = desired_keyspace
else:
    ASTRA_DB_KEYSPACE = None
import cassio
cassio.init(
    database_id=ASTRA_DB_ID,
    token=ASTRA_DB_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)
OpenAI key¶
In order to use OpenAI embeddings, you need to supply an OpenAI API Key:
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-10 01:44:05--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.01s

2023-11-10 01:44:06 (4.80 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Create and populate the Vector Store¶
You will now load some of Paul Graham's essay from a local file and store it into the Cassandra Vector Store.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 12bc6987-366a-49eb-8de0-7b52340e4958
First document, hash: abe31930a1775c78df5a5b1ece7108f78fedbf5fe4a9cf58d7a21808fccaef34
First document, text (75014 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...
Initialize the Cassandra Vector Store¶
Creating the vector store also creates the underlying database table, if it does not exist yet (the embedding_dimension of 1536 matches the size of the OpenAI embedding vectors used in this notebook):
cassandra_store = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Note that the above `from_documents` call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each node, and stores them all into the Cassandra Vector Store.
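If you ever want to run those steps explicitly, here is a rough, unofficial sketch of an equivalent by-hand flow. The SentenceSplitter with default settings is an assumption for illustration, not necessarily the exact parser `from_documents` uses internally, and manual_index is just an illustrative name for an index equivalent to the one above:
from llama_index.core.node_parser import SentenceSplitter
# 1. split the input documents into manageable-size chunks ("nodes")
nodes = SentenceSplitter().get_nodes_from_documents(documents)
# 2. + 3. building the index from the nodes computes an embedding
# vector per node and stores everything into the Cassandra Vector Store
manual_index = VectorStoreIndex(nodes, storage_context=storage_context)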
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they were inspired by a novel called The Moon is a Harsh Mistress, which featured an intelligent computer, and a PBS documentary that showed Terry Winograd using SHRDLU. These experiences sparked the author's interest in AI and motivated them to pursue it as a field of study and work.
MMR-based queries¶
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at the same time relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer.
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they believed that teaching SHRDLU more words would eventually lead to the development of intelligent programs. They were fascinated by the potential of AI and saw it as an opportunity to expand their understanding of programming and push the limits of what could be achieved.
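For finer control over the MMR behavior, retriever-level keyword arguments such as mmr_prefetch_factor (used again with the explicit Retriever later in this notebook) can also be passed through the query engine. A minimal sketch, assuming the same index as above:
query_engine = index.as_query_engine(
    vector_store_query_mode="mmr",
    # prefetch more candidates than similarity_top_k, then diversify
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
response = query_engine.query("Why did the author choose to work on AI?")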
Connecting to an existing store¶
Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_store_instance = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_store_instance
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response.response)
The author studied philosophy prior to working on AI.
Removing documents from the index¶
First, get an explicit list of pieces of a document, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
    [0] score = 0.4251742327832831
        id    = 7e628668-58fa-4548-9c92-8c31d315dce0
        text  = What I Worked On February 2021 Before college the two main things I worked on, outside o ...
    [1] score = -0.020323897262800816
        id    = aa279d09-717f-4d68-9151-594c5bfef7ce
        text  = This was now only weeks away. My nice landlady let me leave my stuff in her attic. I had s ...
    [2] score = 0.011198131320563909
        id    = 50b9170d-6618-4e8b-aaf8-36632e2801a6
        text  = It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDL ...
But wait! When using the vector store, you should consider the document as the sensible unit to delete, not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
Now let's say you need to remove the text file you uploaded:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results now. You should see that no results are found:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Cassandra vector store supports metadata filtering at query time, in the form of exact-match key=value pairs. The following cells, which work on a brand new Cassandra table, demonstrate this feature.

In this demo, for the sake of brevity, a single source document is loaded (the ../data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        table="cass_v_table_md", embedding_dimension=1536
    )
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "did the author appreciate Lisp and painting?"
)
print(md_response.response)
Yes, the author appreciated Lisp and painting. They mentioned spending a significant amount of time working on Lisp and even building a new dialect of Lisp called Arc. Additionally, the author mentioned spending most of 2014 painting and experimenting with different techniques.
To test that the filtering is at play, try to change it to use only "dinos" documents... and there will be no answer this time :)
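Here is a minimal sketch of that check, reusing the query engine construction from above with only the filter value changed (md_query_engine_dinos and md_response_dinos are illustrative names):
md_query_engine_dinos = md_index.as_query_engine(
    filters=MetadataFilters(
        # no stored node carries source_type="dinos", so nothing matches
        filters=[ExactMatchFilter(key="source_type", value="dinos")]
    )
)
md_response_dinos = md_query_engine_dinos.query(
    "did the author appreciate Lisp and painting?"
)
print(md_response_dinos.response)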