Cassandra Vector Store¶
Setup¶
%pip install llama-index-vector-stores-cassandra
!pip install --quiet "astrapy>=0.5.8"
import os
from getpass import getpass
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.vector_stores.cassandra import CassandraVectorStore
The next step is to initialize CassIO with a global database connection: this is the only step that is done slightly differently for a Cassandra cluster and for Astra DB.
Initialization (Cassandra cluster)¶
In this case, you first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. network settings and authentication), but it will look roughly like this:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Initialization (Astra DB through CQL)¶
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a "Database Administrator" token)
- an optional Keyspace name (if omitted, the database's default keyspace will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_TOKEN = getpass("ASTRA_DB_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
    ASTRA_DB_KEYSPACE = desired_keyspace
else:
    ASTRA_DB_KEYSPACE = None
import cassio
cassio.init(
    database_id=ASTRA_DB_ID,
    token=ASTRA_DB_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)
OpenAI key¶
In order to use OpenAI embeddings, you need to supply an OpenAI API Key:
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-10 01:44:05--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.01s

2023-11-10 01:44:06 (4.80 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Create and populate the Vector Store¶
You will now load some of Paul Graham's essay from a local file and store it into the Cassandra Vector Store.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 12bc6987-366a-49eb-8de0-7b52340e4958
First document, hash: abe31930a1775c78df5a5b1ece7108f78fedbf5fe4a9cf58d7a21808fccaef34
First document, text (75014 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...
Initialize the Cassandra Vector Store¶
Creating the vector store also creates the underlying database table, if it does not exist yet (the embedding_dimension of 1536 matches the size of the OpenAI embedding vectors used in this notebook):
cassandra_store = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Note that the above `from_documents` call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each node, and stores them all into the Cassandra Vector Store.
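If you ever want to run those steps explicitly, here is a rough, unofficial sketch of an equivalent by-hand flow. The SentenceSplitter with default settings is an assumption for illustration, not necessarily the exact parser `from_documents` uses internally, and manual_index is just an illustrative name for an index equivalent to the one above:
from llama_index.core.node_parser import SentenceSplitter
# 1. split the input documents into manageable-size chunks ("nodes")
nodes = SentenceSplitter().get_nodes_from_documents(documents)
# 2. + 3. building the index from the nodes computes an embedding
# vector per node and stores everything into the Cassandra Vector Store
manual_index = VectorStoreIndex(nodes, storage_context=storage_context)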
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they were inspired by a novel called The Moon is a Harsh Mistress, which featured an intelligent computer, and a PBS documentary that showed Terry Winograd using SHRDLU. These experiences sparked the author's interest in AI and motivated them to pursue it as a field of study and work.
MMR-based queries¶
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at the same time relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer.
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they believed that teaching SHRDLU more words would eventually lead to the development of intelligent programs. They were fascinated by the potential of AI and saw it as an opportunity to expand their understanding of programming and push the limits of what could be achieved.
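For finer control over the MMR behavior, retriever-level keyword arguments such as mmr_prefetch_factor (used again with the explicit Retriever later in this notebook) can also be passed through the query engine. A minimal sketch, assuming the same index as above:
query_engine = index.as_query_engine(
    vector_store_query_mode="mmr",
    # prefetch more candidates than similarity_top_k, then diversify
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
response = query_engine.query("Why did the author choose to work on AI?")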
Connecting to an existing store¶
Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_store_instance = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_store_instance
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response.response)
The author studied philosophy prior to working on AI.
Removing documents from the index¶
First, get an explicit list of pieces of a document, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
    [0] score = 0.4251742327832831
        id    = 7e628668-58fa-4548-9c92-8c31d315dce0
        text  = What I Worked On February 2021 Before college the two main things I worked on, outside o ...
    [1] score = -0.020323897262800816
        id    = aa279d09-717f-4d68-9151-594c5bfef7ce
        text  = This was now only weeks away. My nice landlady let me leave my stuff in her attic. I had s ...
    [2] score = 0.011198131320563909
        id    = 50b9170d-6618-4e8b-aaf8-36632e2801a6
        text  = It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDL ...
But wait! When using the vector store, you should consider the document as the sensible unit to delete, not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
12bc6987-366a-49eb-8de0-7b52340e4958
Now let's say you need to remove the text file you uploaded:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the very same query and check the results now. You should see that no results are found:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Cassandra vector store supports metadata filtering at query time, in the form of exact-match key=value pairs. The following cells, which work on a brand new Cassandra table, demonstrate this feature.

In this demo, for the sake of brevity, a single source document is loaded (the ../data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        table="cass_v_table_md", embedding_dimension=1536
    )
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "did the author appreciate Lisp and painting?"
)
print(md_response.response)
Yes, the author appreciated Lisp and painting. They mentioned spending a significant amount of time working on Lisp and even building a new dialect of Lisp called Arc. Additionally, the author mentioned spending most of 2014 painting and experimenting with different techniques.
To test that the filtering is at play, try to change it to use only "dinos" documents... and there will be no answer this time :)
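Here is a minimal sketch of that check, reusing the query engine construction from above with only the filter value changed (md_query_engine_dinos and md_response_dinos are illustrative names):
md_query_engine_dinos = md_index.as_query_engine(
    filters=MetadataFilters(
        # no stored node carries source_type="dinos", so nothing matches
        filters=[ExactMatchFilter(key="source_type", value="dinos")]
    )
)
md_response_dinos = md_query_engine_dinos.query(
    "did the author appreciate Lisp and painting?"
)
print(md_response_dinos.response)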