Cassandra 数据库工具¶
Apache Cassandra® 是一个广泛用于存储事务性应用数据的数据库。在大语言模型中引入函数和工具为生成式 AI 应用中利用现有数据开辟了一些令人兴奋的用例。Cassandra 数据库工具包使 AI 工程师能够高效地将 Agent 与 Cassandra 数据集成,提供以下功能:
- 通过优化查询实现快速数据访问。大多数查询应在个位数毫秒或更短时间内运行。
- 模式自省以增强 LLM 的推理能力
- 与各种 Cassandra 部署兼容,包括 Apache Cassandra®、DataStax Enterprise™ 和 DataStax Astra™
- 目前,该工具包仅限于 SELECT 查询和模式自省操作。(安全第一)
快速开始¶
- 安装 cassio 库
- 设置您要连接的 Cassandra 数据库的环境变量
- 初始化 CassandraDatabase
- 使用 spec.to_tool_list() 将工具传递给您的 agent
- 坐下来,看着它为您完成所有工作
操作原理¶
Cassandra 查询语言 (CQL) 是与 Cassandra 数据库交互的主要以 *人为中心* 的方式。虽然在生成查询时提供了一些灵活性,但它需要了解 Cassandra 数据建模的最佳实践。LLM 函数调用使 agent 能够进行推理,然后选择工具来满足请求。使用 LLM 的 agent 在选择适当的工具或工具链时应使用 Cassandra 特有的逻辑进行推理。这减少了 LLM 被迫提供自上而下解决方案时引入的随机性。您希望 LLM 完全不受限制地访问您的数据库吗?是的。可能不希望。为了实现这一点,我们提供了一个用于为 agent 构建问题的提示词
You are an Apache Cassandra expert query analysis bot with the following features
and rules:
- You will take a question from the end user about finding specific
data in the database.
- You will examine the schema of the database and create a query path.
- You will provide the user with the correct query to find the data they are looking
for, showing the steps provided by the query path.
- You will use best practices for querying Apache Cassandra using partition keys
and clustering columns.
- Avoid using ALLOW FILTERING in the query.
- The goal is to find a query path, so it may take querying other tables to get
to the final answer.
The following is an example of a query path in JSON format:
{
"query_paths": [
{
"description": "Direct query to users table using email",
"steps": [
{
"table": "user_credentials",
"query":
"SELECT userid FROM user_credentials WHERE email = '[email protected]';"
},
{
"table": "users",
"query": "SELECT * FROM users WHERE userid = ?;"
}
]
}
]
}
提供的工具¶
cassandra_db_schema
¶
收集连接的数据库或特定模式的所有模式信息。对于 agent 确定操作至关重要。
cassandra_db_select_table_data
¶
从特定的键空间和表中选择数据。agent 可以传递谓词参数以及返回记录数量的限制。
cassandra_db_query
¶
cassandra_db_select_table_data
的实验性替代方案,它接受由 agent 完全形成的查询字符串而不是参数。警告:这可能导致不寻常的查询,这些查询可能性能不佳(甚至无法工作)。这可能会在未来的版本中删除。如果它做了一些很酷的事情,我们也想知道。你永远不知道!
环境设置¶
安装以下 Python 模块
pip install ipykernel python-dotenv cassio llama-index llama-index-agent-openai llama-index-llms-openai llama-index-tools-cassandra
.env 文件¶
连接通过使用 auto=True
参数的 cassio
进行,并且 notebook 使用 OpenAI。您应相应地创建一个 .env
文件。
对于 Cassandra,设置
CASSANDRA_CONTACT_POINTS
CASSANDRA_USERNAME
CASSANDRA_PASSWORD
CASSANDRA_KEYSPACE
对于 Astra,设置
ASTRA_DB_APPLICATION_TOKEN
ASTRA_DB_DATABASE_ID
ASTRA_DB_KEYSPACE
例如
# Connection to Astra:
ASTRA_DB_DATABASE_ID=a1b2c3d4-...
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
ASTRA_DB_KEYSPACE=notebooks
# Also set
OPENAI_API_KEY=sk-....
(您也可以修改以下代码以直接与 cassio
连接。)
from dotenv import load_dotenv
load_dotenv(override=True)
# Import necessary libraries
import os
import cassio
from llama_index.tools.cassandra.base import CassandraDatabaseToolSpec
from llama_index.tools.cassandra.cassandra_database_wrapper import (
CassandraDatabase,
)
from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI
连接到 Cassandra 数据库¶
cassio.init(auto=True)
session = cassio.config.resolve_session()
if not session:
raise Exception(
"Check environment configuration or manually configure cassio connection parameters"
)
# Test data prep
session = cassio.config.resolve_session()
session.execute("""DROP KEYSPACE IF EXISTS llamaindex_agent_test; """)
session.execute(
"""
CREATE KEYSPACE if not exists llamaindex_agent_test
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
"""
)
session.execute(
"""
CREATE TABLE IF NOT EXISTS llamaindex_agent_test.user_credentials (
user_email text PRIMARY KEY,
user_id UUID,
password TEXT
);
"""
)
session.execute(
"""
CREATE TABLE IF NOT EXISTS llamaindex_agent_test.users (
id UUID PRIMARY KEY,
name TEXT,
email TEXT
);"""
)
session.execute(
"""
CREATE TABLE IF NOT EXISTS llamaindex_agent_test.user_videos (
user_id UUID,
video_id UUID,
title TEXT,
description TEXT,
PRIMARY KEY (user_id, video_id)
);
"""
)
user_id = "522b1fe2-2e36-4cef-a667-cd4237d08b89"
video_id = "27066014-bad7-9f58-5a30-f63fe03718f6"
session.execute(
f"""
INSERT INTO llamaindex_agent_test.user_credentials (user_id, user_email)
VALUES ({user_id}, '[email protected]');
"""
)
session.execute(
f"""
INSERT INTO llamaindex_agent_test.users (id, name, email)
VALUES ({user_id}, 'Patrick McFadin', '[email protected]');
"""
)
session.execute(
f"""
INSERT INTO llamaindex_agent_test.user_videos (user_id, video_id, title)
VALUES ({user_id}, {video_id}, 'Use Langflow to Build an LLM Application in 5 Minutes');
"""
)
session.set_keyspace("llamaindex_agent_test")
# Create a CassandraDatabaseToolSpec object
db = CassandraDatabase()
spec = CassandraDatabaseToolSpec(db=db)
tools = spec.to_tool_list()
for tool in tools:
print(tool.metadata.name)
print(tool.metadata.description)
print(tool.metadata.fn_schema)
cassandra_db_schema cassandra_db_schema(keyspace: str) -> List[llama_index.core.schema.Document] Input to this tool is a keyspace name, output is a table description of Apache Cassandra tables. If the query is not correct, an error message will be returned. If an error is returned, report back to the user that the keyspace doesn't exist and stop. Args: keyspace (str): The name of the keyspace for which to return the schema. Returns: List[Document]: A list of Document objects, each containing a table description. <class 'pydantic.main.cassandra_db_schema'> cassandra_db_select_table_data cassandra_db_select_table_data(keyspace: str, table: str, predicate: str, limit: int) -> List[llama_index.core.schema.Document] Tool for getting data from a table in an Apache Cassandra database. Use the WHERE clause to specify the predicate for the query that uses the primary key. A blank predicate will return all rows. Avoid this if possible. Use the limit to specify the number of rows to return. A blank limit will return all rows. Args: keyspace (str): The name of the keyspace containing the table. table (str): The name of the table for which to return data. predicate (str): The predicate for the query that uses the primary key. limit (int): The maximum number of rows to return. Returns: List[Document]: A list of Document objects, each containing a row of data. <class 'pydantic.main.cassandra_db_select_table_data'>
# Choose the LLM that will drive the agent
# Only certain models support this
llm = OpenAI(model="gpt-4-1106-preview")
# Create the Agent with our tools. Verbose will echo the agent's actions
agent = OpenAIAgent.from_tools(tools, llm=llm, verbose=True)
使用工具调用 agent¶
我们创建了一个 agent,它使用 LLM 进行推理和通信,并使用工具列表进行操作。现在我们可以简单地向 agent 提问,并观察它如何利用我们提供的工具。
# Ask our new agent a series of questions. What how the agent uses tools to get the answers.
agent.chat("What tables are in the keyspace llamaindex_agent_test?")
agent.chat("What is the userid for [email protected] ?")
agent.chat("What videos did user [email protected] upload?")
Added user message to memory: What tables are in the keyspace llamaindex_agent_test? === Calling Function === Calling function: cassandra_db_schema with args: {"keyspace":"llamaindex_agent_test"} Got output: [Document(id_='4b6011e6-62e6-4db2-9198-046534b7c8dd', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Table Name: user_credentials\n- Keyspace: llamaindex_agent_test\n- Columns\n - password (text)\n - user_email (text)\n - user_id (uuid)\n- Partition Keys: (user_email)\n- Clustering Keys: \n\nTable Name: user_videos\n- Keyspace: llamaindex_agent_test\n- Columns\n - description (text)\n - title (text)\n - user_id (uuid)\n - video_id (uuid)\n- Partition Keys: (user_id)\n- Clustering Keys: (video_id asc)\n\n\nTable Name: users\n- Keyspace: llamaindex_agent_test\n- Columns\n - email (text)\n - id (uuid)\n - name (text)\n- Partition Keys: (id)\n- Clustering Keys: \n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ======================== Added user message to memory: What is the userid for [email protected] ? === Calling Function === Calling function: cassandra_db_select_table_data with args: {"keyspace":"llamaindex_agent_test","table":"user_credentials","predicate":"user_email = '[email protected]'","limit":1} Got output: [Document(id_='e5620177-c735-46f8-a09a-a0e062efcdec', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_email='[email protected]', password=None, user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'))", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ======================== Added user message to memory: What videos did user [email protected] upload? === Calling Function === Calling function: cassandra_db_select_table_data with args: {"keyspace":"llamaindex_agent_test","table":"user_videos","predicate":"user_id = 522b1fe2-2e36-4cef-a667-cd4237d08b89","limit":10} Got output: [Document(id_='e3ecfba1-e8e1-4ce3-b321-3f51e12077a1', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'), video_id=UUID('27066014-bad7-9f58-5a30-f63fe03718f6'), description=None, title='Use Langflow to Build an LLM Application in 5 Minutes')", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ========================
AgentChatResponse(response='The user `[email protected]` uploaded the following video in the `llamaindex_agent_test` keyspace:\n\n- Title: "Use Langflow to Build an LLM Application in 5 Minutes"\n- Video ID: `27066014-bad7-9f58-5a30-f63fe03718f6`\n- Description: Not provided', sources=[ToolOutput(content='[Document(id_=\'e3ecfba1-e8e1-4ce3-b321-3f51e12077a1\', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID(\'522b1fe2-2e36-4cef-a667-cd4237d08b89\'), video_id=UUID(\'27066014-bad7-9f58-5a30-f63fe03718f6\'), description=None, title=\'Use Langflow to Build an LLM Application in 5 Minutes\')", start_char_idx=None, end_char_idx=None, text_template=\'{metadata_str}\\n\\n{content}\', metadata_template=\'{key}: {value}\', metadata_seperator=\'\\n\')]', tool_name='cassandra_db_select_table_data', raw_input={'args': (), 'kwargs': {'keyspace': 'llamaindex_agent_test', 'table': 'user_videos', 'predicate': 'user_id = 522b1fe2-2e36-4cef-a667-cd4237d08b89', 'limit': 10}}, raw_output=[Document(id_='e3ecfba1-e8e1-4ce3-b321-3f51e12077a1', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'), video_id=UUID('27066014-bad7-9f58-5a30-f63fe03718f6'), description=None, title='Use Langflow to Build an LLM Application in 5 Minutes')", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')], is_error=False)], source_nodes=[], is_dummy_stream=False)