Oracle AI Vector Search with Document Processing¶
Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics rather than keywords. One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also significantly more effective, because you do not need to add a specialized vector database, which eliminates the pain of data fragmentation across multiple systems.
In addition, your vectors can benefit from all of Oracle Database's most powerful features, such as:
- Partitioning support
- Real Application Clusters scalability
- Exadata smart scans
- Shard processing across geographically distributed databases
- Transactions
- Parallel SQL
- Disaster recovery
- Security
- Oracle Machine Learning
- Oracle Graph Database
- Oracle Spatial and Graph
- Oracle Blockchain
- JSON
This guide demonstrates how to use Oracle AI Vector Search with llama_index to build an end-to-end RAG pipeline. It walks through examples of:
- Loading documents from various sources using OracleReader
- Summarizing them inside or outside the database using OracleSummary
- Generating embeddings for them inside or outside the database using OracleEmbeddings
- Chunking them according to different requirements using the advanced Oracle capabilities of OracleTextSplitter
- Storing and indexing them in a vector store and querying them at query time with OraLlamaVS
If you are just getting started with Oracle Database, consider exploring the free Oracle 23 AI, which provides a great introduction to setting up your database environment. While working with the database, it is often advisable to avoid using the system user by default; instead, create your own user for enhanced security and customization. For detailed steps on user creation, please refer to our end-to-end guide, which also shows how to set up a user in Oracle. Additionally, understanding user privileges is essential for managing database security effectively. You can learn more about this topic in the official Oracle guide on administering user accounts and security.
Prerequisites¶
Please install the Oracle `llama-index` integration packages:
%pip install llama-index
%pip install llama-index-embeddings-oracleai
%pip install llama-index-readers-oracleai
%pip install llama-index-utils-oracleai
%pip install llama-index-vector-stores-oracledb
Create Demo User¶
First, create a demo user with all the required privileges.
import sys

import oracledb

# Update with your username, password, hostname, and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname/service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")

    cursor = conn.cursor()
    try:
        cursor.execute(
            """
            begin
                -- Drop user
                begin
                    execute immediate 'drop user testuser cascade';
                exception
                    when others then
                        dbms_output.put_line('Error dropping user: ' || SQLERRM);
                end;

                -- Create user and grant privileges
                execute immediate 'create user testuser identified by testuser';
                execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';
                execute immediate 'create or replace directory DEMO_PY_DIR as ''/scratch/hroy/view_storage/hroy_devstorage/demo/orachain''';
                execute immediate 'grant read, write on directory DEMO_PY_DIR to public';
                execute immediate 'grant create mining model to testuser';

                -- Network access
                begin
                    DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
                        host => '*',
                        ace => xs$ace_type(privilege_list => xs$name_list('connect'),
                                           principal_name => 'testuser',
                                           principal_type => xs_acl.ptype_db)
                    );
                end;
            end;
            """
        )
        print("User setup done!")
    except Exception as e:
        print(f"User setup failed with error: {e}")
    finally:
        cursor.close()
        conn.close()
except Exception as e:
    print(f"Connection failed with error: {e}")
    sys.exit(1)
Connection successful!
User setup done!
Process Documents Using Oracle AI¶
Consider the following scenario: users have documents stored either in Oracle Database or in a file system and intend to use this data with Oracle AI Vector Search powered by llama_index.
To prepare the documents for analysis, a comprehensive preprocessing workflow is required. First, the documents must be retrieved, summarized (if needed), and chunked as required. Subsequent steps involve generating embeddings for these chunks and integrating them into the Oracle AI Vector Store. Users can then conduct semantic searches on this data.
The Oracle AI Vector Search llama_index library includes a suite of document processing tools that facilitate document loading, chunking, summary generation, and embedding creation.
In the sections that follow, we detail how to use the Oracle AI llama_index APIs to carry out each of these processes effectively.
Connect to Demo User¶
By default, python-oracledb runs in "Thin" mode, which connects directly to Oracle Database and does not require the Oracle Client libraries. However, some additional functionality is available when python-oracledb uses them; python-oracledb is said to be in "Thick" mode when the Oracle Client libraries are in use. Both modes have comprehensive functionality supporting the Python Database API v2.0 Specification. See the python-oracledb documentation for a discussion of the features supported by each mode. You may want to switch to Thick mode if you are unable to use Thin mode; a minimal sketch of enabling it is shown below, followed by the Thin-mode connection code used in the rest of this guide.
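Enabling Thick mode is a single call to oracledb.init_oracle_client() made before connecting. The lib_dir path below is illustrative; it can usually be omitted when the Oracle Client libraries are already on the system library search path.
import oracledb

# Load the Oracle Client libraries to switch python-oracledb into Thick mode.
# The lib_dir value is illustrative; point it at your Instant Client location,
# or omit it if the libraries are already discoverable.
oracledb.init_oracle_client(lib_dir="/opt/oracle/instantclient")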
import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname/service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    print("Connection failed!")
    sys.exit(1)
Connection successful!
Populate a Demo Table¶
Create a demo table and insert some sample documents.
try:
    cursor = conn.cursor()

    drop_table_sql = """drop table demo_tab"""
    cursor.execute(drop_table_sql)

    create_table_sql = """create table demo_tab (id number, data clob)"""
    cursor.execute(create_table_sql)

    insert_row_sql = """insert into demo_tab values (:1, :2)"""
    rows_to_insert = [
        (
            1,
            "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        ),
        (
            2,
            "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        ),
        (
            3,
            "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        ),
    ]
    cursor.executemany(insert_row_sql, rows_to_insert)
    conn.commit()

    print("Table created and populated.")
    cursor.close()
except Exception as e:
    print("Table creation failed.")
    cursor.close()
    conn.close()
    sys.exit(1)
Table created and populated.
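As a quick sanity check, the rows that were just inserted can be counted directly; a minimal sketch reusing the open connection:
# verify the demo table contents (illustrative check)
with conn.cursor() as cursor:
    cursor.execute("select count(*) from demo_tab")
    print(f"Rows in demo_tab: {cursor.fetchone()[0]}")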
With the demo user and the populated sample table in place, the remaining configuration involves setting up the embedding and summary functionality. Users have several provider options, including local database solutions and third-party services such as OCIGENAI, Hugging Face, and OpenAI. If a third-party provider is chosen, a credential containing the necessary authentication details must be created. Conversely, if the database is chosen as the embedding provider, an ONNX model must be uploaded to Oracle Database. No additional setup is required for the summary functionality when the database option is used.
Load ONNX Model¶
Oracle accommodates a variety of embedding providers, enabling users to choose between proprietary database solutions and third-party services such as OCIGENAI and HuggingFace. This selection determines how embeddings are generated and managed.
Important: If users opt for the database option, they must upload an ONNX model into Oracle Database. Conversely, if a third-party provider is selected for embedding generation, uploading an ONNX model to Oracle Database is not required.
A significant advantage of using an ONNX model directly within Oracle is the enhanced security and performance it offers, since data does not need to be transmitted to external parties. Additionally, this approach avoids the latency typically associated with network or REST API calls.
Below is sample code for uploading an ONNX model into Oracle Database:
from llama_index.embeddings.oracleai import OracleEmbeddings

# please update with your related information
# make sure that you have onnx file in the system
onnx_dir = "DEMO_PY_DIR"
onnx_file = "tinybert.onnx"
model_name = "demo_model"

try:
    OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)
    print("ONNX model loaded.")
except Exception as e:
    print("ONNX model loading failed!")
    sys.exit(1)
ONNX model loaded.
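To confirm that the model is usable, a single embedding can be generated with it right away; a minimal sketch (the sample string is illustrative):
# quick check of the loaded model: embed one short string
embedder_check = OracleEmbeddings(
    conn=conn, params={"provider": "database", "model": "demo_model"}
)
vec = embedder_check._get_text_embedding("Oracle AI Vector Search")
print(f"Embedding dimension: {len(vec)}")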
Create Credential¶
When selecting a third-party provider for generating embeddings, users need to create a credential to securely access the provider's endpoints.
Important: No credential is required when the 'database' provider is selected for embedding generation. However, if a third-party provider is used, a credential specific to the chosen provider must be created.
An illustrative example follows:
try:
    cursor = conn.cursor()
    cursor.execute(
        """
        declare
            jo json_object_t;
        begin
            -- HuggingFace
            dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');
            jo := json_object_t();
            jo.put('access_token', '<access_token>');
            dbms_vector_chain.create_credential(
                credential_name => 'HF_CRED',
                params => json(jo.to_string));

            -- OCIGENAI
            dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');
            jo := json_object_t();
            jo.put('user_ocid','<user_ocid>');
            jo.put('tenancy_ocid','<tenancy_ocid>');
            jo.put('compartment_ocid','<compartment_ocid>');
            jo.put('private_key','<private_key>');
            jo.put('fingerprint','<fingerprint>');
            dbms_vector_chain.create_credential(
                credential_name => 'OCI_CRED',
                params => json(jo.to_string));
        end;
        """
    )
    cursor.close()
    print("Credentials created.")
except Exception as ex:
    cursor.close()
    raise
Load Documents¶
Users have the flexibility to load documents from Oracle Database, a file system, or both, by configuring the loader parameters accordingly. For details on these parameters, consult the Oracle AI Vector Search Guide.
A significant advantage of using OracleReader is its ability to process over 150 distinct file formats, eliminating the need for multiple loaders for different document types. For the complete list of supported formats, refer to the Oracle Text Supported Document Formats documentation.
Below is a sample code snippet demonstrating how to use OracleReader:
from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleReader
# loading from Oracle Database table
# make sure you have the table with this specification
loader_params = {}
loader_params = {
"owner": "testuser",
"tablename": "demo_tab",
"colname": "data",
}
""" load the docs """
loader = OracleReader(conn=conn, params=loader_params)
docs = loader.load()
""" verify """
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].text}") # content
Number of docs loaded: 3
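OracleReader can also read documents from a file system rather than a table. A minimal sketch is shown below; the directory path is illustrative, and the dir/file parameter names are assumptions based on the Oracle AI Vector Search documentation for the document loader, so verify them against your release.
# loading every supported document under a local directory (path is illustrative;
# the "dir" parameter name is an assumption, see the Oracle AI Vector Search Guide)
fs_loader_params = {"dir": "/data/demo_docs"}
fs_loader = OracleReader(conn=conn, params=fs_loader_params)
fs_docs = fs_loader.load()
print(f"Number of docs loaded from the file system: {len(fs_docs)}")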
Generate Summary¶
Now that the documents are loaded, users may want to generate a summary for each one. The Oracle AI Vector Search llama_index library offers a suite of APIs designed for document summarization. It supports several summarization providers such as Database, OCIGENAI, HuggingFace, and others, allowing users to select the provider that best meets their needs. To use these capabilities, users must configure the summary parameters as specified. For detailed information on these parameters, consult the Oracle AI Vector Search Guide book.
Note: Users may need to set a proxy if they want to use a third-party summary generation provider other than Oracle's in-house, default 'database' provider. If you do not have a proxy, remove the proxy parameter when instantiating OracleSummary.
# proxy to be used when we instantiate summary and embedder object
proxy = ""
The following sample code shows how to generate a summary:
from llama_index.core.schema import Document
from llama_index.utils.oracleai import OracleSummary

# using 'database' provider
summary_params = {
    "provider": "database",
    "glevel": "S",
    "numParagraphs": 1,
    "language": "english",
}

# get the summary instance
# Remove proxy if not required
summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)

list_summary = []
for doc in docs:
    summary = summ.get_summary(doc.text)
    list_summary.append(summary)

""" verify """
print(f"Number of Summaries: {len(list_summary)}")
# print(f"Summary-0: {list_summary[0]}") #content
Number of Summaries: 3
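For comparison, here is a hedged sketch of what the parameters might look like for a third-party summary provider; the endpoint URL, model name, and credential name are illustrative placeholders that must be adapted to your tenancy and to the Oracle AI Vector Search Guide.
# illustrative third-party summary configuration (values are placeholders, not defaults)
ocigenai_summary_params = {
    "provider": "ocigenai",
    "credential_name": "OCI_CRED",  # credential created earlier
    "url": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/summarizeText",
    "model": "cohere.command",
}
# a proxy may be required to reach the external endpoint
summ_ocigenai = OracleSummary(conn=conn, params=ocigenai_summary_params, proxy=proxy)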
Split Documents¶
Documents can vary in size, ranging from small to very large. Users often prefer to chunk their documents into smaller pieces to facilitate embedding generation. A wide range of customization options is available for this splitting process. For detailed information about these parameters, consult the Oracle AI Vector Search Guide.
Below is sample code showing how to do this:
from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleTextSplitter

# split by default parameters
splitter_params = {"normalize": "all"}

""" get the splitter instance """
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.text)
    list_chunks.extend(chunks)

""" verify """
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}") # content
Number of Chunks: 3
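The defaults can be overridden to control chunk size and boundaries. The sketch below uses parameter names documented for Oracle's chunking utilities (by, max, overlap, split, normalize); the values are illustrative, so check the Oracle AI Vector Search Guide for their exact semantics in your release.
# illustrative custom chunking: word-based chunks with sentence-aware splits
custom_splitter_params = {
    "by": "words",        # unit used to measure chunk size
    "max": 100,           # maximum chunk size in that unit
    "overlap": 10,        # overlap between adjacent chunks
    "split": "sentence",  # prefer splitting at sentence boundaries
    "normalize": "all",
}
custom_splitter = OracleTextSplitter(conn=conn, params=custom_splitter_params)
print(f"Chunks for doc-0: {len(custom_splitter.split_text(docs[0].text))}")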
Generate Embeddings¶
Now that the documents are chunked as per requirements, users may want to generate embeddings for these chunks. Oracle AI Vector Search offers several ways to generate embeddings, using either locally hosted ONNX models or third-party APIs. For comprehensive instructions on configuring these alternatives, consult the Oracle AI Vector Search Guide.
Note: Users may need to configure a proxy to use third-party embedding generation providers, except for the 'database' provider that uses an ONNX model.
# proxy to be used when we instantiate summary and embedder object
proxy = ""
The following sample code shows how to generate embeddings:
from llama_index.core.schema import Document
from llama_index.embeddings.oracleai import OracleEmbeddings

# using ONNX model loaded to Oracle Database
embedder_params = {"provider": "database", "model": "demo_model"}

# get the embedding instance
# Remove proxy if not required
embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)

embeddings = []
for doc in docs:
    chunks = splitter.split_text(doc.text)
    for chunk in chunks:
        embed = embedder._get_text_embedding(chunk)
        embeddings.append(embed)

""" verify """
print(f"Number of embeddings: {len(embeddings)}")
# print(f"Embedding-0: {embeddings[0]}") # content
Number of embeddings: 3
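If a third-party provider is preferred over the in-database ONNX model, the embedder is configured the same way with provider-specific parameters. The sketch below is illustrative: the credential name matches the one created earlier, while the URL and model name are placeholders to be replaced according to the Oracle AI Vector Search Guide.
# illustrative Hugging Face embedding configuration (URL and model are placeholders)
hf_embedder_params = {
    "provider": "huggingface",
    "credential_name": "HF_CRED",
    "url": "https://api-inference.huggingface.co/pipeline/feature-extraction/",
    "model": "sentence-transformers/all-MiniLM-L6-v2",
}
# a proxy may be required to reach the external endpoint
hf_embedder = OracleEmbeddings(conn=conn, params=hf_embedder_params, proxy=proxy)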
Create Oracle AI Vector Store¶
Now that you know how to use the Oracle AI llama_index library APIs individually to process documents, let us show how to integrate with the Oracle AI Vector Store to facilitate semantic search.
First, let's import all the dependencies.
import sys
import oracledb
from llama_index.core.schema import Document, TextNode
from llama_index.readers.oracleai import OracleReader, OracleTextSplitter
from llama_index.embeddings.oracleai import OracleEmbeddings
from llama_index.utils.oracleai import OracleSummary
from llama_index.vector_stores.oracledb import OraLlamaVS, DistanceStrategy
from llama_index.vector_stores.oracledb import base as orallamavs
Next, let's combine all the document processing stages. Here is the sample code:
"""
In this sample example, we will use 'database' provider for both summary and embeddings.
So, we don't need to do the followings:
- set proxy for 3rd party providers
- create credential for 3rd party providers
If you choose to use 3rd party provider,
please follow the necessary steps for proxy and credential.
"""
# oracle connection
# please update with your username, password, hostname, and service_name
username = "testuser"
password = "testuser"
dsn = "<hostname/service_name>"
try:
conn = oracledb.connect(user=username, password=password, dsn=dsn)
print("Connection successful!")
except Exception as e:
print("Connection failed!")
sys.exit(1)
# load onnx model
# please update with your related information
onnx_dir = "DEMO_PY_DIR"
onnx_file = "tinybert.onnx"
model_name = "demo_model"
try:
OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)
print("ONNX model loaded.")
except Exception as e:
print("ONNX model loading failed!")
sys.exit(1)
# params
# please update necessary fields with related information
loader_params = {
"owner": "testuser",
"tablename": "demo_tab",
"colname": "data",
}
summary_params = {
"provider": "database",
"glevel": "S",
"numParagraphs": 1,
"language": "english",
}
splitter_params = {"normalize": "all"}
embedder_params = {"provider": "database", "model": "demo_model"}
# instantiate loader, summary, splitter, and embedder
loader = OracleReader(conn=conn, params=loader_params)
summary = OracleSummary(conn=conn, params=summary_params)
splitter = OracleTextSplitter(conn=conn, params=splitter_params)
embedder = OracleEmbeddings(conn=conn, params=embedder_params)
# process the documents
loader = OracleReader(conn=conn, params=loader_params)
docs = loader.load()
chunks_with_mdata = []
for id, doc in enumerate(docs, start=1):
summ = summary.get_summary(doc.text)
chunks = splitter.split_text(doc.text)
for ic, chunk in enumerate(chunks, start=1):
chunk_metadata = doc.metadata.copy()
chunk_metadata["id"] = (
chunk_metadata["_oid"] + "$" + str(id) + "$" + str(ic)
)
chunk_metadata["document_id"] = str(id)
chunk_metadata["document_summary"] = str(summ[0])
textnode = TextNode(
text=chunk,
id_=chunk_metadata["id"],
embedding=embedder._get_text_embedding(chunk),
metadata=chunk_metadata,
)
chunks_with_mdata.append(textnode)
""" verify """
print(f"Number of total chunks with metadata: {len(chunks_with_mdata)}")
Connection successful!
ONNX model loaded.
Number of total chunks with metadata: 3
At this point, we have processed the documents and generated chunks with metadata. Next, we will create the Oracle AI Vector Store with those chunks.
Here is sample code for creating it:
# create Oracle AI Vector Store
vectorstore = OraLlamaVS.from_documents(
client=conn,
docs=chunks_with_mdata,
table_name="oravs",
distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
""" verify """
print(f"Vector Store Table: {vectorstore.table_name}")
Vector Store Table: oravs
The example above creates a vector store with the DOT_PRODUCT distance strategy. Users have the flexibility to use other distance strategies with the Oracle AI Vector Store, as detailed in our comprehensive guide; a cosine-based store, for example, is sketched below.
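A minimal sketch of the same call with a different strategy; this assumes DistanceStrategy.COSINE is exposed by this integration, so consult the guide for the strategies available in your version.
# illustrative: same chunks, cosine distance, separate table
# (DistanceStrategy.COSINE is assumed to be available in this integration)
vectorstore_cosine = OraLlamaVS.from_documents(
    client=conn,
    docs=chunks_with_mdata,
    table_name="oravs_cosine",
    distance_strategy=DistanceStrategy.COSINE,
)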
With the embeddings now stored in a vector store, it is advisable to create an index to improve semantic search performance during query execution.
Note: If you encounter an "insufficient memory" error, it is recommended to increase the vector_memory_size in your database configuration.
Below is a sample code snippet for creating an index:
orallamavs.create_index(
conn, vectorstore, params={"idx_name": "hnsw_oravs", "idx_type": "HNSW"}
)
print("Index created.")
This example demonstrates the creation of a default HNSW index on the embeddings stored in the 'oravs' table. Users can adjust various parameters according to their specific needs; for details on these parameters, consult the Oracle AI Vector Search Guide book.
Additionally, other types of vector indexes can be created to meet different requirements, as sketched below; more details are available in our comprehensive guide.
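For instance, an IVF index can be requested through the same helper. The idx_type value below follows the Oracle AI Vector Search documentation and the index name is illustrative; verify both against your release.
# illustrative: an IVF index on the same vector store (index name is a placeholder)
orallamavs.create_index(
    conn, vectorstore, params={"idx_name": "ivf_oravs", "idx_type": "IVF"}
)
print("IVF index created.")
With the store populated and indexed, the queries below exercise plain similarity search, filtered search, scored search, and max marginal relevance search.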
query = "What is Oracle AI Vector Store?"
filter = {"document_id": ["1"]}
# Similarity search without a filter
print(vectorstore.similarity_search(query, 1))
# Similarity search with a filter
print(vectorstore.similarity_search(query, 1, filter=filter))
# Similarity search with relevance score
print(vectorstore.similarity_search_with_score(query, 1))
# Similarity search with relevance score with filter
print(vectorstore.similarity_search_with_score(query, 1, filter=filter))
# Max marginal relevance search
print(
vectorstore.max_marginal_relevance_search(
query, 1, fetch_k=20, lambda_mult=0.5
)
)
# Max marginal relevance search with filter
print(
vectorstore.max_marginal_relevance_search(
query, 1, fetch_k=20, lambda_mult=0.5, filter=filter
)
)
[Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\n\n'})]
[]
[(Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\n\n'}), 0.055675752460956573)]
[]
[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\n\n'})]
[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\n\n'})]