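Oracle AI Vector Search: Document Processing¶

Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and lets you query data based on semantics rather than keywords. One of its biggest benefits is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also significantly more efficient, because you do not need to add a specialized vector database, which eliminates the pain of data fragmentation between multiple systems.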

In addition, your vectors can benefit from all of Oracle Database's most powerful features.

This guide demonstrates how to use the document processing capabilities of Oracle AI Vector Search to load and chunk documents, using OracleReader and OracleTextSplitter respectively.

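If you are just starting with Oracle Database, consider exploring the free Oracle 23 AI, which provides a good introduction to setting up your database environment. When working with the database, it is generally advisable to avoid using the system user by default; instead, create your own user for better security and customization. For detailed steps on user creation, refer to our end-to-end guide, which also shows how to set up a user in Oracle. Understanding user privileges is also crucial for managing database security effectively; you can learn more in the official Oracle guide on administering user accounts and security.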

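Prerequisites¶

Install the llama-index-readers-oracleai package to use llama_index with Oracle AI Vector Search. The examples in this guide also rely on the Oracle Python client driver, python-oracledb, which is imported as oracledb.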

%pip install llama-index-readers-oracleai

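Connect to Oracle Database¶

The following sample code shows how to connect to Oracle Database. By default, python-oracledb runs in "Thin" mode, which connects directly to Oracle Database and does not need Oracle Client libraries. Some additional functionality is available when python-oracledb uses Oracle Client libraries; in that case it is said to be in "Thick" mode. Both modes have comprehensive functionality supporting the Python Database API v2.0 Specification. See the python-oracledb guide that describes the features supported in each mode. If you are unable to use Thin mode, you may want to switch to Thick mode.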
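
For reference, switching to Thick mode could look like the minimal sketch below. The helper name enable_thick_mode is hypothetical and not part of this guide; it assumes that python-oracledb and the Oracle Client libraries are installed, and that it is called before the first connection is created. The connection example that follows uses the default Thin mode and does not need this step.

```python
from typing import Optional


def enable_thick_mode(lib_dir: Optional[str] = None) -> bool:
    """Switch python-oracledb to Thick mode (hypothetical helper).

    lib_dir optionally points at the Oracle Client library directory;
    when it is None the driver searches its default locations.
    Returns True if the driver reports that Thick mode is active.
    """
    # Imported lazily so the helper can be defined without the driver installed.
    import oracledb

    # Loads the Oracle Client libraries; call this before connecting.
    oracledb.init_oracle_client(lib_dir=lib_dir)
    return not oracledb.is_thin_mode()
```

Calling enable_thick_mode() once at program start, before oracledb.connect, is the intended usage of this sketch.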

import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    # Include the driver error so the cause of the failure is visible
    print(f"Connection failed: {e}")
    sys.exit(1)
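
Rather than editing the placeholders in place, the connection settings can be read from the environment so that credentials stay out of the notebook. The sketch below is one possible approach, not something this guide prescribes; the helper name read_connection_settings and the variable names ORACLE_USER, ORACLE_PASSWORD, ORACLE_HOST and ORACLE_SERVICE are assumptions chosen for illustration.

```python
import os


def read_connection_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names below are hypothetical; rename them to match
    your own environment.
    """
    required = ["ORACLE_USER", "ORACLE_PASSWORD", "ORACLE_HOST", "ORACLE_SERVICE"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user": env["ORACLE_USER"],
        "password": env["ORACLE_PASSWORD"],
        # Same "<hostname>/<service_name>" form as the dsn in the cell above
        "dsn": f"{env['ORACLE_HOST']}/{env['ORACLE_SERVICE']}",
    }


# Demonstration with an explicit mapping instead of the real environment
settings = read_connection_settings(
    {
        "ORACLE_USER": "demo_user",
        "ORACLE_PASSWORD": "demo_password",
        "ORACLE_HOST": "dbhost",
        "ORACLE_SERVICE": "freepdb1",
    }
)
print(settings["dsn"])
```

The returned dictionary uses the same keyword names as the connect call above, so it can be passed as oracledb.connect(**read_connection_settings()).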

Now let's create a table and insert some sample documents to test with.

try:
    cursor = conn.cursor()

    drop_table_sql = """drop table if exists demo_tab"""
    cursor.execute(drop_table_sql)

    create_table_sql = """create table demo_tab (id number, data clob)"""
    cursor.execute(create_table_sql)

    insert_row_sql = """insert into demo_tab values (:1, :2)"""
    rows_to_insert = [
        (
            1,
            "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        ),
        (
            2,
            "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        ),
        (
            3,
            "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        ),
    ]
    cursor.executemany(insert_row_sql, rows_to_insert)

    conn.commit()

    print("Table created and populated.")
    cursor.close()
except Exception as e:
    # Include the database error so the cause of the failure is visible
    print(f"Table creation failed: {e}")
    cursor.close()
    conn.close()
    sys.exit(1)
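
The insert statement above uses two positional binds, :1 and :2, so each row is a tuple of an id and a text. To load your own texts instead of the three samples, the rows can be built from a plain list. The sketch below shows one way to do that; build_rows is a hypothetical helper, not part of any library.

```python
def build_rows(texts, start_id=1):
    """Pair each text with a sequential numeric id, in (:1, :2) bind order."""
    return [(doc_id, text) for doc_id, text in enumerate(texts, start=start_id)]


rows = build_rows(["first document", "second document"])
print(rows)
```

The resulting list has the same shape as rows_to_insert and can be passed to cursor.executemany in the same way.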

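Load Documents¶

Users have the flexibility to load documents from Oracle Database, a file system, or both, by setting the loader parameters appropriately. For details on these parameters, consult the Oracle AI Vector Search guide.

A significant advantage of OracleReader is that it can process over 150 different file formats, eliminating the need for separate loaders for different document types. For the complete list of supported formats, refer to the Oracle Text Supported Document Formats.

The following sample code snippet shows how to use OracleReader: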

from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleReader

"""
# loading a local file
loader_params = {}
loader_params["file"] = "<file>"

# loading from a local directory
loader_params = {}
loader_params["dir"] = "<directory>"
"""

# loading from Oracle Database table
loader_params = {
    "owner": "<owner>",
    "tablename": "demo_tab",
    "colname": "data",
}

""" load the docs """
loader = OracleReader(conn=conn, params=loader_params)
docs = loader.load()

""" verify """
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].text}") # content
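
The three parameter shapes used above (a single file, a directory, or a database table) can be assembled by a small helper that catches incomplete configurations before the reader is created. This is a minimal sketch: make_loader_params is a hypothetical function, and its rule that exactly one source is given per call is an assumption of the sketch, not a documented restriction of the reader.

```python
from typing import Dict, Optional

# Keys used by the table-based configuration shown above
TABLE_KEYS = ("owner", "tablename", "colname")


def make_loader_params(
    *,
    file: Optional[str] = None,
    directory: Optional[str] = None,
    table: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build a loader_params dict for one source (hypothetical helper)."""
    chosen = [value for value in (file, directory, table) if value is not None]
    if len(chosen) != 1:
        raise ValueError("Pass exactly one of file, directory or table")
    if file is not None:
        return {"file": file}
    if directory is not None:
        return {"dir": directory}
    # Table source: all three keys must be present and non-empty
    missing = [key for key in TABLE_KEYS if not table.get(key)]
    if missing:
        raise ValueError(f"Missing table settings: {', '.join(missing)}")
    return {key: table[key] for key in TABLE_KEYS}


params = make_loader_params(
    table={"owner": "demo_owner", "tablename": "demo_tab", "colname": "data"}
)
print(params)
```

The returned dictionary can be passed as the params argument in place of the hand-written loader_params.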

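Split Documents¶

Documents can vary widely in size, from small to very large. Users often prefer to chunk documents into smaller pieces to make generating embeddings easier. This splitting process offers a wide range of customization options. For details on these parameters, consult the Oracle AI Vector Search guide.

The following sample code shows how to do this: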

from llama_index.core.schema import Document
from llama_index.readers.oracleai import OracleTextSplitter

"""
# Some examples
# split by chars, max 500 chars
splitter_params = {"split": "chars", "max": 500, "normalize": "all"}

# split by words, max 100 words
splitter_params = {"split": "words", "max": 100, "normalize": "all"}

# split by sentence, max 20 sentences
splitter_params = {"split": "sentence", "max": 20, "normalize": "all"}
"""

# split by default parameters
splitter_params = {"normalize": "all"}

# get the splitter instance
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.text)
    list_chunks.extend(chunks)

""" verify """
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}") # content
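
The loop above flattens all chunks into one list, which loses track of which document each chunk came from. If that link matters later, for example when attaching metadata, the chunks can be collected together with the index of their source document. The sketch below illustrates the idea without a database: split_by_words is a plain-Python stand-in that only mimics a word-based limit and is not how OracleTextSplitter works internally, and chunk_with_source is a hypothetical helper.

```python
def split_by_words(text, max_words=100):
    """Stand-in splitter: groups whitespace-separated words into chunks."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)
    ]


def chunk_with_source(texts, split_fn):
    """Return (document_index, chunk) pairs so each chunk keeps its origin."""
    pairs = []
    for doc_index, text in enumerate(texts):
        for chunk in split_fn(text):
            pairs.append((doc_index, chunk))
    return pairs


sample_texts = [
    "A tablespace can be online or offline whenever the database is open.",
    "The database stores LOBs differently.",
]
pairs = chunk_with_source(sample_texts, lambda t: split_by_words(t, max_words=5))
for doc_index, chunk in pairs:
    print(doc_index, chunk)
```

With the real splitter, the same helper could be called as chunk_with_source([doc.text for doc in docs], splitter.split_text).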

End-to-End Demo¶

Refer to our complete demo guide, Oracle AI Vector Search End-to-End Demo Guide, to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.