Text-to-SQL Guide (Query Engine + Retriever)

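This is a basic guide to LlamaIndex's Text-to-SQL capabilities.

We first show how to perform Text-to-SQL over a toy dataset: this performs both "retrieval" (a SQL query against the database) and "synthesis".
We then show how to build a TableIndex over the schema so that relevant tables can be retrieved dynamically at query time.
Finally, we show how to define a Text-to-SQL retriever on its own.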

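NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, or sandboxing.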
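As one illustration of the "read-only database" precaution, here is a minimal sketch that uses only the Python standard library. It assumes a file-backed SQLite database; the file name `city_stats_demo.db` and the tiny table are made up for the demo and are not part of this guide's setup. Other databases would typically rely on a restricted role instead.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database file so there is something to open read-only.
# (Hypothetical demo file; the guide itself uses an in-memory database.)
db_path = os.path.join(tempfile.mkdtemp(), "city_stats_demo.db")
setup_conn = sqlite3.connect(db_path)
setup_conn.execute(
    "CREATE TABLE city_stats (city_name TEXT PRIMARY KEY, population INTEGER)"
)
setup_conn.execute("INSERT INTO city_stats VALUES ('Tokyo', 13960000)")
setup_conn.commit()
setup_conn.close()

# Reopen the same file in read-only mode through a SQLite URI.
ro_conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

# Reads succeed as usual.
rows = ro_conn.execute("SELECT city_name, population FROM city_stats").fetchall()

# Writes are rejected by SQLite itself, whatever SQL the LLM generated.
try:
    ro_conn.execute("DELETE FROM city_stats")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True

ro_conn.close()
print(rows, write_blocked)
```

The point of the design is that the restriction lives in the connection, not in a filter over generated SQL, so it does not depend on correctly parsing whatever the model produces.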

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-llms-openai

!pip install llama-index

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-.."
openai.api_key = os.environ["OPENAI_API_KEY"]

# import logging
# import sys

# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from IPython.display import Markdown, display

Create Database Schema

We use sqlalchemy, a popular SQL database toolkit, to create an empty city_stats table.


from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
)

engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

Define SQL Database

We first define our SQLDatabase abstraction (a light wrapper around SQLAlchemy).

from llama_index.core import SQLDatabase
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

We add some test data to our SQL database.

sql_database = SQLDatabase(engine, include_tables=["city_stats"])
from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {
        "city_name": "Chicago",
        "population": 2679000,
        "country": "United States",
    },
    {"city_name": "Seoul", "population": 9776000, "country": "South Korea"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)

# view current table
stmt = select(
    city_stats_table.c.city_name,
    city_stats_table.c.population,
    city_stats_table.c.country,
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)

[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Chicago', 2679000, 'United States'), ('Seoul', 9776000, 'South Korea')]

Query Index

We first show how to execute a raw SQL query, which runs directly against the table.

from sqlalchemy import text

with engine.connect() as con:
    rows = con.execute(text("SELECT city_name from city_stats"))
    for row in rows:
        print(row)

('Chicago',)
('Seoul',)
('Tokyo',)
('Toronto',)

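Part 1: Text-to-SQL Query Engine

Once we have constructed our SQL database, we can use the NLSQLTableQueryEngine to construct natural language queries that are synthesized into SQL queries.

Note that we need to specify the tables we want to use with this query engine. If we don't, the query engine will pull in all of the schema context, which could overflow the context window of the LLM.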
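To make "schema context" concrete, here is a rough sketch of how a Text-to-SQL prompt could be assembled from only the selected tables. It uses the standard-library `sqlite3` module rather than LlamaIndex; the `build_schema_context` helper, the prompt wording, and the `unrelated_logs` table are all assumptions made for illustration and do not reflect the library's actual prompt template.

```python
import sqlite3


def build_schema_context(conn, tables):
    """Describe only the listed tables (hypothetical helper)."""
    lines = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        # Table names are interpolated directly, so pass trusted names only.
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        col_desc = ", ".join(f"{name} ({ctype})" for _, name, ctype, *_ in cols)
        lines.append(f"Table '{table}' has columns: {col_desc}.")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, "
    "population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)
# A second, irrelevant table that we deliberately leave out of the prompt.
conn.execute("CREATE TABLE unrelated_logs (id INTEGER, message TEXT)")

schema_context = build_schema_context(conn, ["city_stats"])
prompt = (
    f"{schema_context}\n"
    "Question: Which city has the highest population?\n"
    "SQLQuery:"
)
print(prompt)
```

Restricting the table list keeps the prompt proportional to the tables that matter instead of to the size of the whole database.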

from llama_index.core.query_engine import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database, tables=["city_stats"], llm=llm
)
query_str = "Which city has the highest population?"
response = query_engine.query(query_str)

display(Markdown(f"<b>{response}</b>"))

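The city with the highest population is Tokyo.

This query engine should be used in any case where you can specify the tables you want to query over beforehand, or where the total size of all the table schemas plus the rest of the prompt fits within your context window.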
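A quick way to reason about whether the schemas "fit" is a back-of-the-envelope token estimate. The sketch below is only a heuristic: the four-characters-per-token ratio, the function names, and the example numbers are assumptions, and a real tokenizer for your model would give a more reliable count.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumed ratio, not a real tokenizer)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def schemas_fit(schema_texts, prompt_overhead_tokens, context_window):
    """Return (fits, estimated_total_tokens) for the given schema strings."""
    total = prompt_overhead_tokens + sum(estimate_tokens(s) for s in schema_texts)
    return total <= context_window, total


# Two small schema descriptions: 400 and 800 characters.
small = ["a" * 400, "b" * 800]
fits_small, total_small = schemas_fit(
    small, prompt_overhead_tokens=500, context_window=4096
)

# Two thousand schema descriptions of 400 characters each.
large = ["c" * 400] * 2000
fits_large, total_large = schemas_fit(
    large, prompt_overhead_tokens=500, context_window=4096
)

print(fits_small, total_small)
print(fits_large, total_large)
```

When the estimate lands near or above the window, the query-time table retrieval described in Part 2 is the safer choice.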

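Part 2: Query-Time Retrieval of Tables for Text-to-SQL

If we don't know ahead of time which table we would like to use, and the total size of the table schemas overflows your context window size, we should store the table schemas in an index so that the right schema can be retrieved at query time.

The way to do this is with the SQLTableNodeMapping object, which takes in a SQLDatabase and produces a Node object for each SQLTableSchema object passed into the ObjectIndex constructor.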
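The idea of retrieving table schemas at query time can be illustrated without any embeddings. The toy sketch below ranks tables by word overlap between the question and a short description of each table; this is a deliberately simplified stand-in for the vector similarity search that an index would perform. The `order_items` and `employees` tables, their descriptions, and the helper names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_tables(question, table_descriptions, top_k=1):
    """Rank tables by word overlap with the question (toy similarity)."""
    q_tokens = tokenize(question)
    scored = []
    for name, desc in table_descriptions.items():
        overlap = len(q_tokens & tokenize(f"{name} {desc}"))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by table name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical schema descriptions; only city_stats exists in this guide.
table_descriptions = {
    "city_stats": "population and country of a given city",
    "order_items": "products and quantities in each customer order",
    "employees": "staff names, salaries and departments",
}

selected = retrieve_tables(
    "Which city has the highest population?", table_descriptions, top_k=1
)
print(selected)
```

Only the schemas of the selected tables would then be placed in the prompt, which is what keeps the prompt small when a database has many tables.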

from llama_index.core.indices.struct_store.sql_query import (
    SQLTableRetrieverQueryEngine,
)
from llama_index.core.objects import (
    SQLTableNodeMapping,
    ObjectIndex,
    SQLTableSchema,
)
from llama_index.core import VectorStoreIndex

# set Logging to DEBUG for more detailed outputs
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats"))
]  # add a SQLTableSchema for each table

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=1)
)

Now we can take our SQLTableRetrieverQueryEngine and query it for a response.

response = query_engine.query("Which city has the highest population?")
display(Markdown(f"<b>{response}</b>"))

The city with the highest population is Tokyo.

# you can also fetch the raw result from SQLAlchemy!
response.metadata["result"]

[('Tokyo',)]

You can also add additional context information for each table schema you define.

# manually set context text
city_stats_text = (
    "This table gives information regarding the population and country of a"
    " given city.\nThe user will query with codewords, where 'foo' corresponds"
    " to population and 'bar' corresponds to city."
)

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats", context_str=city_stats_text))
]

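Part 3: Text-to-SQL Retriever

So far our Text-to-SQL capability is packaged in a query engine and consists of both retrieval and synthesis.

You can use the SQL Retriever on its own. We show you some different parameters you can try, and also show how to plug it into our RetrieverQueryEngine to get roughly the same results.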

from llama_index.core.retrievers import NLSQLRetriever

# default retrieval (return_raw=True)
nl_sql_retriever = NLSQLRetriever(
    sql_database, tables=["city_stats"], return_raw=True
)

results = nl_sql_retriever.retrieve(
    "Return the top 5 cities (along with their populations) with the highest population."
)

from llama_index.core.response.notebook_utils import display_source_node

for n in results:
    display_source_node(n)

Node ID: 458f723e-f1ac-4423-917a-522a71763390
Similarity: None
Text: [('Tokyo', 13960000), ('Seoul', 9776000), ('Toronto', 2930000), ('Chicago', 2679000)]

# default retrieval (return_raw=False)
nl_sql_retriever = NLSQLRetriever(
    sql_database, tables=["city_stats"], return_raw=False
)

results = nl_sql_retriever.retrieve(
    "Return the top 5 cities (along with their populations) with the highest population."
)

# NOTE: all the content is in the metadata
for n in results:
    display_source_node(n, show_source_metadata=True)

Node ID: 7c0e4c94-c9a6-4917-aa3f-e3b3f4cbcd5c
Similarity: None
Text:
Metadata: {'city_name': 'Tokyo', 'population': 13960000}

Node ID: 3c1d1caa-cec2-451e-8fd1-adc944e1d050
Similarity: None
Text:
Metadata: {'city_name': 'Seoul', 'population': 9776000}

Node ID: fb9f9b25-b913-4dde-a0e3-6111f704aea9
Similarity: None
Text:
Metadata: {'city_name': 'Toronto', 'population': 2930000}

Node ID: c31ba8e7-de5d-4f28-a464-5e0339547c70
Similarity: None
Text:
Metadata: {'city_name': 'Chicago', 'population': 2679000}
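The two outputs above differ only in shape: with return_raw=True the whole result set arrives as one text blob, while with return_raw=False each row becomes its own node whose metadata maps column names to values. The plain-Python sketch below reproduces those two shapes from the same rows. It only illustrates the output formats shown above; the helper names are made up, and it says nothing about how the retriever is implemented internally.

```python
columns = ["city_name", "population"]
rows = [
    ("Tokyo", 13960000),
    ("Seoul", 9776000),
    ("Toronto", 2930000),
    ("Chicago", 2679000),
]


def as_raw_text(rows):
    """One string holding every row, like the single-node output."""
    return str(rows)


def as_row_metadata(columns, rows):
    """One dict per row, like the per-row metadata output."""
    return [dict(zip(columns, row)) for row in rows]


raw_text = as_raw_text(rows)
row_metadata = as_row_metadata(columns, rows)

print(raw_text)
for metadata in row_metadata:
    print(metadata)
```

The per-row form is easier to filter or post-process programmatically, while the single text blob is convenient to hand directly to a response synthesizer.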

Plug into our `RetrieverQueryEngine`

We compose our SQL Retriever with our standard RetrieverQueryEngine to synthesize a response. The result is roughly similar to our packaged Text-to-SQL query engines.

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(nl_sql_retriever)

response = query_engine.query(
    "Return the top 5 cities (along with their populations) with the highest population."
)

print(str(response))

The top 5 cities with the highest population are:

1. Tokyo - 13,960,000
2. Seoul - 9,776,000
3. Toronto - 2,930,000
4. Chicago - 2,679,000
