Pandas 查询引擎¶
本指南将向您展示如何使用我们的 PandasQueryEngine
:利用大型语言模型 (LLM) 将自然语言转换为 Pandas Python 代码。
PandasQueryEngine
的输入是一个 Pandas 数据框 (dataframe),输出是一个响应。LLM 会推断需要执行哪些数据框操作来获取结果。
警告:此工具允许 LLM 访问 `eval` 函数。运行此工具的机器上可能会执行任意代码。尽管对代码进行了一定程度的过滤,但在没有严格沙箱隔离或虚拟机的情况下,不建议在生产环境中使用此工具。
如果您在 colab 上打开此 Notebook,您可能需要安装 LlamaIndex 🦙。
In [ ]
已复制!
!pip install llama-index llama-index-experimental
!pip install llama-index llama-index-experimental
In [ ]
已复制!
import logging
import sys
from IPython.display import Markdown, display
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging import sys from IPython.display import Markdown, display import pandas as pd from llama_index.experimental.query_engine import PandasQueryEngine logging.basicConfig(stream=sys.stdout, level=logging.INFO) logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
让我们从一个示例数据框 (DataFrame) 开始¶
这里我们加载一个包含城市和人口对的非常简单的数据框,并在其上运行 `PandasQueryEngine`。
通过设置 `verbose=True`,我们可以看到中间生成的指令。
In [ ]
已复制!
# Test on some sample data
df = pd.DataFrame(
{
"city": ["Toronto", "Tokyo", "Berlin"],
"population": [2930000, 13960000, 3645000],
}
)
# Test on some sample data df = pd.DataFrame( { "city": ["Toronto", "Tokyo", "Berlin"], "population": [2930000, 13960000, 3645000], } )
In [ ]
已复制!
query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True)
In [ ]
已复制!
response = query_engine.query(
"What is the city with the highest population?",
)
response = query_engine.query( "What is the city with the highest population?", )
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" > Pandas Instructions: ``` df['city'][df['population'].idxmax()] ``` > Pandas Output: Tokyo
In [ ]
已复制!
display(Markdown(f"<b>{response}</b>"))
display(Markdown(f"{response}"))
Tokyo
In [ ]
已复制!
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
# get pandas python instructions print(response.metadata["pandas_instruction_str"])
df['city'][df['population'].idxmax()]
我们还可以利用 LLM 来合成响应。
In [ ]
已复制!
query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)
response = query_engine.query(
"What is the city with the highest population? Give both the city and population",
)
print(str(response))
query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True) response = query_engine.query( "What is the city with the highest population? Give both the city and population", ) print(str(response))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" > Pandas Instructions: ``` df.loc[df['population'].idxmax()] ``` > Pandas Output: city Tokyo population 13960000 Name: 1, dtype: object INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" The city with the highest population is Tokyo, with a population of 13,960,000.
分析泰坦尼克号数据集 (Titanic Dataset)¶
泰坦尼克号数据集是机器学习入门中最流行的数据集之一 来源: https://www.kaggle.com/c/titanic
下载数据¶
In [ ]
已复制!
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'
--2024-01-13 17:45:15-- https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 57726 (56K) [text/plain] Saving to: ‘titanic_train.csv’ titanic_train.csv 100%[===================>] 56.37K --.-KB/s in 0.009s 2024-01-13 17:45:15 (6.45 MB/s) - ‘titanic_train.csv’ saved [57726/57726]
In [ ]
已复制!
df = pd.read_csv("./titanic_train.csv")
df = pd.read_csv("./titanic_train.csv")
In [ ]
已复制!
query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True)
In [ ]
已复制!
response = query_engine.query(
"What is the correlation between survival and age?",
)
response = query_engine.query( "What is the correlation between survival and age?", )
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" > Pandas Instructions: ``` df['survived'].corr(df['age']) ``` > Pandas Output: -0.07722109457217755
In [ ]
已复制!
display(Markdown(f"<b>{response}</b>"))
display(Markdown(f"{response}"))
-0.07722109457217755
In [ ]
已复制!
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])
# get pandas python instructions print(response.metadata["pandas_instruction_str"])
df['survived'].corr(df['age'])
In [ ]
已复制!
from llama_index.core import PromptTemplate
from llama_index.core import PromptTemplate
In [ ]
已复制!
query_engine = PandasQueryEngine(df=df, verbose=True)
prompts = query_engine.get_prompts()
print(prompts["pandas_prompt"].template)
query_engine = PandasQueryEngine(df=df, verbose=True) prompts = query_engine.get_prompts() print(prompts["pandas_prompt"].template)
You are working with a pandas dataframe in Python. The name of the dataframe is `df`. This is the result of `print(df.head())`: {df_str} Follow these instructions: {instruction_str} Query: {query_str} Expression:
In [ ]
已复制!
print(prompts["response_synthesis_prompt"].template)
print(prompts["response_synthesis_prompt"].template)
Given an input question, synthesize a response from the query results. Query: {query_str} Pandas Instructions (optional): {pandas_instructions} Pandas Output: {pandas_output} Response:
您也可以更新提示词
In [ ]
已复制!
new_prompt = PromptTemplate(
"""\
You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}
Follow these instructions:
{instruction_str}
Query: {query_str}
Expression: """
)
query_engine.update_prompts({"pandas_prompt": new_prompt})
new_prompt = PromptTemplate( """\ You are working with a pandas dataframe in Python. The name of the dataframe is `df`. This is the result of `print(df.head())`: {df_str} Follow these instructions: {instruction_str} Query: {query_str} Expression: """ ) query_engine.update_prompts({"pandas_prompt": new_prompt})
这是指令字符串(您可以在初始化时通过传入 `instruction_str` 来定制)
In [ ]
已复制!
instruction_str = """\
1. Convert the query to executable Python code using Pandas.
2. The final line of code should be a Python expression that can be called with the `eval()` function.
3. The code should represent a solution to the query.
4. PRINT ONLY THE EXPRESSION.
5. Do not quote the expression.
"""
instruction_str = """\ 1. Convert the query to executable Python code using Pandas. 2. The final line of code should be a Python expression that can be called with the `eval()` function. 3. The code should represent a solution to the query. 4. PRINT ONLY THE EXPRESSION. 5. Do not quote the expression. """
使用查询管道语法实现查询引擎¶
如果您想学习如何使用我们的查询管道 (Query Pipeline) 语法和上面的提示词组件来构建自己的 Pandas 查询引擎,请查阅下面的教程。