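Query Pipeline over Pandas DataFrames¶
This is a simple example that builds a query pipeline able to perform structured operations over a Pandas DataFrame to satisfy a user query, using an LLM to infer the set of operations.
This can be treated as the "from-scratch" version of our PandasQueryEngine.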
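WARNING: This tool gives the LLM access to the eval function. Arbitrary code execution is possible on the machine running this tool. It is not recommended for use in production, and would require heavy sandboxing or a virtual machine.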
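To make the risk concrete, here is a small illustration that is not part of the pipeline itself. It shows how a plain string passed to Python's built-in `eval` can import a module and call into the operating system. The expression is a harmless stand-in chosen for this sketch (it only reads the current working directory); an LLM-generated string could just as easily do something destructive.

```python
# Illustration only: a string handed to eval() can import modules and
# reach the operating system. This stand-in expression is harmless:
# it only reads the current working directory.
untrusted_expression = "__import__('os').getcwd()"

result = eval(untrusted_expression)

# The string was executed as code and returned a filesystem path.
print(type(result).__name__)
```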
%pip install llama-index-llms-openai llama-index-experimental
from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
Download Data¶
Here we load the Titanic CSV dataset.
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'
--2024-01-13 18:39:07--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57726 (56K) [text/plain]
Saving to: ‘titanic_train.csv’

titanic_train.csv   100%[===================>]  56.37K  --.-KB/s    in 0.007s

2024-01-13 18:39:07 (7.93 MB/s) - ‘titanic_train.csv’ saved [57726/57726]
import pandas as pd
df = pd.read_csv("./titanic_train.csv")
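Define Modules¶
Here we define the set of modules:
- A Pandas prompt, to infer pandas instructions from the user query
- A Pandas output parser, to execute the pandas instructions on the dataframe and return the result
- A response synthesis prompt, to synthesize a final response from the parser output
- An LLM
The pandas output parser is specifically designed to safely execute Python code. It includes a number of safety checks that would be tedious to write from scratch. These include importing only from a set of approved modules (for example, modules that can alter the file system, such as os, are not allowed) and making sure that no private or dunder methods are called.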
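The two kinds of check described above can be sketched with Python's standard `ast` module. This is a minimal illustration under stated assumptions, not the parser's actual implementation: the `ALLOWED_IMPORTS` set and the `check_expression` helper are hypothetical names invented for this sketch, and the real checks may differ in scope and detail.

```python
import ast

# Hypothetical allow-list for this sketch; the real parser's rules may differ.
ALLOWED_IMPORTS = {"math", "numpy", "pandas"}


def check_expression(code):
    """Return a list of problems found in `code`; an empty list means it passed."""
    problems = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Reject `import x` unless the top-level package is approved.
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            # Reject `from x import y` unless the top-level package is approved.
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import from '{node.module}' is not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            # Reject private and dunder attribute access such as obj.__class__.
            problems.append(f"access to private attribute '{node.attr}'")
        elif isinstance(node, ast.Name) and node.id.startswith("_"):
            # Reject private and dunder names such as __import__.
            problems.append(f"use of private name '{node.id}'")
    return problems


print(check_expression("df['survived'].corr(df['age'])"))
print(check_expression("import os"))
print(check_expression("df.__class__.__bases__"))
```

A static check like this only inspects the syntax tree before anything runs. It narrows what an expression can do, but it is not a substitute for sandboxing.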
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")
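The cell above binds `instruction_str` and `df_str` ahead of time, leaving only `query_str` to be supplied per query. As a rough illustration of that idea using only plain Python string formatting (this is not how `PromptTemplate` is implemented), the sketch below pre-binds two values and fills in the third later. The `fixed` dictionary, the `format_prompt` helper, and the placeholder values are all hypothetical.

```python
# Shortened copy of the pandas prompt, used only for this sketch.
template = (
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

# Values known up front; placeholders stand in for the real df.head() output
# and instruction list.
fixed = {"df_str": "<df.head() output>", "instruction_str": "<instructions>"}


def format_prompt(query_str):
    # Merge the pre-bound values with the per-query value and fill the template.
    return template.format(**fixed, query_str=query_str)


prompt = format_prompt("What is the correlation between survival and age?")
print(prompt)
```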
Build Query Pipeline¶
The pipeline looks like this: input query_str -> pandas_prompt -> llm1 -> pandas_output_parser -> response_synthesis_prompt -> llm2
Additional connections into response_synthesis_prompt: llm1 -> pandas_instructions, and pandas_output_parser -> pandas_output.
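The data flow described above forms a small directed acyclic graph. As a sanity check on the execution order those connections imply, the sketch below encodes them as a dependency map and sorts it with the standard library's `graphlib`. It is only an illustration of the graph's shape and says nothing about how `QueryPipeline` schedules modules internally. The `dependencies` dictionary is written by hand for this sketch.

```python
from graphlib import TopologicalSorter

# Each key maps a module to the modules it receives input from,
# mirroring the chain and the extra links described above.
dependencies = {
    "pandas_prompt": {"input"},
    "llm1": {"pandas_prompt"},
    "pandas_output_parser": {"llm1"},
    "response_synthesis_prompt": {"input", "llm1", "pandas_output_parser"},
    "llm2": {"response_synthesis_prompt"},
}

# A topological order runs every module after all of its inputs are ready.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Because every module depends on the one before it in the main chain, there is only one valid order here, and it matches the chain described above.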
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")
Run Query¶
response = qp.run(
    query_str="What is the correlation between survival and age?",
)
> Running module input with input:
query_str: What is the correlation between survival and age?

> Running module pandas_prompt with input:
query_str: What is the correlation between survival and age?

> Running module llm1 with input:
messages: You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
survived pclass name ...

> Running module pandas_output_parser with input:
input: assistant: df['survived'].corr(df['age'])

> Running module response_synthesis_prompt with input:
query_str: What is the correlation between survival and age?
pandas_instructions: assistant: df['survived'].corr(df['age'])
pandas_output: -0.07722109457217755

> Running module llm2 with input:
messages: Given an input question, synthesize a response from the query results.
Query: What is the correlation between survival and age?

Pandas Instructions (optional):
df['survived'].corr(df['age'])

Pandas ...
print(response.message.content)
The correlation between survival and age is -0.0772. This indicates a weak negative correlation, suggesting that as age increases, the likelihood of survival slightly decreases.