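DataFrame Structured Data Extraction

This demo shows how to extract tabular DataFrames from raw text.

It was directly inspired by jxnl's DataFrame example here: https://github.com/jxnl/openai_function_call/blob/main/auto_dataframe.py.

We show examples of varying complexity, all backed by the OpenAI Function API:

(more code) How to build your own extractor with our OpenAIPydanticProgram
(less code) Using our out-of-the-box DFFullProgram and DFRowsProgram objects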

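Build a DF Extractor Yourself (Using OpenAIPydanticProgram)

Our OpenAIPydanticProgram is a wrapper around an OpenAI LLM that supports function calling; it returns structured output as a Pydantic object.

We import our DataFrame and DataFrameRowsOnly objects.

To create an output extractor, you only need to 1) specify the relevant Pydantic object and 2) add the right prompt.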
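To make the idea of "structured output as an object" concrete, here is a minimal sketch of the shape such a result takes. It is not the library's own code: the class names `ColumnSketch`, `RowSketch` and `DataFrameSketch` and the helper `parse_function_args` are hypothetical, and standard-library dataclasses stand in for the Pydantic models. Only the field names (`columns`, `rows`, `column_name`, `column_desc`, `row_values`, `description`) are taken from the output printed later in this notebook.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional


# Hypothetical stand-ins for the Pydantic objects; field names mirror the
# output printed later in this notebook.
@dataclass
class ColumnSketch:
    column_name: str
    column_desc: Optional[str] = None


@dataclass
class RowSketch:
    row_values: List[Any] = field(default_factory=list)


@dataclass
class DataFrameSketch:
    columns: List[ColumnSketch]
    rows: List[RowSketch]
    description: Optional[str] = None


def parse_function_args(raw_json: str) -> DataFrameSketch:
    """Turn the JSON arguments of a function call into typed objects."""
    payload = json.loads(raw_json)
    return DataFrameSketch(
        columns=[ColumnSketch(**col) for col in payload["columns"]],
        rows=[RowSketch(**row) for row in payload["rows"]],
        description=payload.get("description"),
    )


raw_args = """
{
  "columns": [
    {"column_name": "Name", "column_desc": "Name of the person"},
    {"column_name": "Age", "column_desc": "Age of the person"}
  ],
  "rows": [
    {"row_values": ["John", 25]},
    {"row_values": ["Mike", 30]}
  ]
}
"""

parsed = parse_function_args(raw_args)
print([col.column_name for col in parsed.columns])
print([row.row_values for row in parsed.rows])
```

In the real program, the Pydantic class both describes the function schema sent to the model and validates what comes back; the sketch only illustrates the second half.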

If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]

%pip install llama-index-llms-openai
%pip install llama-index-program-openai

In [ ]

!pip install llama-index

In [ ]

from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.program import (
    DFFullProgram,
    DataFrame,
    DataFrameRowsOnly,
)
from llama_index.llms.openai import OpenAI

In [ ]

program = OpenAIPydanticProgram.from_defaults(
    output_cls=DataFrame,
    llm=OpenAI(temperature=0, model="gpt-4-0613"),
    prompt_template_str=(
        "Please extract the following query into a structured data according"
        " to: {input_str}. Please extract both the set of column names and a"
        " set of rows."
    ),
    verbose=True,
)

In [ ]

# NOTE: the test example is taken from jxnl's repo

response_obj = program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)
response_obj

Function call: DataFrame with args: {
  "columns": [
    {
      "column_name": "Name",
      "column_desc": "Name of the person"
    },
    {
      "column_name": "Age",
      "column_desc": "Age of the person"
    },
    {
      "column_name": "City",
      "column_desc": "City where the person lives"
    },
    {
      "column_name": "Hobby",
      "column_desc": "What the person likes to do"
    }
  ],
  "rows": [
    {
      "row_values": ["John", 25, "New York", "play basketball"]
    },
    {
      "row_values": ["Mike", 30, "San Francisco", "play baseball"]
    },
    {
      "row_values": ["Sarah", 20, "Los Angeles", "play tennis"]
    },
    {
      "row_values": ["Mary", 35, "Chicago", "play tennis"]
    }
  ]
}

Out[ ]

DataFrame(description=None, columns=[DataFrameColumn(column_name='Name', column_desc='Name of the person'), DataFrameColumn(column_name='Age', column_desc='Age of the person'), DataFrameColumn(column_name='City', column_desc='City where the person lives'), DataFrameColumn(column_name='Hobby', column_desc='What the person likes to do')], rows=[DataFrameRow(row_values=['John', 25, 'New York', 'play basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'play baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'play tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', 'play tennis'])])
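The object printed above keeps column definitions and row values separately. As a rough sketch of how such a result could be flattened into one record per person, the snippet below pairs each row with the column names. The function name `to_records` is hypothetical, and plain Python lists stand in for the Pydantic objects; the values are copied from the function-call output above.

```python
from typing import Any, Dict, List


def to_records(
    column_names: List[str], rows: List[List[Any]]
) -> List[Dict[str, Any]]:
    """Pair each row's values with the column names, in order."""
    records = []
    for index, values in enumerate(rows):
        # A row with the wrong number of values cannot be aligned safely.
        if len(values) != len(column_names):
            raise ValueError(
                f"row {index} has {len(values)} values, "
                f"expected {len(column_names)}"
            )
        records.append(dict(zip(column_names, values)))
    return records


# Values copied from the function-call output above.
column_names = ["Name", "Age", "City", "Hobby"]
rows = [
    ["John", 25, "New York", "play basketball"],
    ["Mike", 30, "San Francisco", "play baseball"],
    ["Sarah", 20, "Los Angeles", "play tennis"],
    ["Mary", 35, "Chicago", "play tennis"],
]

records = to_records(column_names, rows)
print(records[0])
```

Note that the recorded output gives Mary the hobby "play tennis", although the input text states no hobby for her, so it is worth checking extracted values against the source text.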

In [ ]

program = OpenAIPydanticProgram.from_defaults(
    output_cls=DataFrameRowsOnly,
    llm=OpenAI(temperature=0, model="gpt-4-0613"),
    prompt_template_str=(
        "Please extract the following text into a structured data:"
        " {input_str}. The column names are the following: ['Name', 'Age',"
        " 'City', 'Favorite Sport']. Do not specify additional parameters that"
        " are not in the function schema. "
    ),
    verbose=True,
)

In [ ]

program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)

Function call: DataFrameRowsOnly with args: {
  "rows": [
    {
      "row_values": ["John", 25, "New York", "basketball"]
    },
    {
      "row_values": ["Mike", 30, "San Francisco", "baseball"]
    },
    {
      "row_values": ["Sarah", 20, "Los Angeles", "tennis"]
    },
    {
      "row_values": ["Mary", 35, "Chicago", ""]
    }
  ]
}

Out[ ]

DataFrameRowsOnly(rows=[DataFrameRow(row_values=['John', 25, 'New York', 'basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', ''])])
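In the rows-only output above, Mary's last value is an empty string because the text never mentions her favorite sport. One possible way to surface such gaps before using the data is sketched below; the helper `find_missing` is hypothetical and works on plain lists copied from the output rather than on the returned object.

```python
from typing import Any, List, Tuple

# Column names as given in the prompt above.
COLUMNS = ["Name", "Age", "City", "Favorite Sport"]


def find_missing(
    rows: List[List[Any]], columns: List[str]
) -> List[Tuple[int, str]]:
    """Return (row index, column name) for every empty or None value."""
    missing = []
    for index, values in enumerate(rows):
        for column, value in zip(columns, values):
            if value is None or value == "":
                missing.append((index, column))
    return missing


# Values copied from the function-call output above.
rows = [
    ["John", 25, "New York", "basketball"],
    ["Mike", 30, "San Francisco", "baseball"],
    ["Sarah", 20, "Los Angeles", "tennis"],
    ["Mary", 35, "Chicago", ""],
]

gaps = find_missing(rows, COLUMNS)
print(gaps)
```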

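Using our DataFrame Programs

We provide convenience wrappers in the form of DFFullProgram and DFRowsProgram. These offer a simpler interface for creating the object than specifying every detail through OpenAIPydanticProgram.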
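To give a feel for what such a wrapper saves you from writing, the sketch below derives a prompt template from a list of column names, mirroring the handwritten prompt used earlier in this notebook. This is only an illustration: `build_rows_prompt` is a hypothetical helper, and the actual internals of DFRowsProgram are not shown in this notebook and may differ.

```python
from typing import List


def build_rows_prompt(column_names: List[str]) -> str:
    """Build a prompt template that fixes the column names up front.

    Hypothetical helper: the input_str placeholder is left unformatted so
    the text to parse can be filled in later.
    """
    return (
        "Please extract the following text into a structured data:"
        " {input_str}. "
        f"The column names are the following: {column_names}."
    )


template = build_rows_prompt(["Name", "Age", "City", "Favorite Sport"])
print(template.format(input_str="Sarah is 20 years old"))
```

With an existing DataFrame, the column names could come from `list(df.columns)`, which is how a schema defined once in pandas can drive the prompt.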

In [ ]

from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.program import DFFullProgram, DFRowsProgram
import pandas as pd

# initialize empty df
df = pd.DataFrame(
    {
        "Name": pd.Series(dtype="str"),
        "Age": pd.Series(dtype="int"),
        "City": pd.Series(dtype="str"),
        "Favorite Sport": pd.Series(dtype="str"),
    }
)

# initialize program, using existing df as schema
df_rows_program = DFRowsProgram.from_defaults(
    pydantic_program_cls=OpenAIPydanticProgram, df=df
)

In [ ]

# parse text, using existing df as schema
result_obj = df_rows_program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)

In [ ]

result_obj.to_df(existing_df=df)

/Users/jerryliu/Programming/gpt_index/llama_index/program/predefined/df.py:65: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  return existing_df.append(new_df, ignore_index=True)

Out[ ]

	Name	Age	City	Favorite Sport
0	John	25	New York	basketball
1	Mike	30	San Francisco	baseball
2	Sarah	20	Los Angeles	tennis
3	Mary	35	Chicago
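The FutureWarning in the output above recommends `pandas.concat` in place of `DataFrame.append`, which was removed in pandas 2.0. Whether the installed version of the library still calls `append` is not something this notebook shows, so the following is only a sketch of the equivalent operation on two small, made-up frames.

```python
import pandas as pd

# Two illustrative frames with the same columns as the example above.
existing_df = pd.DataFrame(
    {
        "Name": ["John"],
        "Age": [25],
        "City": ["New York"],
        "Favorite Sport": ["basketball"],
    }
)
new_df = pd.DataFrame(
    {
        "Name": ["Mike"],
        "Age": [30],
        "City": ["San Francisco"],
        "Favorite Sport": ["baseball"],
    }
)

# Equivalent of existing_df.append(new_df, ignore_index=True).
combined = pd.concat([existing_df, new_df], ignore_index=True)
print(combined)
```

`ignore_index=True` renumbers the rows from zero, matching the behavior of the deprecated call quoted in the warning.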

In [ ]

# initialize program that can do joint schema extraction and structured data extraction
df_full_program = DFFullProgram.from_defaults(
    pydantic_program_cls=OpenAIPydanticProgram,
)

In [ ]

result_obj = df_full_program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)

In [ ]

result_obj.to_df()

Out[ ]

	Name	Age	Location	Hobby
0	John	25	New York	basketball
1	Mike	30	San Francisco	baseball
2	Sarah	20	Los Angeles	tennis
3	Mary	35	Chicago

In [ ]

# initialize empty df
df = pd.DataFrame(
    {
        "City": pd.Series(dtype="str"),
        "State": pd.Series(dtype="str"),
        "Population": pd.Series(dtype="int"),
    }
)

# initialize program, using existing df as schema
df_rows_program = DFRowsProgram.from_defaults(
    pydantic_program_cls=OpenAIPydanticProgram, df=df
)

In [ ]

input_text = """San Francisco is in California, has a population of 800,000. 
New York City is the most populous city in the United States. \
With a 2020 population of 8,804,190 distributed over 300.46 square miles (778.2 km2), \
New York City is the most densely populated major city in the United States.
New York City is in New York State.
Boston (US: /ˈbɔːstən/),[8] officially the City of Boston, is the capital and largest city of the Commonwealth of Massachusetts \
and the cultural and financial center of the New England region of the Northeastern United States. \
The city boundaries encompass an area of about 48.4 sq mi (125 km2)[9] and a population of 675,647 as of 2020.[4]
"""

# parse text, using existing df as schema
result_obj = df_rows_program(input_str=input_text)

In [ ]

new_df = result_obj.to_df(existing_df=df)
new_df

/Users/jerryliu/Programming/gpt_index/llama_index/program/predefined/df.py:65: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  return existing_df.append(new_df, ignore_index=True)

Out[ ]

	City	State	Population
0	San Francisco	California	800000
1	New York City	New York	8804190
2	Boston	Massachusetts	675647
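In this run the populations came back as plain integers even though the source text writes them with thousands separators ("8,804,190"). That is not guaranteed for every input, so a defensive post-processing step may be useful. The helper below is a hypothetical sketch, not part of the library.

```python
from typing import Union


def coerce_int(value: Union[int, str]) -> int:
    """Return an int, accepting strings with thousands separators."""
    if isinstance(value, int):
        return value
    # Drop commas and surrounding whitespace before converting.
    return int(value.replace(",", "").strip())


print(coerce_int("8,804,190"))
print(coerce_int(800000))
```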