OpenAI Pydantic Program

This guide shows you how to generate structured data with LlamaIndex using the OpenAI function calling API. The user only needs to specify a Pydantic object.

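We demonstrate two settings:

Extraction into an Album object (which can contain a list of Song objects)
Extraction into a DirectoryTree object (which can contain recursive Node objects)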

Extraction into `Album`

This is a simple example of parsing an output into an Album schema, which can contain multiple songs.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-llms-openai
%pip install llama-index-program-openai

%pip install llama-index

from pydantic import BaseModel
from typing import List

from llama_index.program.openai import OpenAIPydanticProgram

Without a docstring in the model

Define the output schema (without docstrings).

class Song(BaseModel):
    title: str
    length_seconds: int

class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]

Define the OpenAI Pydantic program.

prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
)
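
The `{movie_name}` placeholder in the template is filled from the keyword arguments passed when the program is called. As a rough illustration only, the sketch below uses Python's built-in `str.format` as a stand-in for that substitution. It does not call LlamaIndex or OpenAI, and the library's own template handling may differ in detail.

```python
# The backslash after the opening quotes and at each line end joins the
# lines, so the template is a single line of text with no newlines.
prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""

# Plain str.format stands in for the library's template rendering here.
rendered = prompt_template_str.format(movie_name="The Shining")
print(rendered)
```

Printing the rendered string is a quick way to check exactly what text a template produces before spending an API call on it.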

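Run the program to get structured output. Because the models here carry no docstring, the call also passes a `description` argument.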

output = program(
    movie_name="The Shining", description="Data model for an album."
)

Function call: Album with args: {
  "name": "The Shining",
  "artist": "Various Artists",
  "songs": [
    {
      "title": "Main Title",
      "length_seconds": 180
    },
    {
      "title": "Opening Credits",
      "length_seconds": 120
    },
    {
      "title": "The Overlook Hotel",
      "length_seconds": 240
    },
    {
      "title": "Redrum",
      "length_seconds": 150
    },
    {
      "title": "Here's Johnny!",
      "length_seconds": 200
    }
  ]
}

With a docstring in the model

class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int

class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]
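
Pydantic copies a model's docstring into the `description` entry of its JSON schema, which is presumably the text a function definition built from the model would carry. The sketch below only inspects the schema locally. The `json_schema` helper is a hypothetical convenience added here to cover both Pydantic v1 and v2, and how `OpenAIPydanticProgram` consumes the description is not shown in this guide.

```python
from typing import List

from pydantic import BaseModel


class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]


def json_schema(model):
    # Hypothetical helper: Pydantic v2 exposes model_json_schema(),
    # while v1 exposes schema().
    if hasattr(model, "model_json_schema"):
        return model.model_json_schema()
    return model.schema()


album_schema = json_schema(Album)
# The class docstring becomes the top-level "description" of the schema.
print(album_schema["description"])
print(sorted(album_schema["properties"]))
```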

prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
)

Run the program to get structured output.

output = program(movie_name="The Shining")

Function call: Album with args: {
  "name": "The Shining",
  "artist": "Various Artists",
  "songs": [
    {
      "title": "Main Title",
      "length_seconds": 180
    },
    {
      "title": "Opening Credits",
      "length_seconds": 120
    },
    {
      "title": "The Overlook Hotel",
      "length_seconds": 240
    },
    {
      "title": "Redrum",
      "length_seconds": 150
    },
    {
      "title": "Here's Johnny",
      "length_seconds": 200
    }
  ]
}

The output is a valid Pydantic object that we can then use to call functions/APIs.

output

Album(name='The Shining', artist='Various Artists', songs=[Song(title='Main Title', length_seconds=180), Song(title='Opening Credits', length_seconds=120), Song(title='The Overlook Hotel', length_seconds=240), Song(title='Redrum', length_seconds=150), Song(title="Here's Johnny", length_seconds=200)])
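
Since the result is a typed object, downstream code can work with attributes rather than raw JSON. The following is a minimal sketch: the album is rebuilt by hand from the output shown above so that no API call is needed, and `total_length_seconds` is a hypothetical downstream function, not part of LlamaIndex.

```python
from typing import List

from pydantic import BaseModel


class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]


def total_length_seconds(album: Album) -> int:
    """Hypothetical downstream function that consumes a typed Album."""
    return sum(song.length_seconds for song in album.songs)


# Rebuilt by hand from the output above, standing in for `output`.
album = Album(
    name="The Shining",
    artist="Various Artists",
    songs=[
        Song(title="Main Title", length_seconds=180),
        Song(title="Opening Credits", length_seconds=120),
        Song(title="The Overlook Hotel", length_seconds=240),
        Song(title="Redrum", length_seconds=150),
        Song(title="Here's Johnny", length_seconds=200),
    ],
)

total = total_length_seconds(album)
print(total)  # 890
```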

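Stream partial intermediate Pydantic objects

Instead of waiting for the function call to generate the entire JSON, we can use the program's `stream_partial_objects()` method to stream valid intermediate instances of the Pydantic output class as soon as they become available 🔥

First, define the output Pydantic class.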

from pydantic import BaseModel, Field

class CharacterInfo(BaseModel):
    """Information about a character."""

    character_name: str
    name: str = Field(..., description="Name of the actor/actress")
    hometown: str

class Characters(BaseModel):
    """List of characters."""

    characters: list[CharacterInfo] = Field(default_factory=list)

Now we initialize the program with a prompt template.

from llama_index.program.openai import OpenAIPydanticProgram

prompt_template_str = "Information about 3 characters from the movie: {movie}"

program = OpenAIPydanticProgram.from_defaults(
    output_cls=Characters, prompt_template_str=prompt_template_str
)

Finally, we stream the partial objects with the `stream_partial_objects()` method.

for partial_object in program.stream_partial_objects(movie="Harry Potter"):
    # send the partial object to the frontend for better user experience
    print(partial_object)
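
When forwarding partial objects to a frontend, it can help to send only what is new in each snapshot. The sketch below assumes that every partial object contains all characters parsed so far and that entries do not change once they appear; neither assumption is confirmed by this guide. `fake_partial_stream` is a hypothetical stand-in for `program.stream_partial_objects()` with placeholder data, so nothing here calls OpenAI.

```python
from typing import Iterator, List

from pydantic import BaseModel, Field


class CharacterInfo(BaseModel):
    """Information about a character."""

    character_name: str
    name: str = Field(..., description="Name of the actor/actress")
    hometown: str


class Characters(BaseModel):
    """List of characters."""

    characters: List[CharacterInfo] = Field(default_factory=list)


def fake_partial_stream() -> Iterator[Characters]:
    """Hypothetical stand-in for program.stream_partial_objects()."""
    seen: List[CharacterInfo] = []
    for label in ("A", "B", "C"):  # placeholder data, not model output
        seen.append(
            CharacterInfo(
                character_name=f"Character {label}",
                name=f"Actor {label}",
                hometown=f"Town {label}",
            )
        )
        # Each snapshot holds every character parsed so far.
        yield Characters(characters=list(seen))


def new_characters(stream: Iterator[Characters]) -> Iterator[CharacterInfo]:
    """Yield each character once, the first time a snapshot contains it."""
    emitted = 0
    for partial in stream:
        for info in partial.characters[emitted:]:
            yield info
        emitted = len(partial.characters)


names = [info.character_name for info in new_characters(fake_partial_stream())]
print(names)
```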

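Extracting a list of `Album` (with parallel function calling)

With OpenAI's parallel function calling feature, we can extract multiple structured data objects from a single prompt at once!

To do this, we need to:

pick a model that supports parallel function calling (e.g. gpt-3.5-turbo-1106), and
set allow_multiple to True in our OpenAIPydanticProgram (if not, it will only return the first object and raise a warning).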

from llama_index.llms.openai import OpenAI

prompt_template_str = """\
Generate 4 albums about spring, summer, fall, and winter.
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    llm=OpenAI(model="gpt-3.5-turbo-1106"),
    prompt_template_str=prompt_template_str,
    allow_multiple=True,
    verbose=True,
)

output = program()

Function call: Album with args: {"name": "Spring", "artist": "Various Artists", "songs": [{"title": "Blossom", "length_seconds": 180}, {"title": "Sunshine", "length_seconds": 240}, {"title": "Renewal", "length_seconds": 200}]}
Function call: Album with args: {"name": "Summer", "artist": "Beach Boys", "songs": [{"title": "Beach Party", "length_seconds": 220}, {"title": "Heatwave", "length_seconds": 260}, {"title": "Vacation", "length_seconds": 180}]}
Function call: Album with args: {"name": "Fall", "artist": "Autumn Leaves", "songs": [{"title": "Golden Days", "length_seconds": 210}, {"title": "Harvest Moon", "length_seconds": 240}, {"title": "Crisp Air", "length_seconds": 190}]}
Function call: Album with args: {"name": "Winter", "artist": "Snowflakes", "songs": [{"title": "Frosty Morning", "length_seconds": 190}, {"title": "Snowfall", "length_seconds": 220}, {"title": "Cozy Nights", "length_seconds": 250}]}

The output is a list of valid Pydantic objects.

output

[Album(name='Spring', artist='Various Artists', songs=[Song(title='Blossom', length_seconds=180), Song(title='Sunshine', length_seconds=240), Song(title='Renewal', length_seconds=200)]),
 Album(name='Summer', artist='Beach Boys', songs=[Song(title='Beach Party', length_seconds=220), Song(title='Heatwave', length_seconds=260), Song(title='Vacation', length_seconds=180)]),
 Album(name='Fall', artist='Autumn Leaves', songs=[Song(title='Golden Days', length_seconds=210), Song(title='Harvest Moon', length_seconds=240), Song(title='Crisp Air', length_seconds=190)]),
 Album(name='Winter', artist='Snowflakes', songs=[Song(title='Frosty Morning', length_seconds=190), Song(title='Snowfall', length_seconds=220), Song(title='Cozy Nights', length_seconds=250)])]
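
Because the program returns a single object in one configuration and a list in the other, calling code may want to normalize the result before iterating. A minimal sketch, assuming the return value is either one `Album` or a list of them; `as_album_list` is a hypothetical helper, and the two albums are copied from the output above rather than produced by an API call.

```python
from typing import List, Union

from pydantic import BaseModel


class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]


def as_album_list(output: Union[Album, List[Album]]) -> List[Album]:
    """Return a list whether the program produced one Album or several."""
    return output if isinstance(output, list) else [output]


# Two albums copied from the output above, standing in for a real result.
spring = Album(
    name="Spring",
    artist="Various Artists",
    songs=[
        Song(title="Blossom", length_seconds=180),
        Song(title="Sunshine", length_seconds=240),
        Song(title="Renewal", length_seconds=200),
    ],
)
summer = Album(
    name="Summer",
    artist="Beach Boys",
    songs=[
        Song(title="Beach Party", length_seconds=220),
        Song(title="Heatwave", length_seconds=260),
        Song(title="Vacation", length_seconds=180),
    ],
)

names_from_list = [album.name for album in as_album_list([spring, summer])]
names_from_single = [album.name for album in as_album_list(spring)]
print(names_from_list, names_from_single)
```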

Extraction into `Album` (streaming)

We also support streaming a list of objects through the `stream_list` function.

Full credit for this idea goes to the openai_function_call repo: https://github.com/jxnl/openai_function_call/tree/main/examples/streaming_multitask

prompt_template_str = "{input_str}"
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=False,
)

output = program.stream_list(
    input_str="make up 5 random albums",
)
for obj in output:
    print(obj.json(indent=2))
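
One practical benefit of streaming a list is that the consumer can stop early. The sketch below assumes `stream_list` yields objects lazily, one at a time, which this guide implies but does not state. `fake_stream_list` and `AlbumStub` are hypothetical stand-ins so the example runs without an API call.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, List


@dataclass
class AlbumStub:
    """Plain stand-in for the Album model so the sketch needs no API call."""

    name: str
    artist: str


produced: List[int] = []  # records which items the generator actually built


def fake_stream_list(count: int) -> Iterator[AlbumStub]:
    """Hypothetical stand-in for program.stream_list()."""
    for i in range(1, count + 1):
        produced.append(i)
        yield AlbumStub(name=f"Album {i}", artist=f"Artist {i}")


# Take only the first two albums; the generator is never asked for a third.
first_two = list(islice(fake_stream_list(5), 2))
print([album.name for album in first_two])
print(produced)
```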

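Extraction into `DirectoryTree` object

This is directly inspired by jxnl's excellent repo: https://github.com/jxnl/openai_function_call

That repository shows how to use OpenAI's function API to parse recursive Pydantic objects. The main requirement is to "wrap" the recursive Pydantic object in a non-recursive one.

Here we show an example in a "directory" setting, where a DirectoryTree object wraps recursive Node objects to parse a file structure.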

# NOTE: defining recursive objects in a notebook causes errors
from directory import DirectoryTree, Node

DirectoryTree.schema()

{'title': 'DirectoryTree',
 'description': 'Container class representing a directory tree.\n\nArgs:\n    root (Node): The root node of the tree.',
 'type': 'object',
 'properties': {'root': {'title': 'Root',
   'description': 'Root folder of the directory tree',
   'allOf': [{'$ref': '#/definitions/Node'}]}},
 'required': ['root'],
 'definitions': {'NodeType': {'title': 'NodeType',
   'description': 'Enumeration representing the types of nodes in a filesystem.',
   'enum': ['file', 'folder'],
   'type': 'string'},
  'Node': {'title': 'Node',
   'description': 'Class representing a single node in a filesystem. Can be either a file or a folder.\nNote that a file cannot have children, but a folder can.\n\nArgs:\n    name (str): The name of the node.\n    children (List[Node]): The list of child nodes (if any).\n    node_type (NodeType): The type of the node, either a file or a folder.',
   'type': 'object',
   'properties': {'name': {'title': 'Name',
     'description': 'Name of the folder',
     'type': 'string'},
    'children': {'title': 'Children',
     'description': 'List of children nodes, only applicable for folders, files cannot have children',
     'type': 'array',
     'items': {'$ref': '#/definitions/Node'}},
    'node_type': {'description': 'Either a file or folder, use the name to determine which it could be',
     'default': 'file',
     'allOf': [{'$ref': '#/definitions/NodeType'}]}},
   'required': ['name']}}}
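
The `directory` module imported above is not shown in this guide. The printed schema suggests definitions along the following lines. This is a reconstruction from that schema with shortened docstrings, not the module's actual source, and the Pydantic v1/v2 compatibility branch is an addition for illustration.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field


class NodeType(str, Enum):
    """Enumeration representing the types of nodes in a filesystem."""

    FILE = "file"
    FOLDER = "folder"


class Node(BaseModel):
    """A single node in a filesystem, either a file or a folder."""

    name: str = Field(..., description="Name of the folder")
    children: List["Node"] = Field(
        default_factory=list,
        description="List of children nodes, only applicable for folders",
    )
    node_type: NodeType = Field(
        default=NodeType.FILE,
        description="Either a file or folder",
    )


# Resolve the self-reference in Node (v2: model_rebuild, v1: update_forward_refs).
if hasattr(Node, "model_rebuild"):
    Node.model_rebuild()
else:
    Node.update_forward_refs(Node=Node)


class DirectoryTree(BaseModel):
    """Non-recursive container that wraps the recursive Node."""

    root: Node = Field(..., description="Root folder of the directory tree")


# Nested dicts are validated into Node instances at every level.
tree = DirectoryTree(
    root={
        "name": "root",
        "node_type": "folder",
        "children": [{"name": "file1.txt"}],
    }
)
print(tree.root.children[0].node_type)
```

Only `Node` refers to itself; `DirectoryTree` simply holds one `Node`, which matches the "wrap the recursive object in a non-recursive one" requirement described earlier.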

program = OpenAIPydanticProgram.from_defaults(
    output_cls=DirectoryTree,
    prompt_template_str="{input_str}",
    verbose=True,
)

input_str = """
root
├── folder1
│   ├── file1.txt
│   └── file2.txt
└── folder2
    ├── file3.txt
    └── subfolder1
        └── file4.txt
"""

output = program(input_str=input_str)

Function call: DirectoryTree with args: {
  "root": {
    "name": "root",
    "children": [
      {
        "name": "folder1",
        "children": [
          {
            "name": "file1.txt",
            "children": [],
            "node_type": "file"
          },
          {
            "name": "file2.txt",
            "children": [],
            "node_type": "file"
          }
        ],
        "node_type": "folder"
      },
      {
        "name": "folder2",
        "children": [
          {
            "name": "file3.txt",
            "children": [],
            "node_type": "file"
          },
          {
            "name": "subfolder1",
            "children": [
              {
                "name": "file4.txt",
                "children": [],
                "node_type": "file"
              }
            ],
            "node_type": "folder"
          }
        ],
        "node_type": "folder"
      }
    ],
    "node_type": "folder"
  }
}

The output is a full DirectoryTree structure with recursive Node objects.

output

DirectoryTree(root=Node(name='root', children=[Node(name='folder1', children=[Node(name='file1.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='file2.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>), Node(name='folder2', children=[Node(name='file3.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='subfolder1', children=[Node(name='file4.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>))
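
A recursive result like this is usually consumed with a recursive function. As a sketch, the walker below lists every file path in the tree. `NodeStub` is a plain dataclass standing in for `Node` so the example runs without the `directory` module, and `iter_file_paths` is a hypothetical helper that relies only on the `name`, `children`, and `node_type` attributes visible in the output above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class NodeStub:
    """Stand-in exposing the same attributes as Node."""

    name: str
    node_type: str = "file"
    children: List["NodeStub"] = field(default_factory=list)


def iter_file_paths(node, prefix: str = "") -> Iterator[str]:
    """Yield the path of every file below `node`, depth first."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    # Accept either a plain string or an enum member such as NodeType.FILE.
    kind = getattr(node.node_type, "value", node.node_type)
    if kind == "file":
        yield path
    for child in node.children:
        yield from iter_file_paths(child, path)


# Same structure as the parsed directory tree above.
root = NodeStub("root", "folder", [
    NodeStub("folder1", "folder", [NodeStub("file1.txt"), NodeStub("file2.txt")]),
    NodeStub("folder2", "folder", [
        NodeStub("file3.txt"),
        NodeStub("subfolder1", "folder", [NodeStub("file4.txt")]),
    ]),
])

paths = list(iter_file_paths(root))
for p in paths:
    print(p)
```

With the real objects, the call would presumably be `iter_file_paths(output.root)`.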
