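OpenAI JSON Mode vs. Function Calling for Data Extraction
OpenAI recently released JSON mode: a new configuration that constrains the LLM to generate only strings that parse as valid JSON (with no guarantee of validation against any schema).
Before this, the best way to extract structured data from text was function calling.
In this notebook, we explore the tradeoffs between the new JSON mode and function calling for structured output and extraction.
Update: OpenAI has clarified that JSON mode is always enabled for function calling and is opt-in for regular messages (https://community.openai.com/t/json-mode-vs-function-calling/476994/4)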
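To make the distinction between "parses as JSON" and "matches a schema" concrete, here is a minimal stdlib-only sketch. The `raw` string is a made-up model output and the field names in `REQUIRED_FIELDS` are illustrative assumptions, not something JSON mode itself knows about.

```python
import json

# Hypothetical model output: syntactically valid JSON, but incomplete
# relative to the fields a caller might require (assumed for illustration).
raw = '{"summary": "Short call about a widget.", "products": ["XYZ Widget"]}'

REQUIRED_FIELDS = {"summary", "products", "rep_name", "prospect_name", "action_items"}

parsed = json.loads(raw)  # succeeds, because the string is valid JSON
missing = sorted(REQUIRED_FIELDS - set(parsed))  # fields the caller still lacks

print("parsed OK:", isinstance(parsed, dict))
print("missing fields:", missing)
```

JSON mode only addresses the first step (the `json.loads` call not raising); the second check is left to the caller.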
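Generate synthetic data
We'll start by generating some synthetic data for our data extraction task. Let's ask the LLM to produce a hypothetical sales call transcript.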
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-1106")
response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)
transcript = response.text
print(transcript)
[Phone rings]
John: Hello, this is John.
Sarah: Hi John, this is Sarah from XYZ Company. I'm calling to discuss our new product, the XYZ Widget, and see if it might be a good fit for your business.
John: Hi Sarah, thanks for reaching out. I'm definitely interested in learning more about the XYZ Widget. Can you give me a quick overview of what it does?
Sarah: Of course! The XYZ Widget is a cutting-edge tool that helps businesses streamline their workflow and improve productivity. It's designed to automate repetitive tasks and provide real-time data analytics to help you make informed decisions.
John: That sounds really interesting. I can see how that could benefit our team. Do you have any case studies or success stories from other companies who have used the XYZ Widget?
Sarah: Absolutely, we have several case studies that I can share with you. I'll send those over along with some additional information about the product. I'd also love to schedule a demo for you and your team to see the XYZ Widget in action.
John: That would be great. I'll make sure to review the case studies and then we can set up a time for the demo. In the meantime, are there any specific action items or next steps we should take?
Sarah: Yes, I'll send over the information and then follow up with you to schedule the demo. In the meantime, feel free to reach out if you have any questions or need further information.
John: Sounds good, I appreciate your help Sarah. I'm looking forward to learning more about the XYZ Widget and seeing how it can benefit our business.
Sarah: Thank you, John. I'll be in touch soon. Have a great day!
John: You too, bye.
Set up the desired schema
Let's specify the desired output "shape" as a Pydantic model.
from pydantic import BaseModel, Field
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")
Data extraction with function calling
We can use the OpenAIPydanticProgram module in LlamaIndex to keep things simple: define a prompt template, then pass in the LLM and the Pydantic model we defined.
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)
output = program(transcript=transcript)
Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Review case studies","Schedule demo"]}
We now have the desired structured data as a Pydantic model. A quick inspection shows that the results match our expectations.
output.dict()
{'summary': 'Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.', 'products': ['XYZ Widget'], 'rep_name': 'Sarah', 'prospect_name': 'John', 'action_items': ['Review case studies', 'Schedule demo']}
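For context, function calling works by sending the model a JSON Schema wrapped in a tool definition and receiving the arguments back as a JSON string. The sketch below builds such a definition in the OpenAI chat-completions `tools` format using only the standard library. The helper `build_tool` and the abbreviated two-field schema are illustrative assumptions; whether OpenAIPydanticProgram constructs exactly this payload internally is not confirmed here.

```python
import json

# Abbreviated JSON Schema for the target shape (only two of the
# CallSummary fields are shown to keep the sketch short).
call_summary_schema = {
    "type": "object",
    "properties": {
        "summary": {
            "type": "string",
            "description": "High-level summary of the call transcript.",
        },
        "products": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["summary", "products"],
}


def build_tool(name, description, parameters):
    """Wrap a JSON Schema in an OpenAI-style `tools` entry (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


tool = build_tool("CallSummary", "Data model for a call summary.", call_summary_schema)

# Function-call arguments arrive as a JSON string, similar to the
# "Function call: CallSummary with args: ..." log line in the verbose output.
args = json.loads('{"summary": "Sarah pitched the XYZ Widget.", "products": ["XYZ Widget"]}')

print(json.dumps(tool, indent=2))
print(args["products"])
```

Because the schema travels as a structured tool definition rather than as prompt text, the model is less likely to confuse it with the data it should return.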
Data extraction with JSON mode
Let's try the same thing with JSON mode instead of function calling.
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON following the given schema below:\n"
                "{json_schema}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
messages = prompt.format_messages(
    json_schema=CallSummary.schema_json(), transcript=transcript
)
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
We get valid JSON, but it merely repeats the schema we specified and does not actually perform the extraction.
print(output)
{ "title": "CallSummary", "description": "Data model for a call summary.", "type": "object", "properties": { "summary": { "title": "Summary", "description": "High-level summary of the call transcript. Should not exceed 3 sentences.", "type": "string" }, "products": { "title": "Products", "description": "List of products discussed in the call", "type": "array", "items": { "type": "string" } }, "rep_name": { "title": "Rep Name", "description": "Name of the sales rep", "type": "string" }, "prospect_name": { "title": "Prospect Name", "description": "Name of the prospect", "type": "string" }, "action_items": { "title": "Action Items", "description": "List of action items", "type": "array", "items": { "type": "string" } } }, "required": ["summary", "products", "rep_name", "prospect_name", "action_items"] }
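This failure mode can be caught programmatically. The following is a rough heuristic, not a general solution: it flags an output that has JSON Schema keywords at the top level and none of the fields we asked for. The function name `looks_like_schema_echo` and the shortened sample strings are assumptions made for illustration.

```python
import json


def looks_like_schema_echo(raw_output, expected_fields):
    """Heuristic: True if the output resembles a JSON Schema rather than extracted data."""
    data = json.loads(raw_output)
    if not isinstance(data, dict):
        return False
    # A schema echo carries schema keywords at the top level...
    has_schema_keys = "properties" in data and "type" in data
    # ...and none of the data fields we actually asked for.
    has_expected = any(field in data for field in expected_fields)
    return has_schema_keys and not has_expected


expected = ["summary", "products", "rep_name", "prospect_name", "action_items"]

# Shortened stand-ins for the two kinds of output seen in this notebook.
echoed = '{"title": "CallSummary", "type": "object", "properties": {"summary": {"type": "string"}}}'
extracted = (
    '{"summary": "Sarah called John.", "products": ["XYZ Widget"], '
    '"rep_name": "Sarah", "prospect_name": "John", "action_items": []}'
)

print(looks_like_schema_echo(echoed, expected))
print(looks_like_schema_echo(extracted, expected))
```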
Let's try again by showing the desired JSON format directly instead of specifying the schema.
import json

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}
json_example = json.dumps(dict_example)
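The example dict here is written by hand, which duplicates the field descriptions already present in the Pydantic model. As a rough sketch, a similar placeholder could be derived from the model's JSON Schema instead. The helper `example_from_schema` is hypothetical and only handles string fields and arrays of strings, which is all CallSummary uses; in the notebook the schema could presumably come from `json.loads(CallSummary.schema_json())`, while the sketch below uses a shortened inline schema so it runs on its own.

```python
import json


def example_from_schema(schema):
    """Build a placeholder example dict from a flat JSON Schema (hypothetical helper).

    Strings become their description text; arrays become two numbered
    placeholder items. Nested objects are out of scope for this sketch.
    """
    example = {}
    for name, spec in schema["properties"].items():
        if spec.get("type") == "array":
            example[name] = [f"{name} 1", f"{name} 2"]
        else:
            example[name] = spec.get("description", name)
    return example


# Shortened stand-in for the CallSummary schema.
schema = {
    "properties": {
        "rep_name": {"type": "string", "description": "Name of the sales rep"},
        "products": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of products discussed in the call",
        },
    }
}

generated_example = example_from_schema(schema)
print(json.dumps(generated_example))
```

The generated text differs slightly from the hand-written version (for instance "products 1" rather than "product 1"), so this trades some prompt polish for keeping a single source of truth.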
messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
Now we get the extracted structured data as expected.
print(output)
{ "summary": "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, which is designed to streamline workflow and improve productivity. They discussed case studies and scheduling a demo for John and his team. The next steps include Sarah sending over information and following up to schedule the demo.", "products": ["XYZ Widget"], "rep_name": "Sarah", "prospect_name": "John", "action_items": ["Review case studies", "Schedule demo"] }
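Quick takeaways
- Function calling remains easier to use for structured data extraction (especially if you have already specified your schema as, for example, a Pydantic model).
- While JSON mode enforces that the output is JSON, it does not help with validation against a specified schema. Passing the schema in directly may not produce the expected JSON and may require additional careful formatting and prompting.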
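Since JSON mode leaves schema validation to the caller, one common pattern is to validate the output and retry with feedback when it does not fit. The sketch below is one possible shape for that loop using only the standard library. `call_llm` is a hypothetical callable standing in for the model call, and `fake_llm` simulates a schema echo followed by a good answer so the example runs offline.

```python
import json

# Expected top-level fields and their Python types (mirrors CallSummary).
REQUIRED = {
    "summary": str,
    "products": list,
    "rep_name": str,
    "prospect_name": str,
    "action_items": list,
}


def validate(raw):
    """Return (data, error); error is None when raw matches the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    for field, expected_type in REQUIRED.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return None, f"wrong type for field: {field}"
    return data, None


def extract_with_retry(call_llm, max_attempts=3):
    """call_llm (hypothetical) takes feedback text or None and returns a JSON string."""
    feedback, error = None, "no attempts made"
    for _ in range(max_attempts):
        data, error = validate(call_llm(feedback))
        if error is None:
            return data
        feedback = f"Previous output was rejected ({error}). Return only the requested fields."
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")


# Offline stand-in for the model: first a schema echo, then a valid extraction.
responses = iter([
    '{"title": "CallSummary", "type": "object", "properties": {}}',
    '{"summary": "Sarah pitched the XYZ Widget to John.", "products": ["XYZ Widget"], '
    '"rep_name": "Sarah", "prospect_name": "John", "action_items": ["Schedule demo"]}',
])


def fake_llm(feedback):
    return next(responses)


result = extract_with_retry(fake_llm)
print(result["rep_name"], result["action_items"])
```

In this notebook, `call_llm` could wrap the `llm.chat(messages, response_format={"type": "json_object"})` call and append the feedback to the messages; how best to phrase that feedback is left open.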