电子邮件数据提取¶
可以使用OpenAI函数从电子邮件中提取数据。这是使用LlamaIndex从非结构化内容中获取结构化数据的另一个示例。
本示例的主要目标是将原始电子邮件内容转换为易于解释的JSON格式,这体现了语言模型在数据提取中的实际应用。提取的结构化JSON数据随后可用于任何下游应用。
我们将使用如下图所示的示例电子邮件。这封电子邮件模仿了ARK Investment发送给其订阅者的典型日常通信。该示例电子邮件包含有关其交易所交易基金(ETF)下交易的详细信息。通过使用此特定示例,我们旨在展示如何有效地从实际电子邮件场景中提取和结构化复杂的财务数据,将其转换为易于理解的JSON格式。
在 [ ] 中
已复制!
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
%pip install llama-index-program-openai
%pip install llama-index-llms-openai %pip install llama-index-readers-file %pip install llama-index-program-openai # LlamaIndex !pip install llama-index # To get text conents from .eml and .msg file !pip install "unstructured[msg]"
在 [ ] 中
已复制!
# LlamaIndex
!pip install llama-index
# To get text conents from .eml and .msg file
!pip install "unstructured[msg]"
# LlamaIndex !pip install llama-index # 为了从 .eml 和 .msg 文件中获取文本内容 !pip install "unstructured[msg]"
启用日志记录并设置OpenAI API密钥¶
在此步骤中,我们设置日志记录以监控程序执行并在需要时进行调试。我们还配置OpenAI API密钥,这对于使用OpenAI服务至关重要。请将"YOUR_KEY_HERE"替换为您实际的OpenAI API密钥。
在 [ ] 中
已复制!
import logging
import sys, json
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging import sys, json logging.basicConfig(stream=sys.stdout, level=logging.INFO) logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout)) import os import openai # os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE" openai.api_key = os.environ["OPENAI_API_KEY"]
在 [ ] 中
已复制!
import os
import openai
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]
import os import openai # os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE" # 直接设置密钥的示例 openai.api_key = os.environ["OPENAI_API_KEY"]
设置预期的JSON输出定义(JSON Schema)¶
这里我们使用Pydantic库定义一个名为EmailData
的Python类。该类对我们希望从电子邮件中提取的数据结构进行建模,包括发件人、收件人、电子邮件日期和时间,以及包含在该ETF下交易的股票列表的etfs。
在 [ ] 中
已复制!
from pydantic import BaseModel, Field
from typing import List
class Instrument(BaseModel):
"""Datamodel for ticker trading details."""
direction: str = Field(description="ticker trading - Buy, Sell, Hold etc")
ticker: str = Field(
description="Stock Ticker. 1-4 character code. Example: AAPL, TSLS, MSFT, VZ"
)
company_name: str = Field(
description="Company name corresponding to ticker"
)
shares_traded: float = Field(description="Number of shares traded")
percent_of_etf: float = Field(description="Percentage of ETF")
class Etf(BaseModel):
"""ETF trading data model"""
etf_ticker: str = Field(
description="ETF Ticker code. Example: ARKK, FSPTX"
)
trade_date: str = Field(description="Date of trading")
stocks: List[Instrument] = Field(
description="List of instruments or shares traded under this etf"
)
class EmailData(BaseModel):
"""Data model for email extracted information."""
etfs: List[Etf] = Field(
description="List of ETFs described in email having list of shares traded under it"
)
trade_notification_date: str = Field(
description="Date of trade notification"
)
sender_email_id: str = Field(description="Email Id of the email sender.")
email_date_time: str = Field(description="Date and time of email")
from pydantic import BaseModel, Field from typing import List class Instrument(BaseModel): """股票交易详情的数据模型。""" direction: str = Field(description="ticker trading - Buy, Sell, Hold etc") ticker: str = Field( description="股票代码。1-4个字符的代码。示例:AAPL, TSLS, MSFT, VZ" ) company_name: str = Field( description="与股票代码对应的公司名称" ) shares_traded: float = Field(description="交易的股票数量") percent_of_etf: float = Field(description="占ETF的百分比") class Etf(BaseModel): """ETF交易数据模型""" etf_ticker: str = Field( description="ETF股票代码。示例:ARKK, FSPTX" ) trade_date: str = Field(description="交易日期") stocks: List[Instrument] = Field( description="在此ETF下交易的股票或工具列表" ) class EmailData(BaseModel): """电子邮件提取信息的数据模型。""" etfs: List[Etf] = Field( description="电子邮件中描述的ETF列表,包含其下交易的股票列表" ) trade_notification_date: str = Field( description="交易通知日期" ) sender_email_id: str = Field(description="电子邮件发送者的邮箱ID。") email_date_time: str = Field(description="电子邮件的日期和时间")
从.eml / .msg文件加载内容¶
在此步骤中,我们将使用来自llama-hub
的UnstructuredReader
加载.eml电子邮件文件或.msg Outlook文件的内容。然后将此文件的内容存储在一个变量中以供进一步处理。
在 [ ] 中
已复制!
# get donload_loader
from llama_index.core import download_loader
# get donload_loader from llama_index.core import download_loader
在 [ ] 中
已复制!
# Create a download loader
from llama_index.readers.file import UnstructuredReader
# Initialize the UnstructuredReader
loader = UnstructuredReader()
# For eml file
eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml")
email_content = eml_documents[0].text
print("\n\n Email contents")
print(email_content)
# 创建下载加载器 from llama_index.readers.file import UnstructuredReader # 初始化UnstructuredReader loader = UnstructuredReader() # 对于eml文件 eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml") email_content = eml_documents[0].text print("\n\n Email contents") print(email_content)
在 [ ] 中
已复制!
# For Outlook msg
msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg")
msg_content = msg_documents[0].text
print("\n\n Outlook contents")
print(msg_content)
# 对于Outlook msg msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg") msg_content = msg_documents[0].text print("\n\n Outlook contents") print(msg_content)
使用LLM函数提取JSON格式的内容¶
在最后一步,我们利用llama_index
包创建一个提示模板,用于从已加载的电子邮件中提取见解。使用OpenAI
模型的实例来解释电子邮件内容,并根据我们预定义的EmailData
模式提取相关信息。然后将输出转换为字典格式,以便于查看和处理。
在 [ ] 中
已复制!
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
from llama_index.program.openai import OpenAIPydanticProgram from llama_index.core import ChatPromptTemplate from llama_index.core.llms import ChatMessage from llama_index.llms.openai import OpenAI prompt = ChatPromptTemplate( message_templates=[ ChatMessage( role="system", content=( "你是一个专家助手,用于以JSON格式从电子邮件中提取见解。\n" "你根据提供的JSON模式从给定的电子邮件消息中提取数据并以JSON格式返回。\n" "记住只返回从提供的电子邮件消息中提取的数据。" ), ), ChatMessage( role="user", content=( "电子邮件消息:\n" "------\n" "{email_msg_content}\n" "------" ), ), ] ) llm = OpenAI(model="gpt-3.5-turbo-1106") program = OpenAIPydanticProgram.from_defaults( output_cls=EmailData, llm=llm, prompt=prompt, verbose=True, )
在 [ ] 中
已复制!
prompt = ChatPromptTemplate(
message_templates=[
ChatMessage(
role="system",
content=(
"You are an expert assitant for extracting insights from email in JSON format. \n"
"You extract data and returns it in JSON format, according to provided JSON schema, from given email message. \n"
"REMEMBER to return extracted data only from provided email message."
),
),
ChatMessage(
role="user",
content=(
"Email Message: \n" "------\n" "{email_msg_content}\n" "------"
),
),
]
)
llm = OpenAI(model="gpt-3.5-turbo-1106")
program = OpenAIPydanticProgram.from_defaults(
output_cls=EmailData,
llm=llm,
prompt=prompt,
verbose=True,
)
prompt = ChatPromptTemplate( message_templates=[ ChatMessage( role="system", content=( "你是一个专家助手,负责以 JSON 格式从电子邮件中提取洞察。 \n" "你根据提供的 JSON 模式,从给定的电子邮件消息中提取数据并以 JSON 格式返回。 \n" "切记,只从提供的电子邮件消息中返回提取的数据。" ), ), ChatMessage( role="user", content=( "电子邮件消息: \n" "------\n" "{email_msg_content}\n" "------" ), ), ] ) llm = OpenAI(model="gpt-3.5-turbo-1106") program = OpenAIPydanticProgram.from_defaults( output_cls=EmailData, llm=llm, prompt=prompt, verbose=True, )
在 [ ] 中
已复制!
output = program(email_msg_content=email_content)
print("Output JSON From .eml File: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=email_content) print("从.eml文件输出的JSON:") print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.016},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}],"trade_date":"1/12/2024"}],"trade_notification_date":"1/12/2024","sender_email_id":"[email protected]","email_date_time":"1/12/2024"} Output JSON From .eml File: { "etfs": [ { "etf_ticker": "ARKK", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 93654.0, "percent_of_etf": 0.2453 }, { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 159506.0, "percent_of_etf": 0.0907 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 86268.0, "percent_of_etf": 0.0669 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 289619.0, "percent_of_etf": 0.0391 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 927.0, "percent_of_etf": 0.0001 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 100766.0, "percent_of_etf": 0.0829 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 108523.0, "percent_of_etf": 0.0957 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 302096.0, "percent_of_etf": 0.0958 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 553172.0, "percent_of_etf": 0.1476 } ] }, { "etf_ticker": "ARKW", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 18148.0, "percent_of_etf": 0.2454 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 49.0, "percent_of_etf": 0.0 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 9756.0, "percent_of_etf": 0.016 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 21849.0, "percent_of_etf": 0.0994 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 105944.0, "percent_of_etf": 0.1459 } ] }, { "etf_ticker": "ARKG", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 38042.0, "percent_of_etf": 0.0864 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 21197.0, "percent_of_etf": 0.0656 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 67422.0, "percent_of_etf": 0.0363 }, { "direction": "Buy", "ticker": "RPTX", "company_name": "REPARE THERAPEUTICS INC", "shares_traded": 15410.0, "percent_of_etf": 0.0049 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 32057.0, "percent_of_etf": 0.1052 } ] } ], "trade_notification_date": "1/12/2024", "sender_email_id": "[email protected]", "email_date_time": "1/12/2024" }
对于outlook消息¶
在 [ ] 中
已复制!
output = program(email_msg_content=msg_content)
print("Output JSON from .msg file: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=msg_content) print("从.msg文件输出的JSON:") print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}]},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.0160},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}]},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}]}],"trade_notification_date":"1/12/2024","sender_email_id":"ark-invest.com","email_date_time":"1/12/2024"} Output JSON : { "etfs": [ { "etf_ticker": "ARKK", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 93654.0, "percent_of_etf": 0.2453 }, { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 159506.0, "percent_of_etf": 0.0907 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 86268.0, "percent_of_etf": 0.0669 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 289619.0, "percent_of_etf": 0.0391 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 927.0, "percent_of_etf": 0.0001 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 100766.0, "percent_of_etf": 0.0829 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 108523.0, "percent_of_etf": 0.0957 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 302096.0, "percent_of_etf": 0.0958 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 553172.0, "percent_of_etf": 0.1476 } ] }, { "etf_ticker": "ARKW", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 18148.0, "percent_of_etf": 0.2454 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 49.0, "percent_of_etf": 0.0 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 9756.0, "percent_of_etf": 0.016 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 21849.0, "percent_of_etf": 0.0994 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 105944.0, "percent_of_etf": 0.1459 } ] }, { "etf_ticker": "ARKG", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 38042.0, "percent_of_etf": 0.0864 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 21197.0, "percent_of_etf": 0.0656 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 67422.0, "percent_of_etf": 0.0363 }, { "direction": "Buy", "ticker": "RPTX", "company_name": "REPARE THERAPEUTICS INC", "shares_traded": 15410.0, "percent_of_etf": 0.0049 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 32057.0, "percent_of_etf": 0.1052 } ] } ], "trade_notification_date": "1/12/2024", "sender_email_id": "ark-invest.com", "email_date_time": "1/12/2024" }