Multimodal Structured Outputs: GPT-4o vs. Other GPT-4 Variants
In this notebook, we use the `MultiModalLLMCompletionProgram` class to perform structured data extraction from images. We compare the various vision-capable GPT-4 models.
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
The Image Dataset: PaperCards
For this data extraction task, we use a multimodal LLM to extract information from so-called PaperCards: visual summaries of research papers. The dataset can be downloaded from our Dropbox account by running the commands below.
Download The Images
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
Load The PaperCards As ImageDocuments
import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document
# context images
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
# let's see one
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
Build Our MultiModalLLMCompletionProgram (Multimodal Structured Outputs)
The Desired Structured Output
Here, we define our data class (i.e., a Pydantic BaseModel) that will hold the data we extract from a given image, or PaperCard.
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional
# Desired output structure
class PaperCard(BaseModel):
    """Data class for storing text attributes of a PaperCard."""

    title: str = Field(description="Title of paper.")
    year: str = Field(description="Year of publication of paper.")
    authors: str = Field(description="Authors of paper.")
    arxiv_id: str = Field(description="Arxiv paper id.")
    main_contribution: str = Field(
        description="Main contribution of the paper."
    )
    insights: str = Field(
        description="Main insight or motivation for the paper."
    )
    main_results: List[str] = Field(
        description="The main results of the paper."
    )
    tech_bits: Optional[str] = Field(
        description="Describe what's being displayed in the technical bits section of the image."
    )
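If you'd like to inspect the JSON schema that the completion program will ask the model to populate, you can print it directly from the Pydantic class (an optional sanity check, not part of the original flow; depending on your Pydantic version, `schema_json` may emit a deprecation warning):
# Optional: inspect the schema derived from the PaperCard data class
print(PaperCard.schema_json(indent=2))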
Next, we define our `MultiModalLLMCompletionProgram`. Here, we define three separate programs, one for each of the vision-capable GPT-4 models: GPT-4o, GPT-4v, and GPT-4turbo.
paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store into the
provided data class.
"""
gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)
gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)
gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)
multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}
programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}
Let's Do A Test Run
# Please ensure you're using llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
papercard
PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')
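The program returns a `PaperCard` instance, so the extracted fields are available as plain attributes, for example:
# Access individual extracted fields on the returned Pydantic object
print(papercard.title)
print(papercard.arxiv_id)
for result in papercard.main_results:
    print("-", result)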
Run The Data Extraction Task
Now that we've tested our programs, we're ready to apply them to the PaperCard data extraction task!
import time
import tqdm
results = {}
for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()
Model: gpt_4o
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]
Model: gpt_4v
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]
Model: gpt_4turbo
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]
Quantitative Analysis
Here, we perform a quick quantitative analysis of the programs. Specifically, we compare the total number of failures, the total execution time over the successful extraction tasks, and the average and median execution times.
import numpy as np
import pandas as pd
metrics = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}
# error counts and execution-time stats per model
for mdl_name, mdl_results in results.items():
    metrics[mdl_name]["error_count"] = sum(
        el is not None for el in mdl_results["failures"]
    )
    metrics[mdl_name]["total_execution_time"] = sum(
        el for el in mdl_results["execution_times"] if el is not None
    )
    metrics[mdl_name]["average_execution_time"] = metrics[mdl_name][
        "total_execution_time"
    ] / (len(image_documents) - metrics[mdl_name]["error_count"])
    metrics[mdl_name]["median_execution_time"] = np.percentile(
        [el for el in mdl_results["execution_times"] if el is not None],
        q=50,  # np.percentile takes q in percent, so the median is q=50
    )
pd.DataFrame(metrics)
|  | gpt_4o | gpt_4v | gpt_4turbo |
| --- | --- | --- | --- |
| error_count | 0.000000 | 14.000000 | 1.000000 |
| total_execution_time | 541.128802 | 586.500559 | 762.130032 |
| average_execution_time | 15.460823 | 27.928598 | 22.415589 |
| median_execution_time | 5.377015 | 11.879649 | 7.177287 |
GPT-4o Is Indeed Faster!
- GPT-4o is clearly faster in terms of total execution time (computed here over successful extractions only), as well as average and median execution time.
- GPT-4o is not only faster, it also successfully extracted data from every PaperCard, whereas GPT-4v failed 14 times and GPT-4turbo failed once; the stored exceptions can be inspected with the sketch below.
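Since the exceptions were stored alongside the image paths in `results`, a small sketch like the following lists which PaperCards a given model failed on and why:
# Inspect the failures recorded for a given model, e.g. gpt_4v
for img_path, err in zip(
    results["gpt_4v"]["image_paths"], results["gpt_4v"]["failures"]
):
    if err is not None:
        print(f"{img_path}: {type(err).__name__}: {err}")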
Qualitative Analysis
In this final section, we perform a qualitative analysis of the extraction results. Ultimately, we end up with a "labelled" dataset of human evaluations of the extraction tasks. The utilities provided next let you manually assess the results of all three programs (or models) on each PaperCard data extraction task. Your job as the annotator is to grade each result from 0 to 5, with 5 representing a perfect data extraction.
from IPython.display import clear_output
def display_results_and_papercard(ix: int):
    # image
    image_path = results["gpt_4o"]["image_paths"][ix]
    # outputs
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]
    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()
    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}
def manual_evaluation_single(img_ix: int):
    """Update the GRADES dictionary for a single PaperCard
    data extraction task.
    """
    display_results_and_papercard(img_ix)
    gpt_4o_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4o."
    )
    gpt_4v_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4v."
    )
    gpt_4turbo_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo."
    )
    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade
def manual_evaluations(img_ix: Optional[int] = None):
    """An interactive program for manually grading gpt-4 variants on the
    task of PaperCard data extraction.
    """
    if img_ix is None:
        # mark all results
        for ix in range(len(image_documents)):
            print(f"You are marking {ix + 1} out of {len(image_documents)}")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)
manual_evaluations()
You are marking 35 out of 35
GPT-4o

{
    "title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, Seungone et al.",
    "arxiv_id": "arxiv:2310.08441",
    "main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.",
    "insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.",
    "main_results": [
        "Prometheus matches or outperforms GPT-4.",
        "Prometheus can function as a reward model.",
        "Reference answers are crucial for fine-grained evaluation."
    ],
    "tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations"
}

============================================

GPT-4v

{
    "title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.08941",
    "main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.",
    "insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.",
    "main_results": [
        "The main results section would list the key findings or results of the paper as described in the image."
    ],
    "tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image."
}

============================================

GPT-4turbo

{
    "title": "Prometheus: Evaluating Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.05941",
    "main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.",
    "insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.",
    "main_results": [
        "Prometheus LM outperforms GPT-4 on targeted feedback tasks.",
        "Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.",
        "Feedback quality was better as reported by human judges."
    ],
    "tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models."
}

============================================
Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3
Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5
Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
gpt_4o        3.585714
gpt_4v        1.300000
gpt_4turbo    2.128571
dtype: float64
Observations Table
In the table below, we list our general observations for each component we wanted to extract from the PaperCards. GPT-4v and GPT-4turbo performed similarly, with GPT-4turbo having a slight edge. Overall, GPT-4o performed markedly better than the other models on this extraction task. Finally, all models seemed to struggle with describing the Tech Bits section of the PaperCards, and at times every model produced a summary rather than an exact extraction; however, GPT-4o did so less often than the others.

| Extracted Component | GPT-4o | GPT-4v & GPT-4turbo |
| --- | --- | --- |
| Title, Year, Authors | Very good, probably close to 100% | Around 80%, hallucinated on a few examples |
| Arxiv ID | Good, roughly 95% accuracy | Around 70% accuracy |
| Main Contribution | Good (~80%), but unable to extract multiple listed contributions | Not great, around 60% accuracy with some hallucination |
| Insights | Not great (~65%); tended to summarize rather than extract | Tended to summarize rather than extract |
| Main Results | Very good at extracting the summary statements of the main results | A lot of hallucination here |
| Tech Bits | Unable to produce detailed descriptions of the diagrams here | Unable to produce detailed descriptions of the diagrams here |
In Summary
- GPT-4o is faster than GPT-4v and GPT-4turbo, and it had fewer failures (zero!)
- GPT-4o produces better data extraction results than GPT-4v and GPT-4turbo
- GPT-4o is very good at extracting facts from the PaperCards: title, authors, year, and the bulleted statements in the Main Results section
- GPT-4v and GPT-4turbo often hallucinated the main results, and at times the authors as well
- With better prompting, GPT-4o's results could likely be improved further, especially for extracting the Insights section and describing the Tech Bits; a sketch of such a prompt follows below
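As a starting point for such prompt tuning (an untested sketch, not used in the runs above), a more explicit prompt could instruct the model to extract Insights and Main Results verbatim and to walk through the Tech Bits diagram component by component:
# Hypothetical, more explicit prompt; pass it as prompt_template_str when
# building the MultiModalLLMCompletionProgram to experiment with it.
detailed_paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store it in the
provided data class.
- Copy the title, year, authors, and arxiv id exactly as they appear on the card.
- For insights and main_results, extract the text verbatim rather than summarizing it.
- For tech_bits, describe each diagram or component in the technical bits
  section one by one instead of giving a one-line summary.
"""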