Multimodal Structured Outputs: GPT-4o vs. Other GPT-4 Variants
In this notebook, we use the `MultiModalLLMCompletionProgram` class to perform structured data extraction from images. We compare the various vision-capable GPT-4 models.
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
The Image Dataset: PaperCards
For this data extraction task, we use a multimodal LLM to extract information from so-called PaperCards: visual summaries of research papers. The dataset can be downloaded from our Dropbox account by running the commands below.
Download The Images
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
Load The PaperCards As ImageDocuments
import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document
# context images
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
# let's see one
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
Build Our MultiModalLLMCompletionProgram (Multimodal Structured Outputs)
The Desired Structured Output
Here, we define our data class (i.e., a Pydantic BaseModel) that will hold the data we extract from a given image, or PaperCard.
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional
# Desired output structure
class PaperCard(BaseModel):
    """Data class for storing text attributes of a PaperCard."""

    title: str = Field(description="Title of paper.")
    year: str = Field(description="Year of publication of paper.")
    authors: str = Field(description="Authors of paper.")
    arxiv_id: str = Field(description="Arxiv paper id.")
    main_contribution: str = Field(
        description="Main contribution of the paper."
    )
    insights: str = Field(
        description="Main insight or motivation for the paper."
    )
    main_results: List[str] = Field(
        description="The main results of the paper."
    )
    tech_bits: Optional[str] = Field(
        description="Describe what's being displayed in the technical bits section of the image."
    )
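If you'd like to inspect the JSON schema that the completion program will ask the model to populate, you can print it directly from the Pydantic class (an optional sanity check, not part of the original flow; depending on your Pydantic version, `schema_json` may emit a deprecation warning):
# Optional: inspect the schema derived from the PaperCard data class
print(PaperCard.schema_json(indent=2))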
Next, we define our `MultiModalLLMCompletionProgram`. Here, we define three separate programs, one for each of the vision-capable GPT-4 models: GPT-4o, GPT-4v, and GPT-4turbo.
paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store into the
provided data class.
"""
gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)
gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)
gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)
multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}
programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}
Let's Do A Test Run
# Please ensure you're using llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
papercard
PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')
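The program returns a `PaperCard` instance, so the extracted fields are available as plain attributes, for example:
# Access individual extracted fields on the returned Pydantic object
print(papercard.title)
print(papercard.arxiv_id)
for result in papercard.main_results:
    print("-", result)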
Run The Data Extraction Task
Now that we've tested our programs, we're ready to apply them to the PaperCard data extraction task!
import time
import tqdm
results = {}
for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()
Model: gpt_4o
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]
Model: gpt_4v
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]
Model: gpt_4turbo
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]
Quantitative Analysis
Here, we perform a quick quantitative analysis of the programs. Specifically, we compare the total number of failures, the total execution time over the successful extraction tasks, and the average and median execution times.
import numpy as np
import pandas as pd
metrics = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}
# error counts and execution-time stats per model
for mdl_name, mdl_results in results.items():
    metrics[mdl_name]["error_count"] = sum(
        el is not None for el in mdl_results["failures"]
    )
    metrics[mdl_name]["total_execution_time"] = sum(
        el for el in mdl_results["execution_times"] if el is not None
    )
    metrics[mdl_name]["average_execution_time"] = metrics[mdl_name][
        "total_execution_time"
    ] / (len(image_documents) - metrics[mdl_name]["error_count"])
    metrics[mdl_name]["median_execution_time"] = np.percentile(
        [el for el in mdl_results["execution_times"] if el is not None],
        q=50,  # np.percentile takes q in percent, so the median is q=50
    )
pd.DataFrame(metrics)
|  | gpt_4o | gpt_4v | gpt_4turbo |
| --- | --- | --- | --- |
| error_count | 0.000000 | 14.000000 | 1.000000 |
| total_execution_time | 541.128802 | 586.500559 | 762.130032 |
| average_execution_time | 15.460823 | 27.928598 | 22.415589 |
| median_execution_time | 5.377015 | 11.879649 | 7.177287 |
GPT-4o Is Indeed Faster!
- GPT-4o is clearly faster in terms of total execution time (computed here over successful extractions only), as well as average and median execution time.
- GPT-4o is not only faster, it also successfully extracted data from every PaperCard, whereas GPT-4v failed 14 times and GPT-4turbo failed once; the stored exceptions can be inspected with the sketch below.
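Since the exceptions were stored alongside the image paths in `results`, a small sketch like the following lists which PaperCards a given model failed on and why:
# Inspect the failures recorded for a given model, e.g. gpt_4v
for img_path, err in zip(
    results["gpt_4v"]["image_paths"], results["gpt_4v"]["failures"]
):
    if err is not None:
        print(f"{img_path}: {type(err).__name__}: {err}")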
Qualitative Analysis
In this final section, we perform a qualitative analysis of the extraction results. Ultimately, we end up with a "labelled" dataset of human evaluations of the extraction tasks. The utilities provided next let you manually assess the results of all three programs (or models) on each PaperCard data extraction task. Your job as the annotator is to grade each result from 0 to 5, with 5 representing a perfect data extraction.
from IPython.display import clear_output
def display_results_and_papercard(ix: int):
    # image
    image_path = results["gpt_4o"]["image_paths"][ix]
    # outputs
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]
    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()
    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("Failed to extract data")
    print()
    print("============================================\n")
GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}
def manual_evaluation_single(img_ix: int):
    """Update the GRADES dictionary for a single PaperCard
    data extraction task.
    """
    display_results_and_papercard(img_ix)
    gpt_4o_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4o."
    )
    gpt_4v_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4v."
    )
    gpt_4turbo_grade = input(
        "Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo."
    )
    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade
def manual_evaluations(img_ix: Optional[int] = None):
    """An interactive program for manually grading gpt-4 variants on the
    task of PaperCard data extraction.
    """
    if img_ix is None:
        # mark all results
        for ix in range(len(image_documents)):
            print(f"You are marking {ix + 1} out of {len(image_documents)}")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)
manual_evaluations()
You are marking 35 out of 35
GPT-4o

{
    "title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, Seungone et al.",
    "arxiv_id": "arxiv:2310.08441",
    "main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.",
    "insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.",
    "main_results": [
        "Prometheus matches or outperforms GPT-4.",
        "Prometheus can function as a reward model.",
        "Reference answers are crucial for fine-grained evaluation."
    ],
    "tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations"
}

============================================

GPT-4v

{
    "title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.08941",
    "main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.",
    "insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.",
    "main_results": [
        "The main results section would list the key findings or results of the paper as described in the image."
    ],
    "tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image."
}

============================================

GPT-4turbo

{
    "title": "Prometheus: Evaluating Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.05941",
    "main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.",
    "insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.",
    "main_results": [
        "Prometheus LM outperforms GPT-4 on targeted feedback tasks.",
        "Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.",
        "Feedback quality was better as reported by human judges."
    ],
    "tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models."
}

============================================
Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3
Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5
Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
gpt_4o        3.585714
gpt_4v        1.300000
gpt_4turbo    2.128571
dtype: float64
Observations Table
In the table below, we list our general observations for each component we wanted to extract from the PaperCards. GPT-4v and GPT-4turbo performed similarly, with GPT-4turbo having a slight edge. Overall, GPT-4o performed markedly better than the other models on this extraction task. Finally, all models seemed to struggle with describing the Tech Bits section of the PaperCards, and at times every model produced a summary rather than an exact extraction; however, GPT-4o did so less often than the others.

| Extracted Component | GPT-4o | GPT-4v & GPT-4turbo |
| --- | --- | --- |
| Title, Year, Authors | Very good, probably close to 100% | Around 80%, hallucinated on a few examples |
| Arxiv ID | Good, roughly 95% accuracy | Around 70% accuracy |
| Main Contribution | Good (~80%), but unable to extract multiple listed contributions | Not great, around 60% accuracy with some hallucination |
| Insights | Not great (~65%); tended to summarize rather than extract | Tended to summarize rather than extract |
| Main Results | Very good at extracting the summary statements of the main results | A lot of hallucination here |
| Tech Bits | Unable to produce detailed descriptions of the diagrams here | Unable to produce detailed descriptions of the diagrams here |
In Summary
- GPT-4o is faster than GPT-4v and GPT-4turbo, and it had fewer failures (zero!)
- GPT-4o produces better data extraction results than GPT-4v and GPT-4turbo
- GPT-4o is very good at extracting facts from the PaperCards: title, authors, year, and the bulleted statements in the Main Results section
- GPT-4v and GPT-4turbo often hallucinated the main results, and at times the authors as well
- With better prompting, GPT-4o's results could likely be improved further, especially for extracting the Insights section and describing the Tech Bits; a sketch of such a prompt follows below
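As a starting point for such prompt tuning (an untested sketch, not used in the runs above), a more explicit prompt could instruct the model to extract Insights and Main Results verbatim and to walk through the Tech Bits diagram component by component:
# Hypothetical, more explicit prompt; pass it as prompt_template_str when
# building the MultiModalLLMCompletionProgram to experiment with it.
detailed_paper_card_extraction_prompt = """
Use the attached PaperCard image to extract data from it and store it in the
provided data class.
- Copy the title, year, authors, and arxiv id exactly as they appear on the card.
- For insights and main_results, extract the text verbatim rather than summarizing it.
- For tech_bits, describe each diagram or component in the technical bits
  section one by one instead of giving a one-line summary.
"""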