Multi-Modal LLM using Replicate LlaVa, Fuyu 8B, MiniGPT4 models for image reasoning¶

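In this notebook, we show how to use the MultiModal LLM class for image understanding/reasoning. The models exercised here are LLaVa-13B, Fuyu-8B, and MiniGPT-4.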

In the second part, we show how to use stream complete and async complete with Replicate.

NOTE: At the moment, the Replicate multi-modal LLMs support only one image document per call.
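
Because only one image document is accepted per call, a batch of images has to be sent as separate requests. The following is a minimal sketch of that pattern. `complete_each` and `EchoLLM` are hypothetical names used for illustration, and the stub stands in for a real multi-modal LLM instance so the snippet runs without network access.

```python
from typing import Any, List


def complete_each(llm: Any, prompt: str, image_documents: List[Any]) -> List[Any]:
    """Call llm.complete once per image, passing a single-item list each time."""
    return [
        llm.complete(prompt=prompt, image_documents=[doc])
        for doc in image_documents
    ]


class EchoLLM:
    """Hypothetical stand-in for a multi-modal LLM; not part of llama-index."""

    def complete(self, prompt: str, image_documents: List[Any]) -> str:
        # Mirror the one-image-per-call restriction described above.
        if len(image_documents) != 1:
            raise ValueError("exactly one image document per call")
        return f"{prompt} -> {image_documents[0]}"


answers = complete_each(EchoLLM(), "what is shown?", ["0.png", "1.png"])
```

The same helper should work with any object that exposes a `complete(prompt=..., image_documents=...)` method, which is the call shape used later in this notebook.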

In [ ]

%pip install llama-index-multi-modal-llms-replicate

In [ ]

%pip install replicate

Load and initialize Replicate¶

In [ ]

import os

REPLICATE_API_TOKEN = ""  # Your Replicate API token here
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
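
Hard-coding a token in a notebook makes it easy to leak. As a hedged alternative, the sketch below reads the token from an environment mapping and only prompts when it is missing. `resolve_token` and its `ask` parameter are hypothetical helpers introduced here, not part of Replicate or llama-index.

```python
from typing import Callable, Mapping


def resolve_token(
    env: Mapping[str, str],
    ask: Callable[[str], str],
    name: str = "REPLICATE_API_TOKEN",
) -> str:
    """Return the token from `env` if set, otherwise ask the user for it."""
    token = env.get(name, "").strip()
    if not token:
        token = ask(f"{name}: ").strip()
    if not token:
        raise ValueError(f"{name} is empty")
    return token


# In a notebook, one might call (assumption: interactive session):
#   import getpass, os
#   os.environ["REPLICATE_API_TOKEN"] = resolve_token(os.environ, getpass.getpass)
```

Passing the prompt function in as an argument keeps the helper testable without an interactive terminal.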

Download images and load them locally¶

In [ ]

import os
from io import BytesIO

import requests
from PIL import Image

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core.schema import ImageDocument

os.makedirs("test_images", exist_ok=True)

# For now, the fuyu-8b model on Replicate handles JPG image URLs more
# reliably than local files.
image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

# Download each image and save it locally
for idx, image_url in enumerate(image_urls):
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()  # fail early on a bad download
    img = Image.open(BytesIO(response.content))
    img.save(f"test_images/{idx}.png")

# Option 1: load images directly from URLs
# image_documents = load_image_urls(image_urls)
# Option 2: load images from local files
image_documents = [
    ImageDocument(image_path=f"test_images/{idx}.png")
    for idx in range(len(image_urls))
]
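
The comment in the cell above notes that fuyu-8b on Replicate copes best with JPG image URLs. If the URL list comes from elsewhere, a quick pre-filter can help. This is a sketch only: `is_jpg_url` is a hypothetical helper, the `example.com` address is a placeholder, and checking the file extension is a heuristic that does not verify the actual content type.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_jpg_url(url: str) -> bool:
    """True if the URL path ends in .jpg or .jpeg (query string ignored)."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in {".jpg", ".jpeg"}


urls = [
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
    "https://example.com/chart.png?size=large",  # placeholder URL
]
jpg_urls = [u for u in urls if is_jpg_url(u)]
```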

Visualize images¶

In [ ]

import matplotlib.pyplot as plt
from llama_index.core.response.notebook_utils import display_image_uris

image_paths = [str(img_doc.image_path) for img_doc in image_documents]
display_image_uris(image_paths)

Provide various prompts to test different multi-modal LLMs¶

In [ ]

from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import (
    REPLICATE_MULTI_MODAL_LLM_MODELS,
)

prompts = [
    "what is shown in this image?",
    "how many people are shown in the image?",
    "is there anything unusual in the image?",
]

Generate image reasoning results from different LLMs with different prompts for different images¶

In [ ]

res = []
for prompt in prompts:
    for image_idx, image_doc in enumerate(image_documents):
        for llm_model, model_id in REPLICATE_MULTI_MODAL_LLM_MODELS.items():
            try:
                # Initialize the multi-modal LLM
                multi_modal_llm = ReplicateMultiModal(
                    model=model_id,
                    max_new_tokens=100,
                    temperature=0.1,
                    num_input_files=1,
                    top_p=0.9,
                    num_beams=1,
                    repetition_penalty=1,
                )

                # Only one image document is supported per call
                mm_resp = multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[image_doc],
                )
            except Exception as e:
                print(
                    f"Error with LLM model inference with prompt {prompt}, "
                    f"image {image_idx}, and MM model {llm_model}"
                )
                print("Inference failed due to:", e)
                continue
            res.append(
                {
                    "model": llm_model,
                    "prompt": prompt,
                    "response": mm_resp,
                    "image": str(image_doc.image_path),
                }
            )
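
The loop above constructs a new client for every prompt and image. One possible refinement, sketched here under assumptions, is to build each client once and reuse it. `build_models` and `fake_factory` are hypothetical names, and the model ids are placeholders. With llama-index installed, the factory would be `ReplicateMultiModal` and the mapping would be `REPLICATE_MULTI_MODAL_LLM_MODELS`.

```python
from typing import Any, Callable, Dict, List


def build_models(
    model_ids: Dict[str, str], factory: Callable[..., Any], **kwargs: Any
) -> Dict[str, Any]:
    """Create one client per model name so later loops can reuse them."""
    return {
        name: factory(model=model_id, **kwargs)
        for name, model_id in model_ids.items()
    }


# Stand-in factory so the sketch runs offline; it records each construction.
created: List[str] = []


def fake_factory(**kwargs: Any) -> Dict[str, Any]:
    created.append(kwargs["model"])
    return kwargs


models = build_models(
    {"fuyu-8b": "id-a", "llava-13b": "id-b"},  # placeholder ids
    fake_factory,
    temperature=0.1,
    max_new_tokens=100,
)
```

The inner loop can then look up `models[llm_model]` instead of constructing a client on every iteration.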

Display sampled responses from the multi-modal LLMs¶

In [ ]

from IPython.display import display
import pandas as pd

pd.options.display.max_colwidth = None
df = pd.DataFrame(res)
display(df[:5])

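	model	prompt	response	image
0	llava-13b	what is shown in this image?	The image shows a man holding a golden trophy, possibly a soccer trophy, and he is wearing a suit and tie.	test_images/0.png
1	fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	test_images/0.png
2	minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	test_images/0.png
3	llava-13b	what is shown in this image?	The image shows a large, illuminated building, which is the Colosseum in Rome, Italy. The building is lit up at night, creating a beautiful and dramatic effect.	test_images/1.png
4	fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	test_images/1.png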
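
The `response` column above holds whatever `complete` returned. Later in this notebook `print(resp)` shows the generated text with a leading space, so it may be convenient to store stripped plain strings before building the DataFrame. The sketch below assumes that `str(response)` yields the generated text. `to_text_rows` and `FakeResponse` are hypothetical names.

```python
from typing import Any, Dict, List


def to_text_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with `response` as a stripped string."""
    return [{**row, "response": str(row["response"]).strip()} for row in rows]


class FakeResponse:
    """Stand-in for a completion response object; illustration only."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


rows = to_text_rows(
    [
        {
            "model": "fuyu-8b",
            "prompt": "what is shown in this image?",
            "response": FakeResponse(" A man holding a trophy."),
            "image": "test_images/0.png",
        }
    ]
)
```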

Human-label the correctness and relevance of the multi-modal LLM reasoning results¶

Note that human labeling of relevance and correctness may carry some bias/subjectivity.

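We label the correctness and relevance scores in the range [1, 5]:
5: perfectly answers the question
4: somewhat answers the question
3: partly answers the question
2: answers the question with a wrong answer
1: no answer or hallucination

Model	Prompt/Question	Model Reasoning Results	Correctness and Relevance [1,5]	Image
llava-13b	what is shown in this image?	The image shows a man holding a trophy, which appears to be a gold soccer ball. He is dressed in a suit and tie, and he is smiling as he holds the trophy.	4	test_images/0.png
fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	4	test_images/0.png
minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	4	test_images/0.png
llava-13b	what is shown in this image?	The image shows a large, illuminated building, likely a historical monument or a famous landmark, lit up at night with colored lights, creating a visually striking effect.	4	test_images/1.png
fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	4	test_images/1.png
minigpt-4	what is shown in this image?	The image shows the Colosseum, a large amphitheater in Rome, Italy. It is one of the most famous landmarks in the city and a popular tourist attraction.	5	test_images/1.png
llava-13b	what is shown in this image?	The image shows two animated minion characters from the movie "Despicable Me" standing next to each other.	5	test_images/2.png
fuyu-8b	what is shown in this image?	The image features two minion figures, one wearing overalls and the other wearing goggles. They are standing next to each other, possibly engaged in conversation or posing for a photo.	5	test_images/2.png
minigpt-4	what is shown in this image?	[response text missing]	4	test_images/2.png
llava-13b	how many people are shown in the image?	There is only one person shown in the image, a man holding a trophy.	5	test_images/0.png
fuyu-8b	how many people are shown in the image?	In the image, there is a man wearing a suit and holding a golden ball or trophy.	5	test_images/0.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Can you please provide the image again?	NA	test_images/0.png
llava-13b	how many people are shown in the image?	There are two people shown in the image.	1	test_images/1.png
fuyu-8b	how many people are shown in the image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.\n\nThere are several people walking along the street, some of them carrying handbags. A few benches can be seen along the sidewalk, providing a place for pedestrians to rest. The atmosphere is lively, with people going about their daily activities.	4	test_images/1.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Can you please provide more information or a different image?	NA	test_images/1.png
llava-13b	how many people are shown in the image?	There are two people shown in the image.	4	test_images/2.png
fuyu-8b	how many people are shown in the image?	There are two people shown in the image.	4	test_images/2.png
minigpt-4	how many people are shown in the image?	There are two minions in the image.	5	test_images/2.png
llava-13b	is there anything unusual in the image?	Yes, there is an unusual element in the image. The man is holding a gold trophy, which is not uncommon. However, he is also holding a microphone, which is an unusual combination. Typically, a person holding a microphone is a speaker or an announcer, while a trophy is usually associated with an award or achievement. The combination of the microphone and the trophy suggests that the man might be a presenter or an announcer who is receiving an award or acknowledging someone else's achievement.	4	test_images/0.png
fuyu-8b	is there anything unusual in the image?	Yes, there is something unusual in the image, as there is a man in a suit holding a golden ball or trophy. This is unusual because sports balls or trophies are typically associated with competitions or achievements, rather than being held by a person in a suit. The man holding the golden ball or trophy while wearing a suit adds to the peculiarity of the scene.	3	test_images/0.png
minigpt-4	is there anything unusual in the image?	The image shows a man in a black suit and tie holding a golden trophy. The trophy has a design on it that looks like a soccer ball. The man is smiling and appears to be happy.	3	test_images/0.png
llava-13b	is there anything unusual in the image?	Yes, there is an unusual aspect in the image. The Colosseum, a famous ancient Roman amphitheater, is lit up with colored lights, which is not a common sight. This illumination adds a unique and vibrant touch to the historic and majestic structure.	4	test_images/1.png
fuyu-8b	is there anything unusual in the image?	Yes, the image features an unusual scene, as multiple people are walking around a huge structure, which is an ancient Roman amphitheater. The amphitheater is illuminated with colorful lights, giving it a vibrant and lively atmosphere.	3	test_images/1.png
minigpt-4	is there anything unusual in the image?	The image shows the Colosseum, a large amphitheater in Rome, Italy. It is a well-known landmark and tourist attraction.	2	test_images/1.png
llava-13b	is there anything unusual in the image?	Yes, there is something unusual in the image. The two cartoon minions are standing next to each other, but one of them is missing a tooth. This is an unusual detail, as animated characters usually do not have imperfections such as missing teeth. The missing tooth adds a unique and interesting aspect to the image, making it stand out from typical animated scenes.	3	test_images/2.png
fuyu-8b	is there anything unusual in the image?	Yes, there is an unusual aspect in the image, as there are two minions dressed in overalls and wearing goggles, standing next to each other. This unusual combination is not commonly seen, as minions are usually associated with their popular animated film franchise.	2	test_images/2.png
minigpt-4	is there anything unusual in the image?	The image appears to be a cartoon character wearing overalls and a yellow shirt. The character is smiling and has a blue hat on its head. There is nothing unusual in the image.	5	test_images/2.png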
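
To get a rough per-model view of the labels, the scores in the table can be averaged while skipping the "NA" entries. The sketch below copies the scores from the table in row order. `mean_scores` is a hypothetical helper, and with so few subjectively labeled samples the averages are only indicative.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Scores copied from the table above, in row order; None marks an "NA" label.
labels: List[Tuple[str, Optional[int]]] = [
    ("llava-13b", 4), ("fuyu-8b", 4), ("minigpt-4", 4),
    ("llava-13b", 4), ("fuyu-8b", 4), ("minigpt-4", 5),
    ("llava-13b", 5), ("fuyu-8b", 5), ("minigpt-4", 4),
    ("llava-13b", 5), ("fuyu-8b", 5), ("minigpt-4", None),
    ("llava-13b", 1), ("fuyu-8b", 4), ("minigpt-4", None),
    ("llava-13b", 4), ("fuyu-8b", 4), ("minigpt-4", 5),
    ("llava-13b", 4), ("fuyu-8b", 3), ("minigpt-4", 3),
    ("llava-13b", 4), ("fuyu-8b", 3), ("minigpt-4", 2),
    ("llava-13b", 3), ("fuyu-8b", 2), ("minigpt-4", 5),
]


def mean_scores(pairs: List[Tuple[str, Optional[int]]]) -> Dict[str, float]:
    """Average the scores per model, ignoring unlabeled (None) rows."""
    by_model: Dict[str, List[int]] = defaultdict(list)
    for model, score in pairs:
        if score is not None:
            by_model[model].append(score)
    return {m: round(sum(s) / len(s), 2) for m, s in by_model.items()}


averages = mean_scores(labels)
```

Note that skipping "NA" rows favors a model that fails to answer. Counting those rows as 1 instead would be an equally defensible choice.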

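Summary of preliminary findings from the evaluated multi-modal models¶

First, the purpose of this notebook is to show how to use Replicate to serve different multi-modal LLMs for image reasoning tasks. The comparison has some limitations:

We compared and evaluated LLaVa-13B, Fuyu-8B, and MiniGPT-4 on only a few simple tasks/prompts.
The hyperparameters are identical across models in this example. Hyperparameter tuning could significantly affect the output quality of multi-modal LLMs.
Human evaluation may carry bias, subjectivity, and noise.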
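
Since all models share one set of hyperparameters here, a natural follow-up is to sweep a few values per model. The sketch below only generates the keyword-argument combinations. `param_grid` is a hypothetical helper and the candidate values are arbitrary examples, with the parameter names taken from the constructor calls in this notebook.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def param_grid(options: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one keyword-argument dict per combination of option values."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


# Example values are arbitrary; the notebook itself uses 0.1 / 0.9 / 100.
grid = list(
    param_grid(
        {
            "temperature": [0.1, 0.7],
            "top_p": [0.9, 1.0],
            "max_new_tokens": [100],
        }
    )
)
```

Each dict could then be unpacked into the model constructor alongside `model=...`. Every extra combination means another full round of inference calls.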

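Some preliminary findings:

MiniGPT-4 sometimes gives a more accurate answer, such as "There are two minions in the image." instead of the "There are two people shown in the image." returned by LlaVa and Fuyu-8B. Another example: for the image of the Colosseum in Italy, MiniGPT-4 names the Colosseum directly when asked "what is shown in this image?".
MiniGPT-4 failed to give results for two prompts. It answered "I'm sorry, but I cannot see the image you provided." even though it could answer other questions about the same images. It is unclear whether this is an issue with Replicate inference or with the MiniGPT-4 model itself.
Fuyu-8B and LlaVa-13B usually give longer, more detailed answers with more supporting context.
Llava-13B and Fuyu-8B sometimes hallucinate slightly, especially for the question "is there anything unusual in the image?".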
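
The "cannot see the image" replies could be flagged automatically and retried. The sketch below is a heuristic under stated assumptions: the marker phrases are guesses based on the replies quoted above, `looks_like_refusal` and `complete_with_retry` are hypothetical helpers, and it is not established that a retry actually fixes the problem.

```python
from typing import Any, Callable

# Assumed phrases; adjust to the wording the model really produces.
REFUSAL_MARKERS = (
    "cannot see the image",
    "can't see the image",
    "unable to see the image",
)


def looks_like_refusal(text: str) -> bool:
    """True if the reply contains one of the assumed refusal phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def complete_with_retry(call: Callable[[], Any], max_attempts: int = 2) -> str:
    """Invoke `call` until the reply is not a refusal or attempts run out."""
    answer = ""
    for _ in range(max_attempts):
        answer = str(call())
        if not looks_like_refusal(answer):
            break
    return answer
```

In practice `call` would be a zero-argument function wrapping the model's `complete(...)` call for one prompt and one image.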

Replicate stream complete, async complete, and async stream complete modes¶

Initialize the Fuyu-8B model¶

In [ ]

multi_modal_llm = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["fuyu-8b"],
    max_new_tokens=100,
    temperature=0.1,
    num_input_files=1,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1,
)

Using async stream complete mode¶

In [ ]

resp = await multi_modal_llm.astream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]

async for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
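
When the full text is needed after streaming, the deltas can be accumulated instead of printed. The sketch below uses a fake async generator in place of the real stream so it runs offline. `Chunk`, `fake_stream`, and `collect` are hypothetical names, and the only assumption about real chunks is that they expose a `delta` attribute, as the cell above uses.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class Chunk:
    """Stand-in for a streamed response chunk with a `delta` field."""

    delta: str


async def fake_stream() -> AsyncIterator[Chunk]:
    for piece in ["The image ", "features ", "a man."]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield Chunk(piece)


async def collect(stream: AsyncIterator[Chunk]) -> str:
    """Join all deltas from an async stream into one string."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk.delta)
    return "".join(parts)


# In a plain script; inside a notebook use `await collect(...)` instead,
# because an event loop is already running there.
text = asyncio.run(collect(fake_stream()))
```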

Using async complete mode¶

In [ ]

resp = await multi_modal_llm.acomplete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]

print(resp)

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
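
Because each call takes a single image, async completion pairs naturally with `asyncio.gather` to describe several images concurrently. The sketch below uses a fake client so it runs without network access. `FakeAsyncLLM` and `describe_all` are hypothetical names, and only the `acomplete(prompt=..., image_documents=...)` call shape is taken from the cell above.

```python
import asyncio
from typing import Any, List


class FakeAsyncLLM:
    """Stand-in for an async-capable multi-modal LLM; illustration only."""

    async def acomplete(self, prompt: str, image_documents: List[Any]) -> str:
        await asyncio.sleep(0)  # simulate waiting on the network
        return f"{prompt}: {image_documents[0]}"


async def describe_all(llm: Any, prompt: str, image_documents: List[Any]) -> List[Any]:
    """Issue one request per image and wait for all of them together."""
    tasks = [
        llm.acomplete(prompt=prompt, image_documents=[doc])
        for doc in image_documents
    ]
    # gather returns results in the same order as the inputs
    return await asyncio.gather(*tasks)


results = asyncio.run(
    describe_all(FakeAsyncLLM(), "tell me about this image", ["0.png", "1.png"])
)
```

Whether concurrent requests are actually faster depends on the serving backend and any rate limits, which this sketch does not model.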

Using stream complete mode¶

In [ ]

resp = multi_modal_llm.stream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

In [ ]

for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
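
One reason to stream is to show the first words sooner. The sketch below measures the time until the first chunk arrives while still assembling the full text. `Chunk`, `fake_stream`, and `consume` are hypothetical names, and the fake generator replaces the real stream so the example runs offline.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Chunk:
    """Stand-in for a streamed response chunk with a `delta` field."""

    delta: str


def fake_stream() -> Iterator[Chunk]:
    for piece in ["The image ", "features ", "a man."]:
        yield Chunk(piece)


def consume(stream: Iterable[Chunk]) -> Tuple[str, Optional[float]]:
    """Return the full text and the seconds elapsed before the first chunk."""
    start = time.perf_counter()
    first_chunk_after: Optional[float] = None
    parts: List[str] = []
    for chunk in stream:
        if first_chunk_after is None:
            first_chunk_after = time.perf_counter() - start
        parts.append(chunk.delta)
    return "".join(parts), first_chunk_after


text, latency = consume(fake_stream())
```

The latency is `None` when the stream yields nothing, which keeps an empty response distinguishable from a fast one.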
