Evaluating Multi-Modal RAG¶

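In this notebook guide, we'll demonstrate how to evaluate a multi-modal RAG system. As in the text-only case, we will evaluate the retriever and the generator separately. As we noted in our blog post on evaluating multi-modal RAGs, our approach here applies adapted versions of the usual techniques for evaluating retrievers and generators in the text-only case. These adapted versions are part of the llama-index library (specifically, its evaluation module), and this notebook walks you through applying them to your own evaluation use cases.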

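NOTE: the use case and the evaluation conducted here are purely illustrative, intended only to demonstrate how to apply our evaluation tools to your specific needs. The results and analyses shown are by no means meant to be rigorous, though we believe our tools can help you bring a higher standard of care to your applications.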

%pip install llama-index-llms-openai
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-multi-modal-llms-replicate

# %pip install llama_index ftfy regex tqdm -q
# %pip install git+https://github.com/openai/CLIP.git -q
# %pip install torch torchvision -q
# %pip install matplotlib scikit-image -q
# %pip install -U qdrant_client -q

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

Use Case: Spelling In ASL¶

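Throughout this demonstration, our specific use case is using images and text descriptions to sign the letters of the American Sign Language (ASL) alphabet.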

The Query¶

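For this demonstration, we will use only one form of query. (This is not a representative use case, but the main focus here is to demonstrate applying the llama-index evaluation tools to perform an evaluation.)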

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."

The Dataset¶

Images

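The images were taken from the ASL-Alphabet Kaggle dataset. Note that they were modified so that each hand-gesture image simply carries a label of the associated letter. These modified images are what we use as context for user queries, and they can be downloaded from our Google Drive (see the cell below, where you can set download_notebook_data to True to download the dataset directly from this notebook).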

Text Context

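For the text context, we use descriptions of each hand gesture sourced from https://www.deafblind.com/asl.html. For convenience, we have stored these in a JSON file called asl_text_descriptions.json, which is included in the zip download from our Google Drive.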

#######################################################################
## This notebook guide makes several calls to gpt-4v, which is       ##
## heavily rate limited. For convenience, you should download data   ##
## files to avoid making such calls and still follow along with the  ##
## notebook. Unzip the zip file and store in a folder asl_data in    ##
## the same directory as this notebook.                              ##
#######################################################################

download_notebook_data = False
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q

First, we load the context images and text into ImageDocuments and Documents, respectively.

import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()

# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)
text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]

With our documents in hand, we can create our MultiModalVectorStoreIndex. To do so, we parse our Documents into nodes and then simply pass these nodes to the MultiModalVectorStoreIndex constructor.

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)

asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)

Another RAG System For Consideration (GPT-4V Image Descriptions For Retrieval)¶

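With the previous MultiModalVectorStoreIndex, the default embedding model for images is OpenAI CLIP. To draw a comparison with another RAG system (which is usually the reason for running a RAG evaluation), we will stand up a second RAG system that uses a different image embedding than the default.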

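In particular, we will prompt GPT-4V to write a text description of every image, then apply the usual text embeddings to these descriptions and associate the resulting embeddings with the images. That is, these text-description embeddings are what ultimately gets used for retrieval in this RAG system.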
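
To make the idea concrete, here is a minimal sketch of retrieving images through embeddings of their text descriptions. It is not the llama-index implementation: the bag-of-words embed function is only a stand-in for a real text embedding model, and the file paths and descriptions are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real text embedding model)."""
    return Counter(text.lower().replace(".", " ").replace(",", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical image paths mapped to hypothetical generated descriptions.
image_descriptions = {
    "images/A.jpg": "A closed fist with the thumb resting against the side, labelled A",
    "images/B.jpg": "A flat open hand with the thumb folded across the palm, labelled B",
}

# Each image is indexed by the embedding of its description, not by its pixels.
index = {path: embed(desc) for path, desc in image_descriptions.items()}


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Return the image paths whose description embeddings best match the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda path: cosine(q, index[path]), reverse=True)
    return ranked[:top_k]


print(retrieve("a closed fist with the thumb at the side"))
```

The design point is that the query is compared against text, so any text embedding model can be swapped in without changing how the images themselves are stored.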

#######################################################################
## Set load_previously_generated_text_descriptions to True if you    ##
## would rather use previously generated gpt-4v text descriptions    ##
## that are included in the .zip download                            ##
#######################################################################

load_previously_generated_text_descriptions = True

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import tqdm

if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(model="gpt-4o", max_new_tokens=300)

    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()

    # get text desc and save to text attr
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text

    # save so don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)

    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]

# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

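The keen reader will have noticed that we stored the text descriptions in the text field of each ImageDocument. As before, to create a MultiModalVectorStoreIndex we need to parse the ImageDocuments into ImageNodes and then pass the nodes to the constructor.

Note that when a MultiModalVectorStoreIndex is built from ImageNodes whose text fields are populated, we can choose to use this text to build the embeddings used for retrieval. To do so, we simply pass is_image_to_text=True to the constructor.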

image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)

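As in the text-only case, we need to "attach" a generator to our index (which can itself be used as a retriever) to finally assemble our RAG system. In the multi-modal case, however, our generators are multi-modal LLMs (often referred to as LMMs, for Large Multi-Modal Models). In this notebook, to draw even more comparisons across RAG systems, we will use GPT-4V as well as LLaVA. We "attach" a generator and obtain a queryable RAG interface by calling the index's as_query_engine method.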
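
Conceptually, a query engine is a retriever and a generator composed behind a single query call. The sketch below illustrates that composition in plain Python. It is a simplification and not how llama-index implements as_query_engine: make_query_engine, toy_retrieve, toy_generate, and the shortened template are all hypothetical names introduced here.

```python
from typing import Callable

# Simplified prompt template with the same two placeholders used in this notebook.
QA_TEMPLATE = "{context_str}\nQuery: {query_str}\nAnswer: "


def make_query_engine(
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose a retriever and a generator into a single query function."""

    def query(query_str: str) -> str:
        # Retrieve context, fill the prompt template, then call the generator.
        context = "\n".join(retrieve(query_str))
        prompt = QA_TEMPLATE.format(context_str=context, query_str=query_str)
        return generate(prompt)

    return query


# Stand-ins: a fixed "retriever" and a "generator" that just echoes its prompt.
def toy_retrieve(query_str: str) -> list[str]:
    return ["To sign A in ASL: (description text)"]


def toy_generate(prompt: str) -> str:
    return prompt


engine = make_query_engine(toy_retrieve, toy_generate)
print(engine("How can I sign a A?."))
```

Because the generator is just an argument, the same retriever can be paired with different models, which is what lets this notebook compare four systems built from two indexes and two LMMs.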

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.core import PromptTemplate

# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=300,
)

llava_mm_llm = ReplicateMultiModal(
    model="yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
    max_new_tokens=300,
)

# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_clip_llava": asl_index.as_query_engine(
        multi_modal_llm=llava_mm_llm,
        text_qa_template=qa_tmpl,
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_llava": asl_text_desc_index.as_query_engine(
        multi_modal_llm=llava_mm_llm, text_qa_template=qa_tmpl
    ),
}

# llava only supports 1 image per call at current moment
rag_engines["mm_clip_llava"].retriever.image_similarity_top_k = 1
rag_engines["mm_text_desc_llava"].retriever.image_similarity_top_k = 1

Let's take one of these systems for a test spin. To display the response nicely, we use the notebook utility function display_query_and_multimodal_response.

letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)

from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)

display_query_and_multimodal_response(query, response)

Query: How can I sign a R?.
=======
Retrieved Images:

[image]

=======
Response: To sign the letter "R" in American Sign Language (ASL), you would follow the instructions provided: the ring and little finger should be folded against the palm and held down by your thumb, while the index and middle finger are straight and crossed with the index finger in front to form the letter "R."
=======

Retriever Evaluation¶

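In this part of the notebook, we evaluate retrieval. Recall that we essentially have two multi-modal retrievers: one that uses the default CLIP image embeddings, and another that uses embeddings of the associated GPT-4V text descriptions. Before getting into a quantitative analysis of performance, we first visualize the top-1 retrievals of the text_desc_retriever (simply swap in clip_retriever if you want!) across all user queries asking to sign each ASL letter.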

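NOTE: since we are not sending the retrieved documents to LLaVA here, we can set image_similarity_top_k to a value greater than 1. When we perform the generation evaluation, the RAG engines that use LLaVA will again have to be the ones defined above in rag_engines, which have this parameter set to 1.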

# use as retriever
clip_retriever = asl_index.as_retriever(image_similarity_top_k=2)

# use as retriever
text_desc_retriever = asl_text_desc_index.as_retriever(
    image_similarity_top_k=2
)

Visuals¶

from llama_index.core.schema import TextNode, ImageNode

f, axarr = plt.subplots(3, 9)
f.set_figheight(6)
f.set_figwidth(15)
ix = 0
for jx, letter in enumerate(asl_text_descriptions.keys()):
    retrieval_results = text_desc_retriever.retrieve(
        QUERY_STR_TEMPLATE.format(symbol=letter)
    )
    image_node = None
    text_node = None
    for r in retrieval_results:
        if isinstance(r.node, TextNode):
            text_node = r
        if isinstance(r.node, ImageNode):
            image_node = r
            break

    img_path = image_node.node.image_path
    image = Image.open(img_path).convert("RGB")
    axarr[int(jx / 9), jx % 9].imshow(image)
    axarr[int(jx / 9), jx % 9].set_title(f"Query: {letter}")

plt.setp(axarr, xticks=[0, 100, 200], yticks=[0, 100, 200])
f.tight_layout()
plt.show()

As you can see, the retriever does a fairly decent job at top-1 retrieval. We now move on to a quantitative analysis of retriever performance.

Quantitative: Hit Rate and MRR¶

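In our blog (linked at the beginning of this notebook), we mentioned that a sensible approach to evaluating multi-modal retrievers is to compute the usual retrieval evaluation metrics on image and text retrieval separately. This of course leaves you with double the number of evaluation metrics compared to the text-only case, but it gives you the ability to debug your RAG/retriever in a more fine-grained fashion. If you want a single metric, then applying a weighted average with weights tailored to your needs is a reasonable choice.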
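
As a rough illustration of the single-metric option, the weighted average could be computed as in the sketch below. The function name combined_metric, the default weight, and the example scores are assumptions made for illustration; llama-index does not prescribe a particular weighting.

```python
def combined_metric(
    image_score: float, text_score: float, image_weight: float = 0.5
) -> float:
    """Weighted average of an image-retrieval metric and a text-retrieval metric."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    return image_weight * image_score + (1.0 - image_weight) * text_score


# Illustrative values only: an image hit rate of 0.8 and a text hit rate of 1.0,
# with image retrieval weighted at 0.75 because images matter more in this example.
print(combined_metric(0.8, 1.0, image_weight=0.75))
```

Keeping the two per-modality scores alongside the combined one preserves the ability to see which side of the retriever is responsible for a drop.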

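To carry all of this out, we use MultiModalRetrieverEvaluator, which is similar to its uni-modal counterpart, with the difference that it can handle image and text retrieval evaluation separately, which is exactly what we want to do here.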

from llama_index.core.evaluation import MultiModalRetrieverEvaluator

clip_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=clip_retriever
)

text_desc_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=text_desc_retriever
)

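One important thing to note when computing evaluations is that you usually need ground-truth data (sometimes also called labelled data). For retrieval, this labelled data takes the form of (query, expected_ids) pairs, where the former is the user query and the latter represents the nodes (identified by their IDs) that should be retrieved.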

For this guide, we wrote a specific helper function to build the LabelledQADataset object, which is precisely what we need.
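
The labelling idea behind that helper can be sketched without llama-index: a regular expression pulls the letter out of each image's file path, and the letter is used to build the query whose expected result is that node. The node IDs and file paths below are hypothetical stand-ins for real image nodes.

```python
import re
import uuid

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."

# Hypothetical (node_id, file_path) pairs standing in for image nodes.
nodes = [("node-1", "asl_data/images/A.jpg"), ("node-2", "asl_data/images/B.jpg")]

queries = {}
relevant_docs = {}
for node_id, file_path in nodes:
    # Capture the letter that names the image file, e.g. "A" in "A.jpg".
    match = re.search(r"([A-Z]+)\.jpg", file_path)
    if match:
        query_id = str(uuid.uuid4())
        queries[query_id] = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
        relevant_docs[query_id] = [node_id]

for query_id, query in queries.items():
    print(query, "->", relevant_docs[query_id])
```

The two dictionaries share query IDs as keys, which is the (query, expected_ids) pairing described above.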

import uuid
import re
from llama_index.core.evaluation import LabelledQADataset


def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a LabelledQADataset that provides the expected node IDs
    for every query.

    NOTE: this is specific to the ASL use-case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected ids pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]

    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )

# labelled dataset for image retrieval with asl_index.as_retriever()
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+)\.jpg)", image_nodes, "image"
)

# labelled dataset for text retrieval with asl_index.as_retriever()
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)

# labelled dataset for text-desc with asl_text_desc_index.as_retriever()
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+)\.jpg)", image_with_text_nodes, "image"
)

Now that we have our ground-truth data, we can invoke the evaluate_dataset method (or its async version, aevaluate_dataset) of our MultiModalRetrieverEvaluator.

eval_results_image = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_image
)
eval_results_text = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text
)
eval_results_text_desc = await text_desc_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text_desc
)

We'll also make use of another notebook utility function, get_retrieval_results_df, which nicely renders our evaluation results into a pandas DataFrame.

from llama_index.core.evaluation import get_retrieval_results_df

get_retrieval_results_df(
    names=["asl_index-image", "asl_index-text", "asl_text_desc_index"],
    results_arr=[
        eval_results_image,
        eval_results_text,
        eval_results_text_desc,
    ],
)

   retrievers            hit_rate   mrr
0  asl_index-image       0.814815   0.814815
1  asl_index-text        1.000000   1.000000
2  asl_text_desc_index   0.925926   0.925926

Observations¶

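As we can see, text retrieval with the asl_index retriever is perfect. This is to be expected, since the QUERY_STR_TEMPLATE and the text_format_str used to create the texts stored in text_nodes are very similar.
CLIP embeddings for images do fairly well, though in this case the embedding representations derived from the GPT-4V text descriptions appear to lead to better retrieval performance.
Interestingly, both retrievers place the correct image in the first position whenever they retrieve it, which is why hit_rate and mrr are equal for both.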
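
The last observation follows directly from how the two metrics are defined. The sketch below uses the textbook definitions of hit rate and reciprocal rank (the llama-index implementations may differ in detail) on invented retrieval results in which every hit lands at rank 1.

```python
def hit_rate(retrieved: list[str], expected: list[str]) -> float:
    """1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    return 1.0 if any(doc_id in expected for doc_id in retrieved) else 0.0


def reciprocal_rank(retrieved: list[str], expected: list[str]) -> float:
    """1 / rank of the first expected ID in the retrieved list, or 0.0 on a miss."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical results: (retrieved IDs in rank order, expected IDs).
results = [
    (["a", "x"], ["a"]),  # hit at rank 1
    (["y", "z"], ["b"]),  # miss
    (["c", "w"], ["c"]),  # hit at rank 1
]

mean_hit_rate = sum(hit_rate(r, e) for r, e in results) / len(results)
mrr = sum(reciprocal_rank(r, e) for r, e in results) / len(results)
print(mean_hit_rate, mrr)
```

A hit at rank 1 contributes 1.0 to both averages and a miss contributes 0.0 to both, so the two metrics coincide. A hit at rank 2 would contribute 1.0 to the hit rate but only 0.5 to MRR, and the metrics would diverge.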

Generation Evaluation¶

Let's now move on to evaluating the generated responses. For this, we consider the 4 multi-modal RAG systems we built earlier:

mm_clip_gpt4v = Multi-modal RAG with CLIP image encoder, lmm = GPT-4V using both image_nodes and text_nodes
mm_clip_llava = Multi-modal RAG with CLIP image encoder, lmm = LLaVA using both image_nodes and text_nodes
mm_text_desc_gpt4v = Multi-modal RAG with text-desc + ada image encoder, lmm = GPT-4V using both image_with_text_nodes and text_nodes
mm_text_desc_llava = Multi-modal RAG with text-desc + ada image encoder, lmm = LLaVA using both image_with_text_nodes and text_nodes

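As in the case of retriever evaluation, we now need ground-truth data for evaluating the generated responses. (Note that not all evaluation methods require ground truth, but we'll be using "Correctness", which requires a reference answer to compare the generated one against.)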

Reference (Ground-Truth) Data¶

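For this, we sourced another set of text descriptions of the ASL hand gestures. We found these to be more descriptive and felt they would make good reference answers to our ASL queries. The source is https://www.signingtime.com/dictionary/category/letters/; the descriptions were scraped and stored in human_responses.json, which is again included in the data zip download linked at the very beginning of this notebook.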

# references (ground-truth) for our answers
with open("asl_data/human_responses.json") as json_file:
    human_answers = json.load(json_file)

Generate Responses To ALL Queries For Each System¶

Now we will loop through all of the queries and pass each one to all 4 RAGs (i.e., through the QueryEngine.query() interface).
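
Each entry collected by that loop is a nested dictionary keyed first by query and then by RAG system. As a rough guide to its shape, here is one hypothetical entry. The keys mirror those used in this notebook's code, while the response text, file path, and source text are invented placeholders.

```python
# A hypothetical entry; the response text and sources are invented placeholders.
data_entry = {
    "query": "How can I sign a A?.",
    "responses": {
        "mm_clip_gpt4v": {
            "response": "(generated answer text)",
            "sources": {
                "images": ["asl_data/images/A.jpg"],
                "texts": ["To sign A in ASL: (description text)"],
            },
        },
    },
}

# Walk the per-system responses and report how many sources each one used.
for rag_name, rag_response in data_entry["responses"].items():
    n_images = len(rag_response["sources"]["images"])
    n_texts = len(rag_response["sources"]["texts"])
    print(rag_name, n_images, "image source(s),", n_texts, "text source(s)")
```

Storing file paths and raw text rather than node objects keeps the record JSON-serializable, so expensive responses can be saved and reloaded.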

#######################################################################
## Set load_previous_responses to True if you would rather use       ##
## previously generated responses for all rags. The json is part of  ##
## the .zip download                                                 ##
#######################################################################

load_previous_responses = True

import time
import tqdm

if not load_previous_responses:
    response_data = []
    for letter in tqdm.tqdm(asl_text_descriptions.keys()):
        data_entry = {}
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        data_entry["query"] = query

        responses = {}
        for name, engine in rag_engines.items():
            this_response = {}
            result = engine.query(query)
            this_response["response"] = result.response

            sources = {}
            source_image_nodes = []
            source_text_nodes = []

            # image sources
            source_image_nodes = [
                score_img_node.node.metadata["file_path"]
                for score_img_node in result.metadata["image_nodes"]
            ]

            # text sources
            source_text_nodes = [
                score_text_node.node.text
                for score_text_node in result.metadata["text_nodes"]
            ]

            sources["images"] = source_image_nodes
            sources["texts"] = source_text_nodes
            this_response["sources"] = sources

            responses[name] = this_response
        data_entry["responses"] = responses
        response_data.append(data_entry)

    # save expensive gpt-4v responses
    with open("expensive_response_data.json", "w") as json_file:
        json.dump(response_data, json_file)
else:
    # load up previously saved responses
    with open("asl_data/expensive_response_data.json") as json_file:
        response_data = json.load(json_file)

Correctness, Faithfulness, Relevancy¶

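With the generated responses in hand (stored in a custom data object tailored to this ASL use case, namely response_data), we can now compute evaluation metrics for them: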

Correctness (LLM-As-A-Judge)
Faithfulness (LMM-As-A-Judge)
Relevancy (LMM-As-A-Judge)

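To compute all three, we prompt another generative model to provide a score assessing each of the respective criteria. For Correctness, since we don't consider context, the judge is an LLM. In contrast, computing Faithfulness and Relevancy requires passing in the context, meaning both the images and the text that were originally supplied to the RAG to generate the response. Because both images and text must be passed in, the judges for Faithfulness and Relevancy must be LMMs (multi-modal LLMs).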

We have these abstractions in our evaluation module, and we will demonstrate their usage while looping over all of the generated responses.
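
Once the judges have scored every response, the per-response scores still need to be summarized per RAG system. The sketch below shows one simple way to do that, assuming the results are kept as parallel lists of system names and numeric scores. The names and scores here are invented and do not come from this notebook's evaluations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: parallel lists of system names and scores.
names = ["mm_clip_gpt4v", "mm_clip_llava", "mm_clip_gpt4v", "mm_clip_llava"]
correctness_scores = [4.5, 3.0, 3.5, 2.0]

# Group the scores by the system that produced each response.
scores_by_system = defaultdict(list)
for name, score in zip(names, correctness_scores):
    scores_by_system[name].append(score)

# Average within each group to get one number per system.
mean_correctness = {
    name: mean(scores) for name, scores in scores_by_system.items()
}
print(mean_correctness)
```

The same grouping applies unchanged to faithfulness and relevancy scores, giving one comparable row of averages per system.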

from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

import os

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4o",
        max_new_tokens=300,
    )
)

#######################################################################
## This section of the notebook can make a total of ~200 GPT-4V      ##
## calls, and GPT-4V is heavily rate limited (100 per day). To       ##
## follow along with previously generated evaluations, set           ##
## load_previous_evaluations to True. To test out the evaluation     ##
## execution, set number_evals to any number between 1 and 27. The   ##
## json is part of the .zip download.                                ##
#######################################################################

load_previous_evaluations = True
number_evals = 27

if not load_previous_evaluations:
    evals = {
        "names": [],
        "correctness": [],
        "relevancy": [],
        "faithfulness": [],
    }

    # loop through all responses and evaluate them
    for data_entry in tqdm.tqdm(response_data[:number_evals]):
        reg_ex = r"How can I sign a ([A-Z]+)\?"
        match = re.search(reg_ex, data_entry["query"])

        batch_names = []
        batch_correctness = []
        batch_relevancy = []
        batch_faithfulness = []
        if match:
            letter = match.group(1)
            reference_answer = human_answers[letter]
            for rag_name, rag_response_data in data_entry["responses"].items():
                correctness_result = await judges["correctness"].aevaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    reference=reference_answer,
                )

                relevancy_result = judges["relevancy"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                faithfulness_result = judges["faithfulness"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                batch_names.append(rag_name)
                batch_correctness.append(correctness_result)
                batch_relevancy.append(relevancy_result)
                batch_faithfulness.append(faithfulness_result)

            evals["names"] += batch_names
            evals["correctness"] += batch_correctness
            evals["relevancy"] += batch_relevancy
            evals["faithfulness"] += batch_faithfulness

    # save evaluations
    evaluations_objects = {
        "names": evals["names"],
        "correctness": [e.dict() for e in evals["correctness"]],
        "faithfulness": [e.dict() for e in evals["faithfulness"]],
        "relevancy": [e.dict() for e in evals["relevancy"]],
    }
    with open("asl_data/evaluations.json", "w") as json_file:
        json.dump(evaluations_objects, json_file)
else:
    from llama_index.core.evaluation import EvaluationResult

    # load up previously saved evaluations
    with open("asl_data/evaluations.json") as json_file:
        evaluations_objects = json.load(json_file)

    evals = {}
    evals["names"] = evaluations_objects["names"]
    evals["correctness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["correctness"]
    ]
    evals["faithfulness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["faithfulness"]
    ]
    evals["relevancy"] = [
        EvaluationResult.parse_obj(e) for e in evaluations_objects["relevancy"]
    ]
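
The save-and-reload pattern above can be tried in isolation. The following is a minimal sketch, not the library's implementation: `EvalRecord`, `save_evals`, and `load_evals` are hypothetical names introduced here. A plain dataclass stands in for `EvaluationResult` and `asdict` stands in for `.dict()`, so the sketch runs without llama-index or any API calls.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """Hypothetical stand-in for EvaluationResult (illustration only)."""

    query: str
    response: str
    score: Optional[float] = None
    feedback: Optional[str] = None


def save_evals(path, names, records):
    """Serialize names and records to JSON, mirroring the save branch above."""
    payload = {"names": names, "correctness": [asdict(r) for r in records]}
    with open(path, "w") as json_file:
        json.dump(payload, json_file)


def load_evals(path):
    """Rebuild the records from JSON, mirroring the load branch above."""
    with open(path) as json_file:
        payload = json.load(json_file)
    return payload["names"], [EvalRecord(**e) for e in payload["correctness"]]


# Round-trip one made-up record through a temporary file.
names = ["rag_a"]
records = [
    EvalRecord("How can I sign a A?.", "example response", 4.5, "example feedback")
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "evaluations.json")
    save_evals(path, names, records)
    loaded_names, loaded_records = load_evals(path)
```

Caching the judge outputs this way means the rate-limited evaluation calls only need to run once; later sessions can reload the JSON file instead.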

To view these results, we again use the notebook utility function get_eval_results_df.

In [ ]:

from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    evals["names"], evals["correctness"], metric="correctness"
)
_, mean_relevancy_df = get_eval_results_df(
    evals["names"], evals["relevancy"], metric="relevancy"
)
_, mean_faithfulness_df = get_eval_results_df(
    evals["names"], evals["faithfulness"], metric="faithfulness"
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
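
To see what the concatenation above produces, here is a small sketch on toy data. It assumes, based on the output table in this notebook, that each mean frame has a single row labelled with the metric and one column per RAG system. The RAG names `rag_a` and `rag_b`, the row labels, and the scores are made up for illustration.

```python
import pandas as pd

# Made-up scores for two hypothetical RAG systems (not results from this notebook).
rags = ["rag_a", "rag_b"]
mean_correctness_df = pd.DataFrame(
    [[4.0, 3.5]], columns=rags, index=["mean_correctness_score"]
)
mean_relevancy_df = pd.DataFrame(
    [[0.75, 0.5]], columns=rags, index=["mean_relevancy_score"]
)
mean_faithfulness_df = pd.DataFrame(
    [[1.0, 0.25]], columns=rags, index=["mean_faithfulness_score"]
)

# reset_index() moves each metric label into a column named "index",
# so the rows can be stacked with a fresh 0..n-1 row index.
stacked = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)

# Promote the labels back to the row index and name that index "metrics".
toy_scores_df = stacked.set_index("index")
toy_scores_df.index = toy_scores_df.index.set_names(["metrics"])
print(toy_scores_df)
```

The result has one row per metric and one column per RAG system, which makes side-by-side comparison straightforward.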

In [ ]:

print(deep_eval_df[:4])

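Out[ ]:

	rag	query	scores	feedbacks
0	mm_clip_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It accurately describes how to sign the letter 'A' in ASL, which matches the user query. However, it includes unnecessary information about the image, which was not mentioned in the user query, and this slightly lowers its overall correctness.
1	mm_clip_llava	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It provides the necessary steps to sign the letter 'A' in ASL, but it lacks the additional information about hand position and the difference between 'A' and 'S' that the reference answer provides.
2	mm_text_desc_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It clearly describes how to sign the letter 'A' in American Sign Language, which matches the reference answer. However, it opens with an unnecessary sentence about the lack of images, which is not relevant to the user's query.
3	mm_text_desc_llava	How can I sign a A?.	4.500000	The generated answer is relevant and almost completely correct. It accurately describes how to sign the letter 'A' in American Sign Language. However, it lacks the details about hand position (shoulder height, palm facing out) that the reference answer includes.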

In [ ]:

mean_scores_df

Out[ ]:

rag	mm_clip_gpt4v	mm_clip_llava	mm_text_desc_gpt4v	mm_text_desc_llava
metrics
mean_correctness_score	3.685185	4.092593	3.722222	3.870370
mean_relevancy_score	0.777778	0.851852	0.703704	0.740741
mean_faithfulness_score	0.777778	0.888889	0.851852	0.851852

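Observations

The RAGs that use LLaVA appear to score better on correctness, relevancy, and faithfulness than those that use GPT-4V.
After inspecting some of the responses, we noticed that GPT-4V gave the following answer for SPACE even though the image was retrieved correctly: "I'm sorry, but I'm unable to answer your query based on the images provided, as the system does not currently allow me to visually analyze images. However, according to the context provided, to sign 'SPACE' in ASL, you should hold your palm facing up, with your fingers curled upward and your thumb pointing up."
Generated responses like this one may be why the judges did not score GPT-4V's answers higher than LLaVA's. A deeper analysis would require a closer look at the generated responses, and possibly adjustments to the generation prompts or even the evaluation prompts.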
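
The first observation can be checked with simple arithmetic on the mean scores reported above. The sketch below copies those numbers into a plain dictionary and averages them per generator model. Grouping by the suffix of the RAG name and averaging across the two retrieval setups are choices made here for illustration, and `by_generator` is a hypothetical helper, not part of llama-index.

```python
from statistics import mean

# Mean scores copied from the table above (metric -> RAG system -> score).
mean_scores = {
    "correctness": {
        "mm_clip_gpt4v": 3.685185,
        "mm_clip_llava": 4.092593,
        "mm_text_desc_gpt4v": 3.722222,
        "mm_text_desc_llava": 3.870370,
    },
    "relevancy": {
        "mm_clip_gpt4v": 0.777778,
        "mm_clip_llava": 0.851852,
        "mm_text_desc_gpt4v": 0.703704,
        "mm_text_desc_llava": 0.740741,
    },
    "faithfulness": {
        "mm_clip_gpt4v": 0.777778,
        "mm_clip_llava": 0.888889,
        "mm_text_desc_gpt4v": 0.851852,
        "mm_text_desc_llava": 0.851852,
    },
}


def by_generator(scores):
    """Average scores per generator, taken as the last part of the RAG name."""
    groups = {}
    for rag_name, score in scores.items():
        generator = rag_name.rsplit("_", 1)[-1]  # "gpt4v" or "llava"
        groups.setdefault(generator, []).append(score)
    return {generator: mean(values) for generator, values in groups.items()}


per_generator = {metric: by_generator(s) for metric, s in mean_scores.items()}
for metric, averages in per_generator.items():
    print(metric, {g: round(v, 4) for g, v in averages.items()})
```

On these numbers the LLaVA average is higher for all three metrics. The gaps are small, though, so they are best read as a prompt for closer inspection of individual responses rather than as a firm ranking.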

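Summary

In this notebook, we demonstrated how to evaluate both the retriever and the generator of a multi-modal RAG. Specifically, we applied existing llama-index evaluation tools to the ASL use case to illustrate how they can be applied to your own evaluation needs. Note that multi-modal LLMs should still be considered beta, and a special standard of care should be applied if they are used in production systems to evaluate multi-modal responses.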
