AIMon

AIMon's LlamaIndex Extension for LLM Response Evaluation¶

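This notebook introduces AIMon's evaluators for the LlamaIndex framework. They assess the quality and accuracy of responses generated by large language models (LLMs) integrated into LlamaIndex. The available evaluators are: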

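Hallucination Evaluator: detects cases where the model generates information that is not supported by the provided context (hallucinations).
Guideline Evaluator: checks that model responses follow predefined instructions and guidelines.
Completeness Evaluator: checks whether the response fully addresses every aspect of the query or task.
Conciseness Evaluator: assesses whether the response is brief yet complete, without unnecessary verbosity.
Toxicity Evaluator: flags harmful, offensive, or inappropriate language in the response.
Context Relevance Evaluator: assesses how relevant and accurate the provided context is in supporting the model's response.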

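This notebook focuses on using the Hallucination Evaluator, Guideline Evaluator, and Context Relevance Evaluator to evaluate your RAG (Retrieval-Augmented Generation) application.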

To learn more about AIMon, see the website and documentation.

Prerequisites¶

Let's start by installing the dependencies and setting up the API keys.

%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

Configure your OPENAI_API_KEY and AIMON_API_KEY in Google Colab secrets and grant them notebook access. We will use OpenAI for the LLM and the embedding model, and AIMon for continuous monitoring of quality issues.

You can get an AIMon API key here.

import os
import json

# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
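
The `google.colab` module is only available inside Colab. As a minimal sketch for running the same steps elsewhere, the helper below reads a key from environment variables instead. The function name `read_api_key` and its error message are hypothetical and are not part of AIMon or LlamaIndex.

```python
import os


def read_api_key(name, env=None):
    """Return the key stored under `name`, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; define it before running the notebook.")
    return value


# Example with an explicit mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": " sk-demo "}
print(read_api_key("OPENAI_API_KEY", demo_env))
```

Failing early with a named key tends to be easier to debug than an authentication error raised later by a client library.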

Dataset for Evaluation¶

In this example, we use transcripts from the MeetingBank dataset [1] as the context information.

%%capture
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")

This function takes the transcripts and converts them into a list of llama_index.core.Document objects.

from llama_index.core import Document

def extract_and_create_documents(transcripts):
    documents = []

    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)

        except Exception as e:
            # Report the underlying error instead of discarding it.
            print(f"Failed to create document: {e}")

    return documents

transcripts = [meeting["transcript"] for meeting in meetingbank["train"]]
documents = extract_and_create_documents(
    transcripts[:5]
)  ## Using only 5 transcripts to keep this example fast and concise.

Set up an embedding model. We will use the text-embedding-3-small model.

from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=100, max_retries=3
)

Split the documents into nodes and generate their embeddings.

from aimon_llamaindex import generate_embeddings_for_docs

nodes = generate_embeddings_for_docs(documents, embedding_model)
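
The internals of `generate_embeddings_for_docs` are not shown in this notebook. As a rough illustration of what splitting a document into nodes means, the sketch below cuts a text into overlapping word windows. The function `split_into_chunks` and the parameters `chunk_size` and `overlap` are hypothetical and are not the helper's actual strategy or defaults.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split `text` into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(split_into_chunks(sample, chunk_size=4, overlap=1))
# prints ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some words twice.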

Insert the nodes with their embeddings into an in-memory vector store index.

from aimon_llamaindex import build_index

index = build_index(nodes)

Instantiate a vector index retriever.

from aimon_llamaindex import build_retriever

retriever = build_retriever(index, similarity_top_k=5)
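
`similarity_top_k=5` asks the retriever for the five nodes most similar to the query. The sketch below shows the idea on toy vectors, assuming cosine similarity as the metric (an assumption; the source does not state which metric this setup uses). The functions `cosine_similarity` and `top_k` are illustrative and are not part of `aimon_llamaindex`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=5):
    """Return (node_index, score) pairs for the k highest-scoring nodes, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_nodes = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for index, score in top_k([1.0, 0.0], toy_nodes, k=2):
    print(index, round(score, 3))
```

A larger `k` gives the LLM more context to draw on but also admits less relevant nodes, which is what the context relevance evaluation later in the notebook measures.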

Building the LLM Application¶

Configure the large language model. Here we use OpenAI's gpt-4o-mini model with the temperature set to 0.4.

## OpenAI's LLM
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="""
                    Please be professional and polite.
                    Answer the user's question in a single line.
                    Even if the context lacks information to answer the question, make
                    sure that you answer the user's question based on your own knowledge.
                    """,
)

Define your query and instructions.

user_query = "Which council bills were amended for zoning regulations?"
user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]

Update the LLM's system prompt with the dynamically defined user instructions.

llm.system_prompt += (
    f"Please comply with the following instructions {user_instructions}."
)
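
Interpolating a Python list into an f-string inserts its `repr`, brackets and quotes included. A possible alternative is to render the instructions as a numbered list before appending them to the prompt. The helper `format_instructions` is hypothetical, and whether the cleaner formatting changes the model's behavior is not tested here.

```python
def format_instructions(instructions):
    """Render instructions as numbered lines, one per instruction."""
    return "\n".join(
        f"{number}. {text}" for number, text in enumerate(instructions, start=1)
    )


user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]

# What the f-string in the notebook produces: the list's repr.
print(f"{user_instructions}")

# The numbered alternative.
print(format_instructions(user_instructions))
```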

Retrieve a response for the query.

from aimon_llamaindex import get_response

llm_response = get_response(user_query, retriever, llm)

Running Evaluations with AIMon¶

Configure the AIMon client.

from aimon import Client

aimon_client = Client(
    auth_header="Bearer {}".format(userdata.get("AIMON_API_KEY"))
)

Using AIMon's Instruction Adherence Model (a.k.a. the Guideline Evaluator)

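This model evaluates whether the generated text adheres to the given instructions. Checking adherence helps ensure that LLMs follow the user's guidelines and intent across tasks, which leads to more accurate and relevant output.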

from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(
    user_query, llm_response, user_instructions
)

print(json.dumps(evaluation_result, indent=4))

{
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": true
        }
    ],
    "score": 1.0
}
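
A result shaped like the output above can be checked programmatically. The sketch below lists the instructions that were not followed; it assumes the result is a plain dict with the keys shown in the sample output. The helper `unfollowed_instructions` and its `threshold` parameter are hypothetical and are not part of the AIMon API.

```python
def unfollowed_instructions(result, threshold=0.5):
    """Return the instructions whose label is false or whose probability is below `threshold`."""
    flagged = []
    for item in result.get("instructions_list", []):
        followed = bool(item.get("label", False))
        probability = item.get("follow_probability", 0.0)
        if not followed or probability < threshold:
            flagged.append(item["instruction"])
    return flagged


# Sample mirroring the structure of the output shown above.
sample_result = {
    "extractions": [],
    "instructions_list": [
        {
            "explanation": "",
            "follow_probability": 0.982,
            "instruction": "Keep the response concise, preferably under the 100 word limit.",
            "label": True,
        }
    ],
    "score": 1.0,
}

print(unfollowed_instructions(sample_result))  # prints []
```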

Using AIMon's Hallucination Detection Evaluator Model (HDM-2)

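AIMon's HDM-2 detects hallucinated content in LLM output. It provides a "hallucination score" (0.0–1.0) that quantifies the likelihood of factual inaccuracies or fabricated information, which helps you judge how reliable a response is.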

from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, llm_response)

## Printing the initial evaluation result for Hallucination
print(json.dumps(evaluation_result, indent=4))

{
    "is_hallucinated": "False",
    "score": 0.22446,
    "sentences": [
        {
            "score": 0.22446,
            "text": "The council bills amended for zoning regulations include the small lot moratorium and the text amendment related to off-street parking exemptions for preexisting small lots. These amendments aim to balance the interests of local neighborhoods, health institutions, and developers."
        }
    ]
}
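
In the output above, `is_hallucinated` is the string "False" rather than a boolean. Any non-empty string is truthy in Python, so `if result["is_hallucinated"]:` would take the branch even for "False". The sketch below normalizes the flag, assuming the result is a dict with the keys shown in the sample output. The helper `is_hallucinated` and its fallback `threshold` are hypothetical and are not part of the AIMon API.

```python
def is_hallucinated(result, threshold=0.5):
    """Interpret the `is_hallucinated` field whether it arrives as a bool or a string.

    If the field is missing, fall back to comparing `score` against `threshold`.
    """
    flag = result.get("is_hallucinated")
    if isinstance(flag, bool):
        return flag
    if isinstance(flag, str):
        return flag.strip().lower() == "true"
    return result.get("score", 0.0) >= threshold


sample_result = {"is_hallucinated": "False", "score": 0.22446, "sentences": []}

print(bool(sample_result["is_hallucinated"]))  # prints True (the pitfall)
print(is_hallucinated(sample_result))          # prints False
```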

Use AIMon's Context Relevance Evaluator to assess the relevance of the context data the LLM used to generate the response.

from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = (
    "Find the relevance of the context data used to generate this response."
)
evaluation_result = evaluator.evaluate(
    user_query, llm_response, task_definition
)

print(json.dumps(evaluation_result, indent=4))

[
    {
        "explanations": [
            "Document 1 discusses a council bill related to zoning regulations, specifically mentioning a text amendment that aims to balance neighborhood interests with developer needs. However, it primarily focuses on parking issues and personal experiences rather than detailing specific zoning regulation amendments or the council bills directly related to them, which makes it less relevant to the query.",
            "2. Document 2 mentions zoning and development issues, including the need for mass transit and affordability, but it does not provide specific information on which council bills were amended for zoning regulations. The discussion is more about general concerns regarding development and transportation rather than direct references to zoning amendments.",
            "3. Document 3 touches on zoning laws and amendments but does not specify which council bills were amended for zoning regulations. While it discusses the context of zoning and housing, it lacks concrete details that directly answer the query about specific bills.",
            "4. Document 4 discusses broader issues about affordable housing and transportation without directly addressing any specific council bills or amendments related to zoning regulations. The focus is on general priorities and funding rather than specific legislative changes, making it less relevant to the query.",
            "5. Document 5 mentions support for a zoning code amendment regarding parking exemptions for small lots, which is somewhat related to zoning regulations. However, it does not provide specific details about the council bills amended for zoning regulations, thus failing to fully address the query."
        ],
        "query": "Which council bills were amended for zoning regulations?",
        "relevance_scores": [
            40.5,
            40.25,
            44.25,
            38.5,
            43.0
        ]
    }
]
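
To see at a glance which retrieved documents scored best, the scores can be paired with their document numbers and sorted. The sketch below assumes the result is a list containing one dict shaped like the output above, and that `relevance_scores` and `explanations` are aligned by position (an assumption drawn from the sample output). The helper `rank_documents` is hypothetical.

```python
def rank_documents(result):
    """Return (document_number, score) pairs sorted from most to least relevant."""
    scores = result["relevance_scores"]
    numbered = list(enumerate(scores, start=1))
    return sorted(numbered, key=lambda pair: pair[1], reverse=True)


# Scores copied from the output shown above; explanations omitted for brevity.
sample_result = [
    {
        "query": "Which council bills were amended for zoning regulations?",
        "relevance_scores": [40.5, 40.25, 44.25, 38.5, 43.0],
    }
]

for document_number, score in rank_documents(sample_result[0]):
    print(f"Document {document_number}: {score}")
```

For the scores above, Document 3 ranks first and Document 4 last, which matches the explanations: none of the retrieved transcripts names the amended bills directly.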

Conclusion¶

In this notebook, we built a simple RAG application using the LlamaIndex framework. After retrieving a response to the query, we evaluated it with AIMon's evaluators.

References¶

[1] Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu, "MeetingBank: A Benchmark Dataset for Meeting Summarization," arXiv, May 2023. [Online]. Available: https://arxiv.org/abs/2305.17529. Accessed: Jan. 16, 2025.