RAGChecker: A Fine-Grained Evaluation Framework for Diagnosing RAG

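RAGChecker is a comprehensive evaluation framework designed for Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics that assess both the retrieval and the generation components of a RAG system, giving detailed insight into where performance is gained or lost.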

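Key features of RAGChecker include:

- Fine-grained analysis using claim-level entailment checking
- Comprehensive metrics for overall performance, retriever efficiency, and generator accuracy
- Actionable insights for improving RAG systems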

For more information, visit the RAGChecker GitHub repository.

RAGChecker Metrics

RAGChecker provides a comprehensive set of metrics for evaluating different aspects of a RAG system:

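Overall metrics
- Precision: the proportion of claims in the model's response that are correct.
- Recall: the proportion of ground truth claims covered by the model's response.
- F1 Score: the harmonic mean of precision and recall.
Retriever metrics
- Claim Recall: the proportion of ground truth claims covered by the retrieved chunks.
- Context Precision: the proportion of retrieved chunks that are relevant.
Generator metrics
- Context Utilization: how well the generator uses the relevant information in the retrieved chunks.
- Noise Sensitivity: the generator's tendency to include incorrect information from the retrieved chunks (reported separately for relevant and irrelevant chunks in the sample output).
- Hallucination: the proportion of incorrect claims that are not found in any retrieved chunk.
- Self-knowledge: the proportion of correct claims that are not found in any retrieved chunk.
- Faithfulness: how closely the generator's response aligns with the retrieved chunks.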

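Together, these metrics give a nuanced evaluation of both the retrieval and the generation components, which makes it possible to target improvements at the part of the RAG system that needs them.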
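To make the overall and retriever definitions above concrete, here is a minimal sketch that computes them from hand-labeled claim judgments. The helper names (`ratio`, `f1_score`) and the toy labels are illustrative assumptions, not RAGChecker's implementation; in the real framework an LLM extracts the claims and checks entailment.

```python
def ratio(flags):
    """Return the percentage of True values in a list of booleans."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical judgments for a single query.
# One entry per claim in the response: is it supported by the ground truth answer?
response_claim_correct = [True, True, False]
# One entry per ground truth claim: is it covered by the response?
gt_claim_in_response = [True, False, False, True]
# One entry per ground truth claim: is it covered by any retrieved chunk?
gt_claim_in_chunks = [True, True, False, True]
# One entry per retrieved chunk: is the chunk relevant?
chunk_is_relevant = [True, False]

precision = ratio(response_claim_correct)
recall = ratio(gt_claim_in_response)
f1 = f1_score(precision, recall)
claim_recall = ratio(gt_claim_in_chunks)
context_precision = ratio(chunk_is_relevant)

print(f"precision={precision:.1f} recall={recall:.1f} f1={f1:.1f}")
print(f"claim_recall={claim_recall:.1f} context_precision={context_precision:.1f}")
```

In this toy case claim recall is higher than recall: the retriever found three of the four ground truth claims, but the generator carried only two of them into the response, which points at the generator rather than the retriever.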

Install Dependencies

%pip install -qU ragchecker llama-index

Setup and Imports

First, import the required libraries:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

Create a LlamaIndex Query Engine

Next, create a simple LlamaIndex query engine over a sample dataset:

# Load documents
documents = SimpleDirectoryReader("path/to/your/documents").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
rag_application = index.as_query_engine()

Using RAGChecker with LlamaIndex

The response_to_rag_results function converts LlamaIndex output into the format RAGChecker expects:

# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."


# Get response from LlamaIndex
response_object = rag_application.query(user_query)

# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)

# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)
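For orientation, the sketch below builds a record of the kind that the conversion step produces. The field names (`query_id`, `query`, `gt_answer`, `response`, `retrieved_context`, `doc_id`, `text`) are an assumption based on RAGChecker's documented input format, and `make_rag_result` is a hypothetical helper, not part of either library; the actual output of response_to_rag_results may differ in details such as how document IDs are assigned.

```python
import json


def make_rag_result(query_id, query, gt_answer, response, chunks):
    """Build one evaluation record (hypothetical helper; field names are assumed).

    `chunks` is a list of retrieved text strings.
    """
    return {
        "query_id": query_id,
        "query": query,
        "gt_answer": gt_answer,
        "response": response,
        # One entry per retrieved chunk, with a positional ID as a placeholder.
        "retrieved_context": [
            {"doc_id": str(i), "text": text} for i, text in enumerate(chunks)
        ],
    }


record = make_rag_result(
    query_id="0",
    query="What is RAGChecker?",
    gt_answer="RAGChecker is an evaluation framework for RAG systems.",
    response="RAGChecker evaluates RAG systems.",
    chunks=["RAGChecker is a framework for diagnosing RAG systems."],
)

# A batch of records is wrapped under a top-level "results" key,
# matching the dictionary passed to RAGResults.from_dict above.
payload = {"results": [record]}
print(json.dumps(payload, indent=2))
```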

Evaluating with RAGChecker

With the results converted to the correct format, we can now evaluate them with RAGChecker:

# Initialize RAGChecker
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)

# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)

# Print detailed results
print(rag_results)

The output will look like this:

RAGResults(
  1 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 66.7,
      "recall": 27.3,
      "f1": 38.7
    },
    "retriever_metrics": {
      "claim_recall": 54.5,
      "context_precision": 100.0
    },
    "generator_metrics": {
      "context_utilization": 16.7,
      "noise_sensitivity_in_relevant": 0.0,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 33.3,
      "self_knowledge": 0.0,
      "faithfulness": 66.7
    }
  }
)

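This output gives a comprehensive view of the RAG system's performance, covering the overall, retriever, and generator metrics described in the metrics section above. All values are percentages.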
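As a quick sanity check on the sample output, the reported F1 can be reproduced from the reported precision and recall using the harmonic mean. This is a small worked example on the numbers shown above, not a call into RAGChecker.

```python
# Values copied from the "overall_metrics" block of the sample output.
overall = {"precision": 66.7, "recall": 27.3, "f1": 38.7}

p, r = overall["precision"], overall["recall"]

# Harmonic mean: 2 * P * R / (P + R)
f1 = 2 * p * r / (p + r)

print(round(f1, 1))  # 38.7
```

The low F1 here is driven by recall: the response is mostly correct (precision 66.7) but covers only about a quarter of the ground truth claims.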

Selecting Specific Metric Groups

Instead of evaluating every metric with all_metrics, you can select specific metric groups:

from ragchecker.metrics import (
    overall_metrics,
    retriever_metrics,
    generator_metrics,
)

Selecting Individual Metrics

For finer-grained control, you can select individual metrics as needed:

from ragchecker.metrics import (
    precision,
    recall,
    f1,
    claim_recall,
    context_precision,
    context_utilization,
    noise_sensitivity_in_relevant,
    noise_sensitivity_in_irrelevant,
    hallucination,
    self_knowledge,
    faithfulness,
)
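The relationship between metric groups and individual metrics can be pictured as a simple name expansion. The sketch below is a hypothetical illustration: `METRIC_GROUPS` and `expand_metrics` are made-up names, and the grouping is taken from the sample output earlier in this notebook rather than from RAGChecker's source code.

```python
# Grouping as it appears in the sample evaluation output (assumed, not imported).
METRIC_GROUPS = {
    "overall_metrics": ["precision", "recall", "f1"],
    "retriever_metrics": ["claim_recall", "context_precision"],
    "generator_metrics": [
        "context_utilization",
        "noise_sensitivity_in_relevant",
        "noise_sensitivity_in_irrelevant",
        "hallucination",
        "self_knowledge",
        "faithfulness",
    ],
}


def expand_metrics(selection):
    """Expand a mix of group names and metric names into a de-duplicated list.

    Order of first appearance is preserved.
    """
    names = []
    for item in selection:
        # A group name expands to its members; anything else is kept as-is.
        for name in METRIC_GROUPS.get(item, [item]):
            if name not in names:
                names.append(name)
    return names


selected = expand_metrics(["retriever_metrics", "faithfulness", "claim_recall"])
print(selected)  # ['claim_recall', 'context_precision', 'faithfulness']
```

Selecting a narrow set like this is useful when only one component is under investigation, for example checking retrieval quality after changing the chunking strategy.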

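Conclusion

This notebook demonstrated how to integrate RAGChecker with LlamaIndex to evaluate the performance of a RAG system. We covered:

- Setting up RAGChecker with LlamaIndex
- Converting LlamaIndex output into RAGChecker format
- Evaluating RAG results with a range of metrics
- Customizing the evaluation with specific metric groups or individual metrics

RAGChecker's metrics show how a RAG system is performing, where it falls short, and whether the retrieval or the generation component is responsible, so that each can be tuned separately. Combined with LlamaIndex, this makes it a practical tool for developing and refining more effective RAG applications.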