Trustworthy RAG with the Trustworthy Language Model¶

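This tutorial demonstrates how to use Cleanlab's Trustworthy Language Model (TLM) in any RAG system to score the trustworthiness of answers and improve the system's overall reliability. We recommend completing the TLM example tutorial first.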

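Retrieval-Augmented Generation (RAG) has become a popular way to build LLM-based question-answering systems in domains where an LLM alone suffers from hallucinations, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses because they depend on LLMs that are fundamentally unreliable. Cleanlab's Trustworthy Language Model (TLM) offers a solution by providing trustworthiness scores to assess and improve response quality, independent of your RAG architecture or your retrieval and indexing processes.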

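To diagnose when RAG answers cannot be trusted, simply swap the existing LLM that generates answers from the retrieved context with TLM. This notebook shows how to do so in a standard RAG system based on a tutorial from the popular LlamaIndex framework. Here we only replace the LLM used in the LlamaIndex tutorial with TLM and showcase some of its benefits. TLM can be integrated into any other RAG framework in a similar way.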

Setup¶

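At its core, RAG connects an LLM to your data so that its answers are better grounded. This tutorial uses Nvidia's Q1 FY2024 earnings report as an example dataset. Use the following commands to download the data (the earnings report) and store it in a directory named data/.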

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/

Next, we install the required dependencies.

%pip install llama-index-llms-cleanlab llama-index llama-index-embeddings-huggingface

We then initialize Cleanlab's TLM. Here we initialize a CleanlabTLM object with default settings.

You can get your Cleanlab API key at https://app.cleanlab.ai/account after creating an account. For detailed instructions, refer to this guide.

from llama_index.llms.cleanlab import CleanlabTLM

# set api key in env or in llm
# import os
# os.environ["CLEANLAB_API_KEY"] = "your api key"

llm = CleanlabTLM(api_key="your_api_key")

Note: If you encounter a ValidationError during the above import, please upgrade your Python version to >= 3.11.

You may get better results by experimenting with the TLM configurations outlined in this advanced TLM tutorial.

For example, if your application requires OpenAI's GPT-4 model and you want to limit the output to 128 tokens, you can configure this with the options argument:

options = {
    "model": "gpt-4",
    "max_tokens": 128,
}
llm = CleanlabTLM(api_key="your_api_key", options=options)

Let's start by asking the LLM a simple question.

response = llm.complete("What is NVIDIA's ticker symbol?")
print(response)

NVIDIA's ticker symbol is NVDA.

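TLM not only provides a response but also includes a trustworthiness score indicating its confidence that the response is good/accurate. You can access this score directly from the response.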

response.additional_kwargs

{'trustworthiness_score': 0.9884868983475051}
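
One way to act on this score is to compare it with a cutoff and route low-scoring answers for review. The sketch below is illustrative only: the `is_trustworthy` helper and the 0.8 threshold are assumptions made for this example, not part of the Cleanlab or LlamaIndex APIs, and a suitable threshold depends on your application.

```python
def is_trustworthy(additional_kwargs: dict, threshold: float = 0.8) -> bool:
    """Return True when the TLM score meets the (assumed) threshold."""
    # Hypothetical helper: a missing score is treated as untrustworthy.
    score = additional_kwargs.get("trustworthiness_score")
    return score is not None and score >= threshold


# Example using the score shown above.
print(is_trustworthy({"trustworthiness_score": 0.9884868983475051}))
```

With a live response you would pass `response.additional_kwargs` to the helper.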

Build a RAG pipeline with TLM¶

Now let's integrate TLM into a RAG pipeline.

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Settings.llm = llm

Specify the embedding model¶

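RAG uses an embedding model to match queries against document chunks and retrieve the most relevant data. Here we opt for a free, local embedding model from Hugging Face. You can use any other embedding model by referring to this LlamaIndex guide.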

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
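
To make the matching step concrete, here is a minimal, framework-free sketch of how a query embedding can be ranked against chunk embeddings using cosine similarity. The three-dimensional vectors and chunk names are made up for illustration, and the helper names are hypothetical. LlamaIndex performs this step internally, and its implementation may differ from this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (assumed values, not produced by a real model).
toy_chunks = {
    "revenue": [0.9, 0.1, 0.0],
    "dividend": [0.1, 0.9, 0.0],
    "gaming": [0.0, 0.2, 0.9],
}
print(top_k_chunks([1.0, 0.2, 0.0], toy_chunks, k=1))
```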

Load data and create index + query engine¶

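Let's create an index from the documents stored in the data directory. The system can index multiple files in the same folder, although in this tutorial we use only one document. We stick with LlamaIndex's default index throughout this tutorial.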

documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    doc.excluded_llm_metadata_keys.append(
        "file_path"
    )  # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file
index = VectorStoreIndex.from_documents(documents)

The resulting index is used to power a query engine over the data.

query_engine = index.as_query_engine()

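Note that TLM is agnostic to the index and query engine used for RAG, and it is compatible with any choices you make for these components of your system.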

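In addition, you can use TLM's trustworthiness score directly in an existing custom-built RAG pipeline (using any other LLM generator, streaming or not).
To do so, you need to fetch the prompt sent to the LLM (including system instructions, retrieved context, user query, etc.) and the returned response. TLM requires both to predict trustworthiness.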

For details on this approach, along with example code, see here.
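
As a rough illustration of the two inputs involved, the sketch below assembles a full prompt from its parts and hands the prompt and response to a scoring function. Everything here is an assumption made for illustration: the prompt template, the helper names, and the `scorer` callable, which in a real pipeline would wrap a call to TLM's scoring endpoint (check Cleanlab's documentation for the exact method).

```python
from typing import Callable, List


def build_rag_prompt(
    system_instructions: str, context_chunks: List[str], user_query: str
) -> str:
    """Assemble the full prompt the LLM saw (template is hypothetical)."""
    context = "\n\n".join(context_chunks)
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )


def score_existing_answer(
    prompt: str, response: str, scorer: Callable[[str, str], float]
) -> float:
    """Score a (prompt, response) pair with a pluggable scorer."""
    return scorer(prompt, response)


# Stand-in scorer for demonstration only; it does not call TLM.
def placeholder_scorer(prompt: str, response: str) -> float:
    return 0.5


prompt = build_rag_prompt(
    "Answer using only the context.",
    ["NVIDIA reported revenue of $7.19 billion."],
    "What was NVIDIA's total revenue?",
)
print(score_existing_answer(prompt, "$7.19 billion", placeholder_scorer))
```

Keeping the scorer pluggable means the generator LLM can stay unchanged while only the scoring step talks to TLM.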

Extract the trustworthiness score from the LLM response¶

As shown above, Cleanlab's TLM provides a trustworthiness_score in addition to the text in its response to the prompt.

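To get this score when TLM is used inside a RAG pipeline, LlamaIndex provides an instrumentation tool that lets us observe the events running behind the scenes in RAG.
We can use this tooling to extract trustworthiness_score from the LLM's response.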

Let's define a simple event handler that stores this score for every request sent to the LLM. For more details on instrumentation, refer to the LlamaIndex documentation.

from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent


class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        """Class name."""
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent, **kwargs) -> None:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs[
                "trustworthiness_score"
            ]
            self.events.append(event)


# Root dispatcher
root_dispatcher = get_dispatcher()

# Register event handler
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)

For each query, we can fetch this score from event_handler.trustworthiness_score. Let's see it in action.
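
Beyond displaying the score, a pipeline could use it as a simple guardrail. The following sketch is one possible pattern, not part of the original tutorial: `guarded_query`, `FALLBACK_MESSAGE`, and the 0.8 threshold are hypothetical. It relies only on the query engine exposing `.query()` and the handler exposing `.trustworthiness_score`, and it assumes the handler holds the score of the most recent LLM call for that query.

```python
FALLBACK_MESSAGE = "I'm not confident in my answer to this question."


def guarded_query(query_engine, event_handler, question, threshold=0.8):
    """Run a query and replace low-trust answers with a fallback message.

    Returns a (text, score) tuple. The threshold is an assumed value.
    """
    response = query_engine.query(question)
    # The handler stores the score from the latest LLM completion event.
    score = event_handler.trustworthiness_score
    text = response.response if score >= threshold else FALLBACK_MESSAGE
    return text, score
```

A call would look like `text, score = guarded_query(query_engine, event_handler, "What was NVIDIA's total revenue?")`.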

Answering queries with our RAG system¶

Let's try out our TLM-based RAG pipeline. Here we pose questions with differing levels of complexity.

# Optional: Define `display_response` helper function

# This method presents formatted responses from our TLM-based RAG pipeline. It parses the output to display both the text response itself and the corresponding trustworthiness score.
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

Easy questions¶

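We first pose straightforward questions that can be answered directly from the provided data and easily located within a few lines of text.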

response = query_engine.query(
    "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)

Response: NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
Trustworthiness score: 1.0

response = query_engine.query(
    "What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)

Response: The GAAP earnings per diluted share for the quarter (Q1 FY24) was $0.82.
Trustworthiness score: 0.99

response = query_engine.query(
    "What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)

Response: Jensen Huang, NVIDIA's CEO, commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.
Trustworthiness score: 0.99

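TLM returns high trustworthiness scores for these responses, indicating high confidence that they are accurate. After a quick fact-check (reviewing the original earnings report), we can confirm that TLM did answer these questions accurately. In case you're curious, here are the relevant excerpts from the data context for these questions: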

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, ...

GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.

Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, ...

Questions without available context¶

Now let's see how TLM responds to queries that cannot be answered using the provided data.

response = query_engine.query(
    "What factors as per the report were responsible to the decline in NVIDIA's proviz revenue?"
)
display_response(response)

Response: The report indicates that NVIDIA's professional visualization revenue declined by 53% year-over-year. While the specific factors contributing to this decline are not detailed in the provided information, several potential reasons can be inferred:

1. **Market Conditions**: The overall market for professional visualization may have faced challenges, leading to reduced demand for NVIDIA's products in this segment.

2. **Increased Competition**: The presence of competitors in the professional visualization space could have impacted NVIDIA's market share and revenue.

3. **Economic Factors**: Broader economic conditions, such as inflation or reduced spending in industries that utilize professional visualization tools, may have contributed to the decline.

4. **Transition to New Technologies**: The introduction of new technologies, such as the NVIDIA Omniverse™ Cloud, may have shifted focus away from traditional professional visualization products, affecting revenue.

5. **Product Lifecycle**: If certain products were nearing the end of their lifecycle or if there were delays in new product launches, this could have impacted sales.

Overall, while the report does not specify the exact reasons for the decline, these factors could be contributing elements based on industry trends and market dynamics.
Trustworthiness score: 0.76

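The lower TLM trustworthiness score indicates more uncertainty about the response, which is consistent with the lack of available information. Let's try a few more questions.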

response = query_engine.query(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)

Response: The report does not explicitly explain the reasons for the year-over-year decrease in NVIDIA's Gaming revenue. However, it does provide context regarding the overall performance of the gaming segment, noting that first-quarter revenue was $2.24 billion, which is down 38% from a year ago but up 22% from the previous quarter. This suggests that while there may have been a decline compared to the same period last year, there was a recovery compared to the previous quarter. Factors that could contribute to the year-over-year decline might include market conditions, competition, or changes in consumer demand, but these specifics are not detailed in the report.
Trustworthiness score: 0.92

response = query_engine.query(
    "How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)

Response: The context information provided does not include specific details about the industry average for dividend payouts. Therefore, I cannot directly compare NVIDIA's dividend payout for this quarter to the industry average. However, NVIDIA announced a quarterly cash dividend of $0.04 per share for shareholders of record on June 8, 2023. To assess how this compares to the industry average, one would need to look up the average dividend payout for similar companies in the technology or semiconductor industry.
Trustworthiness score: 0.93

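We observe that TLM demonstrates the ability to recognize the limits of the available information. In the last two responses it refrains from generating speculative answers or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior shows an understanding of the boundaries of the context and prioritizes accuracy over conjecture.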

Challenging questions¶

Let's see how our RAG system responds to harder questions, some of which may be misleading.

response = query_engine.query(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in terms of $?"
)
display_response(response)

Response: NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and it was reported that this revenue was up 19% from the previous quarter. To find the revenue for the previous quarter, we can use the following calculation:

Let \( x \) be the revenue for the previous quarter. 

The equation based on the 19% increase is:
\[ 
x + 0.19x = 7.19 \text{ billion} 
\]
\[ 
1.19x = 7.19 \text{ billion} 
\]
\[ 
x = \frac{7.19 \text{ billion}}{1.19} \approx 6.04 \text{ billion} 
\]

Now, to find the decrease in revenue from the previous quarter to this quarter:
\[ 
\text{Decrease} = 7.19 \text{ billion} - 6.04 \text{ billion} \approx 1.15 \text{ billion} 
\]

Thus, NVIDIA's revenue decreased by approximately $1.15 billion this quarter compared to the last quarter.
Trustworthiness score: 0.6

response = query_engine.query(
    "This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)

Response: The report mentions the following companies: Microsoft and Dell. ServiceNow is also mentioned in the context, but it is not specified in the provided highlights. Therefore, the companies explicitly mentioned in the report are Microsoft and Dell.
Trustworthiness score: 0.6

response = query_engine.query(
    "How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced in NVIDIA's Q1 FY2024 financial results?",
)
display_response(response)

Response: In NVIDIA's Q1 FY2024 financial results, the following RTX GPU models were officially announced:

1. **GeForce RTX 4060 family of GPUs**
2. **GeForce RTX 4070 GPU**
3. **Six new NVIDIA RTX GPUs for mobile and desktop workstations**

This totals to **eight RTX GPU models** announced.
Trustworthiness score: 0.74

response = query_engine.query(
    "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?",
)
display_response(response)

Response: To calculate the projected annual revenue for NVIDIA's Data Center segment if it maintains its Q1 FY2024 quarter-over-quarter growth rate, we first need to determine the growth rate from Q4 FY2023 to Q1 FY2024.

NVIDIA reported a record Data Center revenue of $4.28 billion for Q1 FY2024. The revenue for the previous quarter (Q4 FY2023) can be calculated as follows:

Let \( R \) be the revenue for Q4 FY2023. The growth rate from Q4 FY2023 to Q1 FY2024 is given by:

\[
\text{Growth Rate} = \frac{\text{Q1 Revenue} - \text{Q4 Revenue}}{\text{Q4 Revenue}} = \frac{4.28 - R}{R}
\]

We know that the overall revenue for Q1 FY2024 is $7.19 billion, which is up 19% from the previous quarter. Therefore, we can express the revenue for Q4 FY2023 as:

\[
\text{Q1 FY2024 Revenue} = \text{Q4 FY2023 Revenue} \times 1.19
\]

Substituting the known value:

\[
7.19 = R \times 1.19
\]

Solving for \( R \):

\[
R = \frac{7.19}{1.19} \approx 6.03 \text{ billion}
\]

Now, we can calculate the Data Center revenue for Q4 FY2023. Since we don't have the exact figure for the Data Center revenue in Q4 FY2023, we will assume that the Data Center revenue also grew by the same percentage as the overall revenue. 

Now, we can calculate the quarter-over-quarter growth rate for the Data Center segment:

\[
\text{Growth Rate} = \frac{4.28 - R_D}{R_D}
\]

Where \( R_D \) is the Data Center revenue for Q4 FY2023. However, we need to find \( R_D \) first. 

Assuming the Data Center revenue was a certain percentage of the total revenue in Q4 FY2023, we can estimate it. For simplicity, let's assume the Data Center revenue was around 50% of the total revenue in Q4 FY2023 (this is a rough estimate, as we don't have the exact figure).

Thus, \( R_D \approx 0.5 \times 6
Trustworthiness score: 0.69

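TLM automatically alerts us that these answers are unreliable through the lower trustworthiness scores. A RAG system with TLM helps you exercise appropriate caution when you see low trustworthiness scores. Here are the correct answers to the questions above: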

NVIDIA's revenue increased by $1.14 billion this quarter compared to last quarter.

Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies.

There is not a specific total count of RTX GPUs mentioned.

Projected annual revenue if this growth rate is maintained for the next four quarters: approximately $26.34 billion.
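
The projection figure can be reproduced with a short calculation. This sketch assumes a Data Center quarter-over-quarter growth rate of 18%, a value that is not quoted in the excerpts above and should be verified against the report. It sums the next four quarters, each compounding from the $4.28 billion reported for Q1 FY2024. The function name is hypothetical.

```python
def projected_revenue(current_quarter: float, qoq_growth: float, quarters: int = 4) -> float:
    """Sum revenue over the next `quarters` quarters at a constant growth rate."""
    return sum(
        current_quarter * (1 + qoq_growth) ** k for k in range(1, quarters + 1)
    )


# 4.28 (billion USD) is from the report; 0.18 is an assumed growth rate.
total = projected_revenue(4.28, 0.18)
print(f"{total:.2f}")  # prints 26.34
```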

With TLM, you can easily increase trust in any RAG system!

Feel free to check out TLM's performance benchmarks for more details.