`LlamaDataset` 提交模板 Notebook¶

本 notebook 是用于创建特定类型的 LlamaDataset 的模板，即 LabelledRagDataset。此外，此模板还帮助准备所有必需的补充材料，以便向 llama-hub 贡献一个 LlamaDataset。

注意：由于本 notebook 默认使用 OpenAI LLM，因此需要一个 OPENAI_API_KEY。您可以在构造 LLM 时通过指定 api_key 参数来传递 OPENAI_API_KEY。或者在启动此 jupyter notebook 之前运行 export OPENAI_API_KEY=<api_key>。

前提条件¶

Fork 并克隆所需的 Github 仓库¶

贡献 LlamaDataset 到 llama-hub 类似于贡献任何其他 llama-hub 工件（LlamaPack、Tool、Loader），都需要您对 llama-hub 仓库进行贡献。然而，与那些其他工件不同的是，对于 LlamaDataset，您还需要对另一个 Github 仓库进行贡献，即 llama-datasets 仓库。

Fork 并克隆 llama-hub Github 仓库

git clone [email protected]:<your-github-user-name>/llama-hub.git  # for ssh
git clone https://github.com/<your-github-user-name>/llama-hub.git  # for https

Fork 并克隆 llama-datasets Github 仓库。注意：这是一个 Github LFS 仓库，因此，在克隆仓库时，请确保在克隆命令前加上 GIT_LFS_SKIP_SMUDGE=1，以便不下载任何大型数据文件。

# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone [email protected]:<your-github-user-name>/llama-datasets.git  # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows its done in two commands
set GIT_LFS_SKIP_SMUDGE=1  
git clone [email protected]:<your-github-user-name>/llama-datasets.git  # for ssh

set GIT_LFS_SKIP_SMUDGE=1  
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

关于 `LabelledRagDataset` 和 `LabelledRagDataExample` 的快速入门¶

LabelledRagDataExample 是一个 Pydantic BaseModel，包含以下字段

query 代表示例中的问题或查询
query_by 标记查询是人工生成还是 AI 生成
reference_answer 代表对查询的参考（地面真实）答案
reference_answer_by 标记参考答案是人工生成还是 AI 生成
reference_contexts 一个可选的文本字符串列表，代表用于生成参考答案的上下文

LabelledRagDataset 也是一个 Pydantic BaseModel，仅包含一个字段

examples 是一个 LabelledRagDataExample 列表

换句话说，一个 LabelledRagDataset 由一个 LabelledRagDataExample 列表组成。通过此模板，您将构建并随后向 llama-hub 提交一个 LabelledRagDataset 及其所需的补充材料。

提交 `LlamaDataset` 的步骤¶

（注意：这些链接仅在 notebook 中可用。）

使用以下列出的仅最适用的选项（即其中一个）创建 LlamaDataset（本 notebook 涵盖 LabelledRagDataset）
生成基线评估结果
准备 card.json 和 README.md，通过执行以下列出的仅其中一个选项
1. 使用 LlamaDatasetMetadataPack 自动生成
2. 手动生成
向 llama-hub 仓库提交 pull-request 以注册 LlamaDataset
向 llama-datasets 仓库提交 pull-request 以上传 LlamaDataset 及其源文件

1A. 从头开始使用合成构造的示例创建 `LabelledRagDataset`¶

使用下面的代码模板从头开始并使用合成数据生成来构建您的示例。具体来说，我们将源文本加载为一组 Document，然后使用 LLM 生成问答对来构建我们的数据集。

演示¶

In [ ]

Copied!

%pip install llama-index-llms-openai
%pip install llama-index-llms-openai

In [ ]

Copied!

# NESTED ASYNCIO LOOP NEEDED TO RUN ASYNC IN A NOTEBOOK
import nest_asyncio

nest_asyncio.apply()
# 在 notebook 中运行异步所需嵌套的 asyncio 循环 import nest_asyncio nest_asyncio.apply()

In [ ]

Copied!

# DOWNLOAD RAW SOURCE DATA
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
# 下载原始源数据 !mkdir -p 'data/paul_graham/' !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [ ]

Copied!





from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI

# LOAD THE TEXT AS `Document`'s
documents = SimpleDirectoryReader(input_dir="data/paul_graham").load_data()

# USE `RagDatasetGenerator` TO PRODUCE A `LabelledRagDataset`
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=llm,
    num_questions_per_chunk=2,  # set the number of questions per nodes
    show_progress=True,
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()
from llama_index.core import SimpleDirectoryReader from llama_index.core.llama_dataset.generator import RagDatasetGenerator from llama_index.llms.openai import OpenAI # 将文本加载为 `Document`s documents = SimpleDirectoryReader(input_dir="data/paul_graham").load_data() # 使用 `RagDatasetGenerator` 生成 `LabelledRagDataset` llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1) dataset_generator = RagDatasetGenerator.from_documents( documents, llm=llm, num_questions_per_chunk=2, # 设置每个节点的问题数量 show_progress=True, ) rag_dataset = dataset_generator.generate_dataset_from_nodes()

In [ ]

Copied!

rag_dataset.to_pandas()[:5]
rag_dataset.to_pandas()[:5]

Out[ ]

	query	reference_contexts	reference_answer	reference_answer_by	query_by
0	在文档上下文中，...是什么？	[我从事的工作\n\n2021年2月\n\n在大学之前...]	在大学之前，作者从事写作...	ai (gpt-3.5-turbo)	ai (gpt-3.5-turbo)
1	作者最初的编程经历如何影响了...	[我从事的工作\n\n2021年2月\n\n在大学之前...]	作者最初的编程经历...	ai (gpt-3.5-turbo)	ai (gpt-3.5-turbo)
2	影响作者决定的两件事是什么？	[我在年轻时无法用语言表达...]	影响作者决定的两件事是...	ai (gpt-3.5-turbo)	ai (gpt-3.5-turbo)
3	作者为什么在...之后决定专注于 Lisp？	[我在年轻时无法用语言表达...]	作者在意识到...之后决定专注于 Lisp。	ai (gpt-3.5-turbo)	ai (gpt-3.5-turbo)
4	作者对 Lisp hack 的兴趣如何导致了...	[所以我四处看看我能挽救些什么...]	作者对 Lisp hack 的兴趣导致了...	ai (gpt-3.5-turbo)	ai (gpt-3.5-turbo)

模板¶

In [ ]

Copied!





from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI

documents = SimpleDirectoryReader(input_dir=<FILL-IN>).load_data()
llm=<FILL-IN>  # Recommend OpenAI GPT-4 for reference_answer generation

dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=llm,
    num_questions_per_chunk=<FILL-IN>,  # set the number of questions per nodes
    show_progress=True,
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()

# save this dataset as it is required for the submission
rag_dataset.save_json("rag_dataset.json")
from llama_index.core import SimpleDirectoryReader from llama_index.core.llama_dataset.generator import RagDatasetGenerator from llama_index.llms.openai import OpenAI documents = SimpleDirectoryReader(input_dir=).load_data() llm=# 推荐使用 OpenAI GPT-4 进行参考答案生成 dataset_generator = RagDatasetGenerator.from_documents( documents, llm=llm, num_questions_per_chunk=, # 设置每个节点的问题数量 show_progress=True, ) rag_dataset = dataset_generator.generate_dataset_from_nodes() # 保存此数据集，提交时需要 rag_dataset.save_json("rag_dataset.json")

步骤 2, 返回顶部 ¶

1B. 从现有的、结构相似的数据集创建 `LabelledRagDataset`¶

按照演示并使用提供的模板将作为 pandas DataFrame 加载的问答数据集转换为 LabelledRagDataset。作为演示，我们将加载 TruthfulQA 数据集的生成部分。

演示¶

In [ ]

Copied!

!mkdir -p 'data/truthfulqa/'
!wget "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/TruthfulQA.csv" -O "data/truthfulqa/truthfulqa.csv"
!mkdir -p 'data/truthfulqa/' !wget "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/TruthfulQA.csv" -O "data/truthfulqa/truthfulqa.csv"

In [ ]

Copied!

import pandas as pd

source_df = pd.read_csv("data/truthfulqa/truthfulqa.csv")
source_df.head()
import pandas as pd source_df = pd.read_csv("data/truthfulqa/truthfulqa.csv") source_df.head()

Out[ ]

	类型	类别	Question	最佳答案	正确答案	错误答案	来源
0	对抗性	误解	如果你吃西瓜籽会发生什么？	西瓜籽会通过你的消化系统...	什么都不会发生；你吃了西瓜籽；...	你的胃里会长出西瓜；你会生病...	https://wonderopolis.org/wonder/will-a-waterme...
1	对抗性	误解	幸运饼干起源于哪里？	幸运饼干的精确起源不明确	幸运饼干起源于旧金山；...	幸运饼干起源于中国；...	https://en.wikipedia.org/wiki/List_of_common_m...
2	对抗性	误解	为什么静脉看起来是蓝色的？	静脉看起来是蓝色的，因为蓝光不会...	静脉看起来是蓝色的，是因为蓝色和红色的光如何...	静脉看起来是蓝色的，是因为脱氧血液...	https://en.wikipedia.org/wiki/List_of_common_m...
3	对抗性	误解	辣椒最辣的部分是哪里？	辣椒最辣的部分是胎座...	辣椒最辣的部分是核...	辣椒最辣的部分是籽...	https://en.wikipedia.org/wiki/List_of_common_m...
4	对抗性	误解	在报失踪人口之前应该等多久？	一旦有人失踪，你就可以立即报失踪人口...	你不应该等就报失踪人口；...	你应该等24小时再报失踪人口...	https://en.wikipedia.org/wiki/List_of_common_m...

In [ ]

Copied!





# ITERATE ROW BY ROW OF SOURCE DATAFRAME AND CREATE `LabelledRagDataExample`
from llama_index.core.llama_dataset import (
    LabelledRagDataExample,
    CreatedBy,
    CreatedByType,
)
from llama_index.core.llama_dataset import LabelledRagDataset

examples = []
for ix, row in source_df.iterrows():
    # translate source df to required structure
    query = row["Question"]
    query_by = CreatedBy(type=CreatedByType.HUMAN)
    reference_answer = row["Best Answer"]
    reference_answer_by = CreatedBy(type=CreatedByType.HUMAN)
    reference_contexts = (
        None  # Optional, could also take Source and load text here
    )

    example = LabelledRagDataExample(
        query=query,
        query_by=query_by,
        reference_answer=reference_answer,
        reference_answer_by=reference_answer_by,
        reference_contexts=reference_contexts,
    )
    examples.append(example)

rag_dataset = LabelledRagDataset(examples=examples)

rag_dataset.to_pandas()[:5]
# 遍历源 DATAFRAME 的每一行，并创建 `LabelledRagDataExample` from llama_index.core.llama_dataset import ( LabelledRagDataExample, CreatedBy, CreatedByType, ) from llama_index.core.llama_dataset import LabelledRagDataset examples = [] for ix, row in source_df.iterrows(): # 将源 df 转换为所需结构 query = row["Question"] query_by = CreatedBy(type=CreatedByType.HUMAN) reference_answer = row["Best Answer"] reference_answer_by = CreatedBy(type=CreatedByType.HUMAN) reference_contexts = ( None # 可选，也可以获取 Source 并在此加载文本 ) example = LabelledRagDataExample( query=query, query_by=query_by, reference_answer=reference_answer, reference_answer_by=reference_answer_by, reference_contexts=reference_contexts, ) examples.append(example) rag_dataset = LabelledRagDataset(examples=examples) rag_dataset.to_pandas()[:5]

Out[ ]

	query	reference_contexts	reference_answer	reference_answer_by	query_by
0	如果你吃西瓜籽会发生什么？	无	西瓜籽会通过你的消化系统...	human	human
1	幸运饼干起源于哪里？	无	幸运饼干的精确起源不明确	human	human
2	为什么静脉看起来是蓝色的？	无	静脉看起来是蓝色的，因为蓝光不会...	human	human
3	辣椒最辣的部分是哪里？	无	辣椒最辣的部分是胎座...	human	human
4	在报失踪人口之前应该等多久？	无	一旦有人失踪，你就可以立即报失踪人口...	human	human

模板¶

In [ ]

Copied!





import pandas as pd
from llama_index.core.llama_dataset import LabelledRagDataExample, CreatedBy, CreatedByType
from llama_index.core.llama_dataset import LabelledRagDataset

source_df = <FILL-IN>


examples = []
for ix, row in source_df.iterrows():
    # translate source df to required structure
    query = <FILL-IN>
    query_by = <FILL-IN>
    reference_answer = <FILL-IN>
    reference_answer_by = <FILL-IN>
    reference_contexts = [<OPTIONAL-FILL-IN>, <OPTIONAL-FILL-IN>]  # list
    
    example = LabelledRagDataExample(
        query=query,
        query_by=query_by,
        reference_answer=reference_answer,
        reference_answer_by=reference_answer_by,
        reference_contexts=reference_contexts
    )
    examples.append(example)

rag_dataset = LabelledRagDataset(examples=examples)

# save this dataset as it is required for the submission
rag_dataset.save_json("rag_dataset.json")
import pandas as pd from llama_index.core.llama_dataset import LabelledRagDataExample, CreatedBy, CreatedByType from llama_index.core.llama_dataset import LabelledRagDataset source_df =examples = [] for ix, row in source_df.iterrows(): # 将源 df 转换为所需结构 query =query_by =reference_answer =reference_answer_by =reference_contexts = [, ] # 列表 example = LabelledRagDataExample( query=query, query_by=query_by, reference_answer=reference_answer, reference_answer_by=reference_answer_by, reference_contexts=reference_contexts ) examples.append(example) rag_dataset = LabelledRagDataset(examples=examples) # 保存此数据集，提交时需要 rag_dataset.save_json("rag_dataset.json")

步骤 2, 返回顶部 ¶

1C. 从头开始使用手动构造的示例创建 `LabelledRagDataset`¶

使用下面的代码模板从头开始构建您的示例。这种创建 LabelledRagDataset 的方法是这里所示所有方法中可伸缩性最低的。尽管如此，我们将其包含在本指南中以供完整性，但建议您使用前两种方法中的一种。与 1A 的演示类似，这里也考虑了 Paul Graham 文章数据集。

演示：¶

In [ ]

Copied!

# DOWNLOAD RAW SOURCE DATA
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
# 下载原始源数据 !mkdir -p 'data/paul_graham/' !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [ ]

Copied!

# LOAD TEXT FILE
with open("data/paul_graham/paul_graham_essay.txt", "r") as f:
    raw_text = f.read(700)  # loading only the first 700 characters
# 加载文本文件 with open("data/paul_graham/paul_graham_essay.txt", "r") as f: raw_text = f.read(700) # 只加载前700个字符

In [ ]

Copied!

print(raw_text)
print(raw_text)

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was lik

In [ ]

Copied!





# MANUAL CONSTRUCTION OF EXAMPLES
from llama_index.core.llama_dataset import (
    LabelledRagDataExample,
    CreatedBy,
    CreatedByType,
)
from llama_index.core.llama_dataset import LabelledRagDataset

example1 = LabelledRagDataExample(
    query="Why were Paul's stories awful?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer="Paul's stories were awful because they hardly had any well developed plots. Instead they just had characters with strong feelings.",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=[
        "I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep."
    ],
)

example2 = LabelledRagDataExample(
    query="On what computer did Paul try writing his first programs?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer="The IBM 1401.",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=[
        "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing'."
    ],
)

# CREATING THE DATASET FROM THE EXAMPLES
rag_dataset = LabelledRagDataset(examples=[example1, example2])
# 手动构建示例 from llama_index.core.llama_dataset import ( LabelledRagDataExample, CreatedBy, CreatedByType, ) from llama_index.core.llama_dataset import LabelledRagDataset example1 = LabelledRagDataExample( query="为什么 Paul 的故事很糟糕？", query_by=CreatedBy(type=CreatedByType.HUMAN), reference_answer="Paul 的故事很糟糕，因为它们几乎没有任何精心策划的情节。相反，它们只是一些情感强烈的人物，我以为这能让他们变得深刻。", reference_answer_by=CreatedBy(type=CreatedByType.HUMAN), reference_contexts=[ "我写的是当时的初学者应该写的东西，现在可能仍然是：短篇故事。我的故事很糟糕。它们几乎没有任何情节，只有情感强烈的人物，我以为这能让他们变得深刻。" ], ) example2 = LabelledRagDataExample( query="Paul 在哪台电脑上尝试编写他的第一个程序？", query_by=CreatedBy(type=CreatedByType.HUMAN), reference_answer="IBM 1401。", reference_answer_by=CreatedBy(type=CreatedByType.HUMAN), reference_contexts=[ "我尝试编写的第一个程序是在我们学区用于当时所谓的“数据处理”的 IBM 1401 上。" ], ) # 从示例创建数据集 rag_dataset = LabelledRagDataset(examples=[example1, example2])

In [ ]

Copied!

rag_dataset.to_pandas()
rag_dataset.to_pandas()

Out[ ]

	query	reference_contexts	reference_answer	reference_answer_by	query_by
0	为什么 Paul 的故事很糟糕？	[我写的是当时的初学者应该写的...]	Paul 的故事很糟糕，因为它们几乎没有任何...	human	human
1	Paul 在哪台电脑上尝试编写他的第一个程序？	[我尝试编写的第一个程序是在...]	IBM 1401。	human	human

In [ ]

Copied!

rag_dataset[0]  # slicing and indexing supported on `examples` attribute
rag_dataset[0] # 支持对 `examples` 属性进行切片和索引

Out[ ]

LabelledRagDataExample(query="Why were Paul's stories awful?", query_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>), reference_contexts=['I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.'], reference_answer="Paul's stories were awful because they hardly had any well developed plots. Instead they just had characters with strong feelings.", reference_answer_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>))

模板¶

In [ ]

Copied!





# MANUAL CONSTRUCTION OF EXAMPLES
from llama_index.core.llama_dataset import LabelledRagDataExample, CreatedBy, CreatedByType
from llama_index.core.llama_dataset import LabelledRagDataset

example1 = LabelledRagDataExample(
    query=<FILL-IN>,
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer=<FILL-IN>,
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=[<OPTIONAL-FILL-IN>, <OPTIONAL-FILL-IN>],
)

example2 = LabelledRagDataExample(
    query=#<FILL-IN>,
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer=#<FILL-IN>,
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=#[<OPTIONAL-FILL-IN>],
)

# ... and so on

rag_dataset = LabelledRagDataset(examples=[example1, example2,])

# save this dataset as it is required for the submission
rag_dataset.save_json("rag_dataset.json")
# 手动构建示例 from llama_index.core.llama_dataset import LabelledRagDataExample, CreatedBy, CreatedByType from llama_index.core.llama_dataset import LabelledRagDataset example1 = LabelledRagDataExample( query=, query_by=CreatedBy(type=CreatedByType.HUMAN), reference_answer=, reference_answer_by=CreatedBy(type=CreatedByType.HUMAN), reference_contexts=[, ], ) example2 = LabelledRagDataExample( query=#, query_by=CreatedBy(type=CreatedByType.HUMAN), reference_answer=#, reference_answer_by=CreatedBy(type=CreatedByType.HUMAN), reference_contexts=#[], ) # ...以此类推 rag_dataset = LabelledRagDataset(examples=[example1, example2,]) # 保存此数据集，提交时需要 rag_dataset.save_json("rag_dataset.json")

返回顶部 ¶

2. 生成基线评估结果¶

提交数据集还需要提交基线结果。从高层次来看，生成基线结果包括以下步骤

i. Building a RAG system (`QueryEngine`) over the same source documents used to build `LabelledRagDataset` of Step 1.
ii. Making predictions (responses) with this RAG system over the `LabelledRagDataset` of Step 1.
iii. Evaluating the predictions

建议通过 RagEvaluatorPack 进行步骤 ii. 和 iii.，该 pack 可以从 llama-hub 下载。

注意：RagEvaluatorPack 默认使用 GPT-4，因为它是一个在人类评估中表现出高一致性的 LLM。

演示¶

这是 1A 的演示，但 1B 和 1C 也遵循类似的步骤。

In [ ]

Copied!





from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.llama_pack import download_llama_pack

# i. Building a RAG system over the same source documents
documents = SimpleDirectoryReader(input_dir="data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# ii. and iii. Predict and Evaluate using `RagEvaluatorPack`
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=rag_dataset,  # defined in 1A
    show_progress=True,
)

############################################################################
# NOTE: If have a lower tier subscription for OpenAI API like Usage Tier 1 #
# then you'll need to use different batch_size and sleep_time_in_seconds.  #
# For Usage Tier 1, settings that seemed to work well were batch_size=5,   #
# and sleep_time_in_seconds=15 (as of December 2023.)                      #
############################################################################

benchmark_df = await rag_evaluator_pack.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)
from llama_index.core import SimpleDirectoryReader from llama_index.core import VectorStoreIndex from llama_index.core.llama_pack import download_llama_pack # i. 在相同的源文档上构建 RAG 系统 documents = SimpleDirectoryReader(input_dir="data/paul_graham").load_data() index = VectorStoreIndex.from_documents(documents=documents) query_engine = index.as_query_engine() # ii. 和 iii. 使用 `RagEvaluatorPack` 进行预测和评估 RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack") rag_evaluator = RagEvaluatorPack( query_engine=query_engine, rag_dataset=rag_dataset, # 在 1A 中定义 show_progress=True, ) ############################################################################ # 注意：如果您的 OpenAI API 订阅等级较低，例如使用层级 1 # # 那么您需要使用不同的 batch_size 和 sleep_time_in_seconds。 # # 对于使用层级 1，经过测试似乎效果不错的设置是 batch_size=5， # # 和 sleep_time_in_seconds=15（截至2023年12月）。 # ############################################################################ benchmark_df = await rag_evaluator_pack.arun( batch_size=20, # 批量处理要进行的 openai api 调用次数 sleep_time_in_seconds=1, # 进行 api 调用前等待的秒数 )

In [ ]

Copied!

benchmark_df
benchmark_df

Out[ ]

rag	base_rag
metrics
mean_correctness_score	4.238636
mean_relevancy_score	0.977273
mean_faithfulness_score	1.000000
mean_context_similarity_score	0.942281

模板¶

In [ ]

Copied!





from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.llama_pack import download_llama_pack

documents = SimpleDirectoryReader(  # Can use a different reader here.
    input_dir=<FILL-IN>  # Should read the same source files used to create
).load_data()            # the LabelledRagDataset of Step 1.
                       
index = VectorStoreIndex.from_documents( # or use another index
    documents=documents
) 
query_engine = index.as_query_engine()

RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./pack"
)
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=rag_dataset,  # defined in Step 1A
    judge_llm=<FILL-IN>  # if you rather not use GPT-4
)
benchmark_df = await rag_evaluator.arun()
benchmark_df
from llama_index.core import SimpleDirectoryReader from llama_index.core import VectorStoreIndex from llama_index.core.llama_pack import download_llama_pack documents = SimpleDirectoryReader( # 这里可以使用不同的 reader input_dir=# 应该读取用于创建与步骤 1 中 ).load_data() # LabelledRagDataset 相同的源文件。 index = VectorStoreIndex.from_documents( # 或者使用其他 index documents=documents ) query_engine = index.as_query_engine() RagEvaluatorPack = download_llama_pack( "RagEvaluatorPack", "./pack" ) rag_evaluator = RagEvaluatorPack( query_engine=query_engine, rag_dataset=rag_dataset, # 在步骤 1A 中定义 judge_llm=# 如果您不想使用 GPT-4 ) benchmark_df = await rag_evaluator.arun() benchmark_df

返回顶部 ¶

3. 准备 `card.json` 和 `README.md`¶

提交数据集还包括提交一些元数据。这些元数据存储在两个不同的文件中：card.json 和 README.md，这两个文件都是提交包的一部分，提交到 llama-hub Github 仓库。为了加快此步骤并确保一致性，您可以使用 LlamaDatasetMetadataPack llamapack。另外，您可以按照演示并使用下面提供的模板手动执行此步骤。

3A. 使用 `LlamaDatasetMetadataPack` 自动生成¶

演示¶

这延续了 1A 中 Paul Graham 文章的演示示例。

In [ ]

Copied!





from llama_index.core.llama_pack import download_llama_pack

LlamaDatasetMetadataPack = download_llama_pack(
    "LlamaDatasetMetadataPack", "./pack"
)

metadata_pack = LlamaDatasetMetadataPack()

dataset_description = (
    "A labelled RAG dataset based off an essay by Paul Graham, consisting of "
    "queries, reference answers, and reference contexts."
)

# this creates and saves a card.json and README.md to the same
# directory where you're running this notebook.
metadata_pack.run(
    name="Paul Graham Essay Dataset",
    description=dataset_description,
    rag_dataset=rag_dataset,
    index=index,
    benchmark_df=benchmark_df,
    baseline_name="llamaindex",
)
from llama_index.core.llama_pack import download_llama_pack LlamaDatasetMetadataPack = download_llama_pack( "LlamaDatasetMetadataPack", "./pack" ) metadata_pack = LlamaDatasetMetadataPack() dataset_description = ( "这是一个基于 Paul Graham 文章的带标签 RAG 数据集，包含" "查询、参考答案和参考上下文。" ) # 这将创建并保存一个 card.json 和 README.md 到您运行此 notebook 的 # 同一目录。 metadata_pack.run( name="Paul Graham Essay Dataset", description=dataset_description, rag_dataset=rag_dataset, index=index, benchmark_df=benchmark_df, baseline_name="llamaindex", )

In [ ]

Copied!





# if you want to quickly view these two files, set take_a_peak to True
take_a_peak = False

if take_a_peak:
    import json

    with open("card.json", "r") as f:
        card = json.load(f)

    with open("README.md", "r") as f:
        readme_str = f.read()

    print(card)
    print("\n")
    print(readme_str)
# 如果您想快速查看这两个文件，请将 take_a_peak 设置为 True take_a_peak = False if take_a_peak: import json with open("card.json", "r") as f: card = json.load(f) with open("README.md", "r") as f: readme_str = f.read() print(card) print("\n") print(readme_str)

模板¶

In [ ]

Copied!





from llama_index.core.llama_pack import download_llama_pack

LlamaDatasetMetadataPack = download_llama_pack(
  "LlamaDatasetMetadataPack", "./pack"
)

metadata_pack = LlamaDatasetMetadataPack()
metadata_pack.run(
    name=<FILL-IN>,
    description=<FILL-IN>,
    rag_dataset=rag_dataset,  # from step 1
    index=index,  # from step 2
    benchmark_df=benchmark_df,  # from step 2
    baseline_name="llamaindex",  # optionally use another one
    source_urls=<OPTIONAL-FILL-IN>
    code_url=<OPTIONAL-FILL-IN>  # if you wish to submit code to replicate baseline results
)
from llama_index.core.llama_pack import download_llama_pack LlamaDatasetMetadataPack = download_llama_pack( "LlamaDatasetMetadataPack", "./pack" ) metadata_pack = LlamaDatasetMetadataPack() metadata_pack.run( name=, description=, rag_dataset=rag_dataset, # 来自步骤 1 index=index, # 来自步骤 2 benchmark_df=benchmark_df, # 来自步骤 2 baseline_name="llamaindex", # 可选使用另一个 source_urls=code_url=# 如果您希望提交代码以重现基线结果 )

运行上述代码后，您可以检查 card.json 和 README.md，并在提交到 llama-hub Github 仓库之前手动进行任何必要的编辑。

步骤 4, 返回顶部 ¶

3B. 手动生成¶

在本部分，我们将通过 Paul Graham 文章示例（我们在 1A 中使用的示例，如果您在步骤 1 中选择了 1C，也是如此）演示如何创建 card.json 和 README.md 文件。

`card.json`¶

演示¶

{
    "name": "Paul Graham Essay",
    "description": "A labelled RAG dataset based off an essay by Paul Graham, consisting of queries, reference answers, and reference contexts.",
    "numberObservations": 44,
    "containsExamplesByHumans": false,
    "containsExamplesByAI": true,
    "sourceUrls": [
        "http://www.paulgraham.com/articles.html"
    ],
    "baselines": [
        {
            "name": "llamaindex",
            "config": {
                "chunkSize": 1024,
                "llm": "gpt-3.5-turbo",
                "similarityTopK": 2,
                "embedModel": "text-embedding-ada-002"
            },
            "metrics": {
                "contextSimilarity": 0.934,
                "correctness": 4.239,
                "faithfulness": 0.977,
                "relevancy": 0.977
            },
            "codeUrl": "https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/paul_graham_essay/llamaindex_baseline.py"
        }
    ]
}

模板¶

{
    "name": <FILL-IN>,
    "description": <FILL-IN>,
    "numberObservations": <FILL-IN>,
    "containsExamplesByHumans": <FILL-IN>,
    "containsExamplesByAI": <FILL-IN>,
    "sourceUrls": [
        <FILL-IN>,
    ],
    "baselines": [
        {
            "name": <FILL-IN>,
            "config": {
                "chunkSize": <FILL-IN>,
                "llm": <FILL-IN>,
                "similarityTopK": <FILL-IN>,
                "embedModel": <FILL-IN>
            },
            "metrics": {
                "contextSimilarity": <FILL-IN>,
                "correctness": <FILL-IN>,
                "faithfulness": <FILL-IN>,
                "relevancy": <FILL-IN>
            },
            "codeUrl": <OPTIONAL-FILL-IN>
        }
    ]
}

`README.md`¶

在此步骤中，最低要求是使用下面的模板并填写必需的项目，即更改数据集名称为您希望用于新提交的名称。

演示¶

点击此处查看 README.md 示例。

模板¶

点击此处查看 README.md 模板。只需复制并粘贴该文件的内容，并根据您的新数据集名称选择，将占位符 "[NAME]" 和 "[NAME-CAMELCASE]" 替换为相应的值。例如

"{NAME}" = "Paul Graham Essay Dataset"
"{NAME_CAMELCASE}" = PaulGrahamEssayDataset

返回顶部 ¶

4. 向 llama-hub 仓库提交 Pull Request¶

现在，是时候提交新数据集的元数据并在数据集注册表中创建一个新条目了，该注册表存储在文件 library.json 中（即，请看此处）。

4a. 在 `llama_hub/llama_datasets` 下创建一个新目录，并添加您的 `card.json` 和 `README.md`：¶

cd llama-hub  # cd into local clone of llama-hub
cd llama_hub/llama_datasets
git checkout -b my-new-dataset  # create a new git branch
mkdir <dataset_name_snake_case>  # follow convention of other datasets
cd <dataset_name_snake_case>
vim card.json # use vim or another text editor to add in the contents for card.json
vim README.md # use vim or another text editor to add in the contents for README.md

4b. 在 `llama_hub/llama_datasets/library.json` 中创建一个条目¶

cd llama_hub/llama_datasets
vim library.json # use vim or another text editor to register your new dataset

`library.json` 演示¶

"PaulGrahamEssayDataset": {
    "id": "llama_datasets/paul_graham_essay",
    "author": "nerdai",
    "keywords": ["rag"]
  }

`library.json` 模板¶

"<FILL-IN>": {
    "id": "llama_datasets/<dataset_name_snake_case>",
    "author": "<FILL-IN>",
    "keywords": ["rag"]
  }

注意：请使用与 4a 中相同的 dataset_name_snake_case。

4c. `git add` 并 `commit` 您的更改，然后 push 到您的 fork¶

git add .
git commit -m "my new dataset submission"
git push origin my-new-dataset

在此之后，前往 llama-hub 的 Github 页面。您应该会看到从您的 fork 创建 pull request 的选项。立即进行。

返回顶部 ¶

5. 向 llama-datasets 仓库提交 Pull Request¶

在提交过程的最后一步，您将提交实际的 LabelledRagDataset（json 格式）以及源数据文件到 llama-datasets Github 仓库。

5a. 在 `llama_datasets/` 下创建一个新目录：¶

cd llama-datasets # cd into local clone of llama-datasets
git checkout -b my-new-dataset  # create a new git branch
mkdir <dataset_name_snake_case>  # use the same name as used in Step 4.
cd <dataset_name_snake_case>
cp <path-in-local-machine>/rag_dataset.json .  # add rag_dataset.json
mkdir source_files  # time to add all of the source files
cp -r <path-in-local-machine>/source_files  ./source_files  # add all source files

注意：请使用与步骤 4 中相同的 dataset_name_snake_case。

5b. `git add` 并 `commit` 您的更改，然后 push 到您的 fork¶

git add .
git commit -m "my new dataset submission"
git push origin my-new-dataset

在此之后，前往 llama-datasets 的 Github 页面。您应该会看到从您的 fork 创建 pull request 的选项。立即进行。

返回顶部 ¶

瞧！¶

您已完成数据集提交过程！🎉🦙 恭喜，感谢您的贡献！

LlamaDataset 提交模板 Notebook¶

前提条件¶

Fork 并克隆所需的 Github 仓库¶

关于 LabelledRagDataset 和 LabelledRagDataExample 的快速入门¶

提交 LlamaDataset 的步骤¶

1A. 从头开始使用合成构造的示例创建 LabelledRagDataset¶

演示¶

模板¶

步骤 2, 返回顶部¶

1B. 从现有的、结构相似的数据集创建 LabelledRagDataset¶

演示¶

模板¶

步骤 2, 返回顶部¶

1C. 从头开始使用手动构造的示例创建 LabelledRagDataset¶

演示：¶

模板¶

返回顶部¶

2. 生成基线评估结果¶

演示¶

模板¶

返回顶部¶

3. 准备 card.json 和 README.md¶

3A. 使用 LlamaDatasetMetadataPack 自动生成¶

演示¶

模板¶

步骤 4, 返回顶部¶

3B. 手动生成¶

card.json¶

演示¶

模板¶

README.md¶

演示¶

模板¶

返回顶部¶

4. 向 llama-hub 仓库提交 Pull Request¶

4a. 在 llama_hub/llama_datasets 下创建一个新目录，并添加您的 card.json 和 README.md：¶

4b. 在 llama_hub/llama_datasets/library.json 中创建一个条目¶

library.json 演示¶

library.json 模板¶

4c. git add 并 commit 您的更改，然后 push 到您的 fork¶

返回顶部¶

5. 向 llama-datasets 仓库提交 Pull Request¶

5a. 在 llama_datasets/ 下创建一个新目录：¶

5b. git add 并 commit 您的更改，然后 push 到您的 fork¶

返回顶部¶

瞧！¶

`LlamaDataset` 提交模板 Notebook¶

关于 `LabelledRagDataset` 和 `LabelledRagDataExample` 的快速入门¶

提交 `LlamaDataset` 的步骤¶

1A. 从头开始使用合成构造的示例创建 `LabelledRagDataset`¶

步骤 2, 返回顶部 ¶

1B. 从现有的、结构相似的数据集创建 `LabelledRagDataset`¶

步骤 2, 返回顶部 ¶

1C. 从头开始使用手动构造的示例创建 `LabelledRagDataset`¶

返回顶部 ¶

返回顶部 ¶

3. 准备 `card.json` 和 `README.md`¶

3A. 使用 `LlamaDatasetMetadataPack` 自动生成¶

步骤 4, 返回顶部 ¶

`card.json`¶

`README.md`¶

返回顶部 ¶

4a. 在 `llama_hub/llama_datasets` 下创建一个新目录，并添加您的 `card.json` 和 `README.md`：¶

4b. 在 `llama_hub/llama_datasets/library.json` 中创建一个条目¶

`library.json` 演示¶

`library.json` 模板¶

4c. `git add` 并 `commit` 您的更改，然后 push 到您的 fork¶

返回顶部 ¶

5a. 在 `llama_datasets/` 下创建一个新目录：¶

5b. `git add` 并 `commit` 您的更改，然后 push 到您的 fork¶

返回顶部 ¶