Building Response Synthesis from Scratch¶
In this tutorial, we show you how to build the "LLM synthesis" component of a RAG pipeline from scratch. Given a set of retrieved nodes, we'll show you how to synthesize a response even when the retrieved context overflows the context window.
We'll walk through a few synthesis strategies:
- Create and Refine
- Tree Summarization
We are essentially unpacking our "Response Synthesis" module and exposing it to the user.
We use OpenAI as the LLM by default, but feel free to plug in any LLM you wish.
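For example, swapping in a different OpenAI model (or any other LlamaIndex-compatible LLM) is a one-line change. The model name below is only an illustrative assumption, not what the rest of this notebook uses:
from llama_index.llms.openai import OpenAI

# Hypothetical swap: any LlamaIndex LLM class can be dropped in here.
llm = OpenAI(model="gpt-3.5-turbo")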
Setup¶
We build an empty Pinecone index and define the necessary LlamaIndex wrappers/abstractions so that we can load/index data and get back a vector retriever.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-readers-file pymupdf
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-llms-openai
!pip install llama-index
Load Data¶
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
Build Pinecone Index, Get Retriever¶
We use our high-level LlamaIndex abstractions to 1) ingest data into Pinecone, and then 2) get a vector retriever.
Note that we set the chunk size to 1024.
import pinecone
import os
api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp")
/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console) from tqdm.autonotebook import tqdm
# dimensions are for text-embedding-ada-002
pinecone.create_index(
"quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)
pinecone_index = pinecone.Index("quickstart")
# [Optional] drop contents in index
pinecone_index.delete(deleteAll=True)
{}
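Note: the cells above use the older pinecone client interface (pinecone.init / pinecone.create_index). If you are on a newer pinecone-client release (v3+), the equivalent setup looks roughly like the sketch below; treat the cloud/region values as placeholders for your own project:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=api_key)
pc.create_index(
    name="quickstart",
    dimension=1536,  # dimensions are for text-embedding-ada-002
    metric="euclidean",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
)
pinecone_index = pc.Index("quickstart")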
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
# NOTE: set chunk size of 1024
splitter = SentenceSplitter(chunk_size=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, transformations=[splitter], storage_context=storage_context
)
retriever = index.as_retriever()
Given an example question, get a retrieved set of nodes.¶
We use the retriever to fetch a set of relevant nodes given a user query. These nodes are then passed to the response synthesis modules below.
query_str = (
"Can you tell me about results from RLHF using both model-based and"
" human-based evaluation?"
)
retrieved_nodes = retriever.retrieve(query_str)
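(Optional) As a quick sanity check, you can inspect the similarity score and a snippet of each retrieved node before passing them on:
# Each item is a NodeWithScore: print the score plus the first 200 characters
for node in retrieved_nodes:
    print(node.score, node.get_content()[:200].replace("\n", " "))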
Building Response Synthesis with LLMs¶
In this section, we show how to use an LLM + prompts to build a response synthesis module.
We start from a simple strategy (just stuffing the context into a prompt) and work our way up to more advanced strategies that can handle context overflow.
1. Try a Simple Prompt¶
We first try to synthesize the response using a single input prompt + LLM call.
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
llm = OpenAI(model="text-davinci-003")
qa_prompt = PromptTemplate(
"""\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: \
"""
)
Given an example question, retrieve the set of relevant nodes and try to put them all into the prompt, separated by newlines.
query_str = (
"Can you tell me about results from RLHF using both model-based and"
" human-based evaluation?"
)
retrieved_nodes = retriever.retrieve(query_str)
def generate_response(retrieved_nodes, query_str, qa_prompt, llm):
context_str = "\n\n".join([r.get_content() for r in retrieved_nodes])
fmt_qa_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
response = llm.complete(fmt_qa_prompt)
return str(response), fmt_qa_prompt
response, fmt_qa_prompt = generate_response(
retrieved_nodes, query_str, qa_prompt, llm
)
print(f"*****Response******:\n{response}\n\n")
*****Response******: RLHF used both model-based and human-based evaluation to select the best-performing models among several ablations. Model-based evaluation was used to measure the robustness of the reward model by collecting a test set of prompts for both helpfulness and safety, and asking three annotators to judge the quality of the answers based on a 7-point Likert scale. Human evaluation was used to validate major model versions. Additionally, a more general reward was trained to ensure the measure wouldn't diverge from the human preferences. Results showed that the reward models were well calibrated with the human preference annotations.
print(f"*****Formatted Prompt*****:\n{fmt_qa_prompt}\n\n")
*****Formatted Prompt*****: Context information is below. --------------------- 3.4 RLHF Results 3.4.1 Model-Based Evaluation Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can be complicated by various HCI considerations (Clark et al., 2021; Gehrmann et al., 2023), and is not always scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and increase iteration speed. We later validated major model versions with human evaluations. How Far Can Model-Based Evaluation Go? To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall are well calibrated with our human preference annotations, as illustrated in Figure 29 in the appendix. This confirms the relevance of using our reward as a point-wise metric, despite being trained with a Pairwise Ranking Loss. Still, as Goodhart’s Law states, when a measure becomes a target, it ceases to be a good measure. To ensure our measure won’t diverge from the human preferences, we additionally used a more general reward, trained 17 5 Discussion Here, we discuss the interesting properties we have observed with RLHF (Section 5.1). We then discuss the limitations of Llama 2-Chat (Section 5.2). Lastly, we present our strategy for responsibly releasing these models (Section 5.3). 5.1 Learnings and Observations Our tuning process revealed several interesting results, such as Llama 2-Chat’s abilities to temporally organize its knowledge, or to call APIs for external tools. SFT (Mix) SFT (Annotation) RLHF (V1) 0.0 0.2 0.4 0.6 0.8 1.0 Reward Model Score RLHF (V2) Figure 20: Distribution shift for progressive versions of Llama 2-Chat, from SFT models towards RLHF. Beyond Human Supervision. At the outset of the project, many among us expressed a preference for supervised annotation, attracted by its denser signal. Meanwhile reinforcement learning, known for its insta- bility, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore that the crucial determinant of RLHF’s success lies in the synergy it fosters between humans and LLMs throughout the annotation process. Even with proficient annotators, each individual writes with significant variation. A model fine-tuned on SFT annotation learns this diversity, including, unfortunately, the tail-end of poorly executed annotation. Fur- thermore, the model’s performance is capped by the writing abilities of the most skilled annotators. Human annotators are arguably less subject to discrepancy when comparing two outputs’ preference annotation for RLHF. Consequently, the reward mechanism swiftly learns to assign low scores to undesirable tail-end distribution and aligns towards the human preference. This phenomena is illustrated in Figure 20, where we can see that the worst answers are progressively removed, shifting the distribution to the right. In addition, during annotation, the model has the potential to venture into writing trajectories that even the best annotators may not chart. 
Nonetheless, humans can still provide valuable feedback when comparing two answers, beyond their own writing competencies. Drawing a parallel, while we may not all be accomplished artists, our ability to appreciate and critique art remains intact. We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF, as documented in Gilardi et al. (2023) and Huang et al. (2023). Supervised data may no longer be the gold standard, and this evolving circumstance compels a re-evaluation of the concept of “supervision.” In-Context Temperature Rescaling. We have observed an intriguing phenomenon related to RLHF, a feature not previously reported to the best of our knowledge: the dynamic re-scaling of temperature contingent upon the context. As indicated in Figure 8, the temperature appears to be influenced by RLHF. Yet, intriguingly, our findings also revealed that the shifts are not uniformly applied across all prompts, as shown in Figure 21. For instance, when it comes to prompts associated with creativity, such as “Write a poem,” an increase in temperature continues to generate diversity across our various RLHF iterations. This can be observed in the Self-BLEU slope, which mirrors a pattern comparable to that of the SFT model. On the other hand, for prompts based on factual information, such as “What is the capital of ?” the Self-BLEU slope diminishes over time. This pattern suggests that despite the rising temperature, the model learns to consistently provide the same response to factual prompts. 32 --------------------- Given the context information and not prior knowledge, answer the query. Query: Can you tell me about results from RLHF using both model-based and human-based evaluation? Answer:
Problem: What if we set the retriever's top-k to a higher value? The context will overflow!
retriever = index.as_retriever(similarity_top_k=6)
retrieved_nodes = retriever.retrieve(query_str)
response, fmt_qa_prompt = generate_response(
retrieved_nodes, query_str, qa_prompt, llm
)
print(f"Response (k=5): {response}")
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[34], line 1 ----> 1 response, fmt_qa_prompt = generate_response(retrieved_nodes, query_str, qa_prompt, llm) 2 print(f'Response (k=5): {response}') Cell In[16], line 4, in generate_response(retrieved_nodes, query_str, qa_prompt, llm) 2 context_str = "\n\n".join([r.get_content() for r in retrieved_nodes]) 3 fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str) ----> 4 response = llm.complete(fmt_qa_prompt) 5 return str(response), fmt_qa_prompt File ~/Programming/gpt_index/llama_index/llms/base.py:277, in llm_completion_callback.<locals>.wrap.<locals>.wrapped_llm_predict(_self, *args, **kwargs) 267 with wrapper_logic(_self) as callback_manager: 268 event_id = callback_manager.on_event_start( 269 CBEventType.LLM, 270 payload={ (...) 274 }, 275 ) --> 277 f_return_val = f(_self, *args, **kwargs) 278 if isinstance(f_return_val, Generator): 279 # intercept the generator and add a callback to the end 280 def wrapped_gen() -> CompletionResponseGen: File ~/Programming/gpt_index/llama_index/llms/openai.py:144, in OpenAI.complete(self, prompt, **kwargs) 142 else: 143 complete_fn = self._complete --> 144 return complete_fn(prompt, **kwargs) File ~/Programming/gpt_index/llama_index/llms/openai.py:281, in OpenAI._complete(self, prompt, **kwargs) 278 all_kwargs = self._get_all_kwargs(**kwargs) 279 if self.max_tokens is None: 280 # NOTE: non-chat completion endpoint requires max_tokens to be set --> 281 max_tokens = self._get_max_token_for_prompt(prompt) 282 all_kwargs["max_tokens"] = max_tokens 284 response = completion_with_retry( 285 is_chat_model=self._is_chat_model, 286 max_retries=self.max_retries, (...) 289 **all_kwargs, 290 ) File ~/Programming/gpt_index/llama_index/llms/openai.py:343, in OpenAI._get_max_token_for_prompt(self, prompt) 341 max_token = context_window - len(tokens) 342 if max_token <= 0: --> 343 raise ValueError( 344 f"The prompt is too long for the model. " 345 f"Please use a prompt that is less than {context_window} tokens." 346 ) 347 return max_token ValueError: The prompt is too long for the model. Please use a prompt that is less than 4097 tokens.
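To see why this fails, you can count the prompt tokens yourself before calling the LLM. The sketch below assumes the tiktoken package is installed; text-davinci-003 has a context window of roughly 4097 tokens:
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
context_str = "\n\n".join([r.get_content() for r in retrieved_nodes])
fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
# If this count approaches or exceeds ~4097, the completion call above will fail
print(len(enc.encode(fmt_qa_prompt)))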
2. Try a "Create and Refine" Strategy¶
To deal with context overflow, we can try a strategy where we synthesize a response sequentially over all nodes: start with the first node and generate an initial response, then for each subsequent node, refine the answer with the additional context.
This requires us to also define a "refine" prompt.
refine_prompt = PromptTemplate(
"""\
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer \
(only if needed) with some more context below.
------------
{context_str}
------------
Given the new context, refine the original answer to better answer the query. \
If the context isn't useful, return the original answer.
Refined Answer: \
"""
)
from llama_index.core.response.notebook_utils import display_source_node
def generate_response_cr(
retrieved_nodes, query_str, qa_prompt, refine_prompt, llm
):
"""Generate a response using create and refine strategy.
The first node uses the 'QA' prompt.
All subsequent nodes use the 'refine' prompt.
"""
cur_response = None
fmt_prompts = []
for idx, node in enumerate(retrieved_nodes):
print(f"[Node {idx}]")
display_source_node(node, source_length=2000)
context_str = node.get_content()
if idx == 0:
fmt_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
else:
fmt_prompt = refine_prompt.format(
context_str=context_str,
query_str=query_str,
existing_answer=str(cur_response),
)
cur_response = llm.complete(fmt_prompt)
fmt_prompts.append(fmt_prompt)
return str(cur_response), fmt_prompts
response, fmt_prompts = generate_response_cr(
retrieved_nodes, query_str, qa_prompt, refine_prompt, llm
)
print(str(response))
# view a sample qa prompt
print(fmt_prompts[0])
# view a sample refine prompt
print(fmt_prompts[1])
Observation: This is a start, but there are clear inefficiencies. One is that it's quite slow: we make sequential calls. The second is that each LLM call is wasteful: we insert only a single node instead of "stuffing" the prompt with as much context as it can hold.
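(As an aside, LlamaIndex's built-in synthesizer addresses the second inefficiency with its "compact" response mode, which packs as many retrieved chunks as will fit into each LLM call. A rough sketch using the built-in abstraction, shown for comparison only:)
from llama_index.core import get_response_synthesizer

# "compact" stuffs as much retrieved text as fits into each LLM call,
# then refines the answer across any remaining chunks.
synthesizer = get_response_synthesizer(response_mode="compact", llm=llm)
response = synthesizer.synthesize(query_str, nodes=retrieved_nodes)
print(str(response))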
3. Try a Hierarchical Summarization Strategy¶
Another approach is a hierarchical summarization strategy. We generate an answer for each node independently, and then hierarchically combine the answers. This "combine" step can happen once, or, for maximum generality, recursively until there is only one "root" node. That "root" node is then returned as the answer.
We implement this approach below. We fix the number of children at 10 (the num_children argument in the code), so we hierarchically combine 10 children at a time.
NOTE: In LlamaIndex this is referred to as "tree_summarize"; in LangChain this is referred to as map-reduce.
def combine_results(
texts,
query_str,
qa_prompt,
llm,
cur_prompt_list,
num_children=10,
):
new_texts = []
for idx in range(0, len(texts), num_children):
text_batch = texts[idx : idx + num_children]
context_str = "\n\n".join([t for t in text_batch])
fmt_qa_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
combined_response = llm.complete(fmt_qa_prompt)
new_texts.append(str(combined_response))
cur_prompt_list.append(fmt_qa_prompt)
if len(new_texts) == 1:
return new_texts[0]
else:
        return combine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )
def generate_response_hs(
retrieved_nodes, query_str, qa_prompt, llm, num_children=10
):
"""Generate a response using hierarchical summarization strategy.
Combine num_children nodes hierarchically until we get one root node.
"""
fmt_prompts = []
node_responses = []
for node in retrieved_nodes:
context_str = node.get_content()
fmt_qa_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
node_response = llm.complete(fmt_qa_prompt)
node_responses.append(node_response)
fmt_prompts.append(fmt_qa_prompt)
response_txt = combine_results(
[str(r) for r in node_responses],
query_str,
qa_prompt,
llm,
fmt_prompts,
num_children=num_children,
)
return response_txt, fmt_prompts
response, fmt_prompts = generate_response_hs(
retrieved_nodes, query_str, qa_prompt, llm
)
print(str(response))
The results from RLHF using both model-based and human-based evaluation showed that Llama 2-Chat models outperformed open-source models by a significant margin on both single turn and multi-turn prompts. For human-based evaluation, we compared Llama 2-Chat models to open-source models and closed-source models on over 4,000 single and multi-turn prompts. The results showed that Llama 2-Chat models outperformed the other models by a significant margin on both single turn and multi-turn prompts. The human preference annotation agreement rate was also higher on more distinct responses than similar pairs. The largest RLHF model was competitive with ChatGPT, with a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. RLHF 70B model also outperformed PaLM-bison chat model by a large percentage on the prompt set.
Observation: Notice that the answer is much more concise than with the create-and-refine approach. This is a well-known phenomenon: hierarchical summarization tends to compress information at each stage, whereas create-and-refine encourages adding more information with each node.
Observation: Similar to the section above, there are inefficiencies. We are still generating an answer for each node independently, which we could try to optimize away. Our ResponseSynthesizer module handles this!
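For reference, the built-in equivalent of this section's strategy can be used roughly as follows (use_async=True parallelizes the per-node calls, which is what the next section implements by hand):
from llama_index.core import get_response_synthesizer

synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", llm=llm, use_async=True
)
response = synthesizer.synthesize(query_str, nodes=retrieved_nodes)
print(str(response))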
4. [Optional] Let's create an async version of hierarchical summarization!¶
A pro of the hierarchical summarization approach is that the LLM calls can be parallelized, leading to big speedups in response synthesis.
We implement an async version below. We use asyncio.gather to execute the coroutines (LLM calls) for each node concurrently.
import nest_asyncio
import asyncio
nest_asyncio.apply()
async def acombine_results(
texts,
query_str,
qa_prompt,
llm,
cur_prompt_list,
num_children=10,
):
fmt_prompts = []
for idx in range(0, len(texts), num_children):
text_batch = texts[idx : idx + num_children]
context_str = "\n\n".join([t for t in text_batch])
fmt_qa_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
fmt_prompts.append(fmt_qa_prompt)
cur_prompt_list.append(fmt_qa_prompt)
tasks = [llm.acomplete(p) for p in fmt_prompts]
combined_responses = await asyncio.gather(*tasks)
new_texts = [str(r) for r in combined_responses]
if len(new_texts) == 1:
return new_texts[0]
else:
        return await acombine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )
async def agenerate_response_hs(
retrieved_nodes, query_str, qa_prompt, llm, num_children=10
):
"""Generate a response using hierarchical summarization strategy.
Combine num_children nodes hierarchically until we get one root node.
"""
fmt_prompts = []
node_responses = []
for node in retrieved_nodes:
context_str = node.get_content()
fmt_qa_prompt = qa_prompt.format(
context_str=context_str, query_str=query_str
)
fmt_prompts.append(fmt_qa_prompt)
tasks = [llm.acomplete(p) for p in fmt_prompts]
node_responses = await asyncio.gather(*tasks)
    response_txt = await acombine_results(
[str(r) for r in node_responses],
query_str,
qa_prompt,
llm,
fmt_prompts,
num_children=num_children,
)
return response_txt, fmt_prompts
response, fmt_prompts = await agenerate_response_hs(
retrieved_nodes, query_str, qa_prompt, llm
)
print(str(response))
Results from RLHF using both model-based and human-based evaluation show that larger models generally obtain higher performance for a similar volume of data. Additionally, the accuracy on more distinct responses matters the most to improve Llama 2-Chat performance. The human preference annotation agreement rate is also higher on more distinct responses than similar pairs. Furthermore, two main algorithms were explored for RLHF fine-tuning: Proximal Policy Optimization (PPO) and Rejection Sampling fine-tuning. The largest Llama 2-Chat model was found to be competitive with ChatGPT, with a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. Additionally, Llama 2-Chat 70B model outperformed PaLM-bison chat model by a large percentage on our prompt set. Inter-Rater Reliability (IRR) was measured using Gwet’s AC1/2 statistic, with scores varying between 0.37 and 0.55 depending on the specific model comparison.
Let's put it all together!¶
Let's define a simple query engine that can be initialized with a retriever, prompt, llm, etc., and have it implement a simple query function. We also implement an async version that you can use if you completed part 4 above!
NOTE: We skip subclassing our own QueryEngine abstraction. This is a big TODO to make it more easily subclassable!
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.llms import LLM
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class Response:
response: str
source_nodes: Optional[List] = None
def __str__(self):
return self.response
class MyQueryEngine:
"""My query engine.
Uses the tree summarize response synthesis module by default.
"""
def __init__(
self,
retriever: BaseRetriever,
qa_prompt: PromptTemplate,
llm: LLM,
num_children=10,
) -> None:
self._retriever = retriever
self._qa_prompt = qa_prompt
self._llm = llm
self._num_children = num_children
def query(self, query_str: str):
retrieved_nodes = self._retriever.retrieve(query_str)
response_txt, _ = generate_response_hs(
retrieved_nodes,
query_str,
self._qa_prompt,
self._llm,
num_children=self._num_children,
)
response = Response(response_txt, source_nodes=retrieved_nodes)
return response
async def aquery(self, query_str: str):
retrieved_nodes = await self._retriever.aretrieve(query_str)
response_txt, _ = await agenerate_response_hs(
retrieved_nodes,
query_str,
self._qa_prompt,
self._llm,
num_children=self._num_children,
)
response = Response(response_txt, source_nodes=retrieved_nodes)
return response
query_engine = MyQueryEngine(retriever, qa_prompt, llm, num_children=10)
response = query_engine.query(query_str)
print(str(response))
The results from RLHF using both model-based and human-based evaluation showed that larger models generally obtained higher performance for a similar volume of data. The accuracy on more distinct responses was higher than on similar pairs, indicating that learning to model human preferences becomes challenging when deciding between two similar model responses. Additionally, the largest Llama 2-Chat model was found to be competitive with ChatGPT, with a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. Llama 2-Chat 70B model was also found to outperform PaLM-bison chat model by a large percentage on the prompt set. Inter-Rater Reliability (IRR) was measured using Gwet’s AC1/2 statistic, with scores varying between 0.37 and 0.55 depending on the specific model comparison.
response = await query_engine.aquery(query_str)
print(str(response))
The results from RLHF using both model-based and human-based evaluation showed that larger models generally obtained higher performance for a similar volume of data. The accuracy on more distinct responses was higher than on similar pairs, indicating that learning to model human preferences becomes challenging when deciding between two similar model responses. Additionally, the largest Llama 2-Chat model was found to be competitive with ChatGPT, with a win rate of 36% and a tie rate of 31.5%. Human evaluations were conducted using a 7-point Likert scale helpfulness task, with Gwet’s AC2 score varying between 0.37 and 0.55 depending on the specific model comparison.
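As a follow-up to the subclassing note above: current LlamaIndex versions expose a CustomQueryEngine base class that makes this straightforward. Below is a minimal sketch reusing the generate_response_hs helper and retriever defined above; treat it as illustrative rather than the canonical implementation:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.llms import LLM


class MyCustomQueryEngine(CustomQueryEngine):
    """Same behavior as MyQueryEngine, but as a proper QueryEngine subclass."""

    retriever: BaseRetriever
    qa_prompt: PromptTemplate
    llm: LLM
    num_children: int = 10

    def custom_query(self, query_str: str) -> str:
        # Retrieve nodes, then synthesize with the tree-summarize helper above.
        retrieved_nodes = self.retriever.retrieve(query_str)
        response_txt, _ = generate_response_hs(
            retrieved_nodes,
            query_str,
            self.qa_prompt,
            self.llm,
            num_children=self.num_children,
        )
        return response_txt


custom_query_engine = MyCustomQueryEngine(
    retriever=retriever, qa_prompt=qa_prompt, llm=llm, num_children=10
)
print(str(custom_query_engine.query(query_str)))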