Stress-Testing Long Context LLMs with a Recall Task
In this section we stress-test the long-context recall capabilities of GPT-4 and Claude v2. This is inspired by Greg Kamradt's tweet.
Similarly, we analyze the "needle in a haystack" recall capabilities of long-context LLMs. We incrementally extend that experiment by 1) adding Claude, and 2) testing recall when the context exceeds the context window, which triggers response synthesis strategies.
We use one fixed document, the 2021 Uber 10-K, which contains ~290k tokens.
In [ ]
%pip install llama-index-llms-openai
%pip install llama-index-llms-anthropic
In [ ]
import nest_asyncio
nest_asyncio.apply()
In [ ]
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.core.evaluation import CorrectnessEvaluator
Setup Data / Indexes
We load in the Uber 10-K.
In [ ]
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
--2023-11-09 00:35:55--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/10k/uber_2021.pdf’

data/10k/uber_2021. 100%[===================>]   1.79M  --.-KB/s    in 0.1s

2023-11-09 00:36:04 (18.2 MB/s) - ‘data/10k/uber_2021.pdf’ saved [1880483/1880483]

--2023-11-09 00:36:04--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440303 (1.4M) [application/octet-stream]
Saving to: ‘data/10k/lyft_2021.pdf’

data/10k/lyft_2021. 100%[===================>]   1.37M  --.-KB/s    in 0.06s

2023-11-09 00:36:05 (24.7 MB/s) - ‘data/10k/lyft_2021.pdf’ saved [1440303/1440303]
In [ ]
## load data
uber_docs0 = SimpleDirectoryReader(
input_files=["./data/10k/uber_2021.pdf"]
).load_data()
uber_doc = Document(text="\n\n".join([d.get_content() for d in uber_docs0]))
We print the number of tokens below. Note that this exceeds the context window of existing LLMs, requiring a response synthesis strategy.
In [ ]
# count the number of tokens
from llama_index.core.utils import globals_helper
num_tokens = len(globals_helper.tokenizer(uber_doc.get_content()))
print(f"NUM TOKENS: {num_tokens}")
NUM TOKENS: 291129
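As a quick sanity check, we can compare this count against a model's context window. This is a minimal sketch, assuming the LlamaIndex LLM classes expose the window size via `metadata.context_window`:

In [ ]
# Minimal sketch: compare the document's token count against the model's
# context window (assumes the LLM exposes `metadata.context_window`).
check_llm = OpenAI(model="gpt-4-1106-preview")
print(f"Context window: {check_llm.metadata.context_window}")
print(f"Document tokens: {num_tokens}")
print(f"Exceeds window: {num_tokens > check_llm.metadata.context_window}")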
Try Out Different Experiments
Define Context String
Here we insert a single context sentence and "hide" it at different positions within the overall document.
In [ ]
context_str = "Jerry's favorite snack is Hot Cheetos."
query_str = "What is Jerry's favorite snack?"
context_str = "Jerry's favorite snack is Hot Cheetos." query_str = "What is Jerry's favorite snack?"
In [ ]
def augment_doc(doc_str, context, position):
"""Augment doc with additional context at a given position."""
doc_str1 = doc_str[:position]
doc_str2 = doc_str[position:]
return f"{doc_str1}...\n\n{context}\n\n...{doc_str2}"
In [ ]
test_str = augment_doc(
uber_doc.get_content(), context_str, int(0.5 * len(uber_doc.get_content()))
)
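As an optional sanity check (a small sketch using only the variables defined above), we can verify that the sentence actually landed roughly halfway through the augmented document:

In [ ]
# Optional sanity check: confirm the injected sentence is present and
# sits roughly in the middle of the augmented document.
assert context_str in test_str
insert_pos = test_str.find(context_str)
print(f"Injected at ~{insert_pos / len(test_str):.0%} of the document")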
Define Experiment Loop
The experiment loop is the following:
- Go through each position in a set of positions (represented as percentiles of the document length).
- For each position, inject the context string at that position.
- Load the entire document into our SummaryIndex and get the corresponding query engine.
- When asking a question, we trigger response synthesis over the entire document (either create-and-refine or tree summarize).
- Compare the predicted response against the expected response with our CorrectnessEvaluator.
In [ ]
async def run_experiments(
    doc, position_percentiles, context_str, query, llm, response_mode="compact"
):
    # use a separate LLM to judge the correctness of each response
    eval_llm = OpenAI(model="gpt-4-1106-preview")

    correctness_evaluator = CorrectnessEvaluator(llm=eval_llm)
    eval_scores = {}
    for position_percentile in position_percentiles:
        print(f"Position percentile: {position_percentile}")
        # inject the context string at the given position
        position_idx = int(position_percentile * len(doc.get_content()))
        new_doc_str = augment_doc(
            doc.get_content(), context_str, position_idx
        )
        new_doc = Document(text=new_doc_str)
        index = SummaryIndex.from_documents(
            [new_doc],
        )
        query_engine = index.as_query_engine(
            response_mode=response_mode, llm=llm
        )
        print(f"Query: {query}")

        # uncomment for async
        # response = await query_engine.aquery(query)
        response = query_engine.query(query)
        print(f"Response: {str(response)}")
        eval_result = correctness_evaluator.evaluate(
            query=query, response=str(response), reference=context_str
        )
        eval_score = eval_result.score
        print(f"Eval score: {eval_score}")
        eval_scores[position_percentile] = eval_score
    return eval_scores
In [ ]
position_percentiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
In [ ]
llm = OpenAI(model="gpt-4-1106-preview")
eval_scores_gpt4 = await run_experiments(
    uber_doc,
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="compact",
)
llm = OpenAI(model="gpt-4-1106-preview") eval_scores_gpt4 = await run_experiments( [uber_doc], position_percentiles, context_str, query_str, llm, response_mode="compact", )
Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.2
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.3
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.4
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
Position percentile: 0.5
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is not specified in the provided information.
Eval score: 2.0
Position percentile: 0.6
Query: What is Jerry's favorite snack?
Response: Repeat the original answer.
Eval score: 1.0
Position percentile: 0.7
Query: What is Jerry's favorite snack?
Response: Repeat the original answer.
Eval score: 1.0
Position percentile: 0.8
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 0.9
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 1.0
Query: What is Jerry's favorite snack?
Response: Hot Cheetos.
Eval score: 5.0
In [ ]
llm = OpenAI(model="gpt-4-1106-preview")
eval_scores_gpt4_ts = await run_experiments(
    uber_doc,
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="tree_summarize",
)
llm = OpenAI(model="gpt-4-1106-preview") eval_scores_gpt4_ts = await run_experiments( [uber_doc], position_percentiles, context_str, query_str, llm, response_mode="tree_summarize", )
Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 0.2
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack as there is no information provided about Jerry or his snack preferences.
Eval score: 2.0
Position percentile: 0.3
Query: What is Jerry's favorite snack?
Response: Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
Position percentile: 0.4
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 0.5
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information available.
Eval score: 2.0
Position percentile: 0.6
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack as there is no information provided about his preferences.
Eval score: 2.0
Position percentile: 0.7
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 0.8
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack as there is no information provided about Jerry's preferences.
Eval score: 2.0
Position percentile: 0.9
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information provided.
Eval score: 1.0
Position percentile: 1.0
Query: What is Jerry's favorite snack?
Response: It is not possible to determine Jerry's favorite snack from the information available.
Eval score: 2.0
In [ ]
llm = Anthropic(model="claude-2")
eval_scores_anthropic = await run_experiments(
    uber_doc, position_percentiles, context_str, query_str, llm
)
llm = Anthropic(model="claude-2") eval_scores_anthropic = await run_experiments( [uber_doc], position_percentiles, context_str, query_str, llm )
Position percentile: 0.0
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine what Jerry's favorite snack is, as the new context provided does not contain any information about his preferences or favorite snacks. Without more details about Jerry as an individual, I cannot refine my original answer about his favorite snack. I would need additional information about his tastes, habits, or direct statements from him about his snack preferences in order to update my response. The new context alone does not give me any clues to determine his favorite snack.
Eval score: 2.0
Position percentile: 0.1
Query: What is Jerry's favorite snack?
Response: I apologize, but the new context you provided does not contain any information about someone named Jerry or what his favorite snack is. The new context discusses an intercreditor agreement, secured obligations, liens and other legal/financial details related to Uber Technologies. It does not mention anything about a person named Jerry or his favorite snacks. Since the new context does not provide any relevant information, I do not have enough details to determine or provide an answer regarding Jerry's favorite snack.
Eval score: 1.0
Position percentile: 0.2
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine what Jerry's favorite snack is. The new context provided does not contain any specific information about Jerry's snack preferences. Without more details about Jerry as an individual, such as his tastes, habits, or direct statements about favorite snacks, I cannot confidently provide or refine an answer. I apologize that I cannot give a more substantive response without additional relevant information about Jerry's favorite snack.
Eval score: 2.0
Position percentile: 0.3
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine Jerry's favorite snack. The original answer stated that this information was not provided in the context, and the new context does not add any relevant details about Jerry's preferences. Since the original answer already indicated the favorite snack was not mentioned, and the new context does not provide any additional relevant information, I cannot refine the original answer further.
Eval score: 2.0
Position percentile: 0.4
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine what Jerry's favorite snack is. The passage provided is an excerpt from a loan agreement that does not mention anyone named Jerry or provide information about snack preferences. The passage defines lien intercreditor agreement terms, but does not give any information about a person named Jerry or their favorite snacks. Without more context about Jerry and his preferences, I cannot infer what his favorite snack is based solely on the loan agreement terms provided in the new context.
Eval score: 1.0
Position percentile: 0.5
Query: What is Jerry's favorite snack?
Response: I do not have enough context to determine what Jerry's favorite snack is, as the excerpt from Uber's annual 10-K filing provided does not mention anyone named Jerry or provide any details about snack preferences. The passage discusses Uber's business operations and financials, but does not contain information to infer what a specific person named Jerry's favorite snack may be.
Without more details about who Jerry is and the context for the question, I cannot determine Jerry's favorite snack based solely on this 10-K excerpt.
Eval score: 1.0
Position percentile: 0.6
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine what Jerry's favorite snack is based on the new information provided. The new context is an excerpt from Uber's credit agreement and does not mention anything about a person named Jerry or provide any clues about his favorite snack. Since the new information does not give any details relevant to the original question, I cannot refine my previous response about Jerry's favorite snack. My original answer stating that there is insufficient information to determine Jerry's favorite snack remains applicable.
Eval score: 1.0
Position percentile: 0.7
Query: What is Jerry's favorite snack?
Response: I apologize, but the passage provided does not contain any information about someone named Jerry or their favorite snack. The passage discusses intercreditor agreements, secured obligations and other financial and legal matters unrelated to an individual's snack preferences. Without any context about Jerry or snacks mentioned in the passage, I do not have enough information to determine what Jerry's favorite snack is.
Eval score: 1.0
Position percentile: 0.8
Query: What is Jerry's favorite snack?
Response: Unfortunately I do not have enough context to determine what Jerry's favorite snack is, as the provided information is about Uber's financial agreements and does not mention anything about a person named Jerry or his snack preferences. The new context given does not provide any additional clues to help identify Jerry's favorite snack. I would need more specific information about Jerry as an individual to be able to determine his favorite snack.
Eval score: 1.0
Position percentile: 0.9
Query: What is Jerry's favorite snack?
Response: Unfortunately the new context you provided does not contain any information about Jerry or his favorite snack. The passage discusses intercreditor agreements, amendments, assignments, and other financial and legal matters related to Uber, but does not mention anything about a person named Jerry or provide any clues as to what his favorite snack might be. Since the new context does not add any relevant information about Jerry or his favorite snack, I do not have enough context to determine what Jerry's favorite snack is. My original response that there is insufficient context to determine Jerry's favorite snack remains applicable.
Eval score: 1.0
Position percentile: 1.0
Query: What is Jerry's favorite snack?
Response: Based on the additional context provided, Jerry's favorite snack is Hot Cheetos.
Eval score: 5.0
In [ ]
# NOTE: incomplete, running into timeout errors
llm = Anthropic(model="claude-2")
eval_scores_anthropic_ts = await run_experiments(
    uber_doc,
    position_percentiles,
    context_str,
    query_str,
    llm,
    response_mode="tree_summarize",
)
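Once the runs above complete, it can be helpful to visualize recall as a function of needle position. This is a minimal sketch, assuming matplotlib is installed and the score dictionaries from the completed runs are populated:

In [ ]
# Minimal sketch: plot correctness score vs. needle position for each run.
# Assumes matplotlib is installed and the eval score dicts above are populated.
import matplotlib.pyplot as plt


def plot_scores(eval_scores, label):
    positions = sorted(eval_scores.keys())
    plt.plot(positions, [eval_scores[p] for p in positions], marker="o", label=label)


plot_scores(eval_scores_gpt4, "gpt-4 (compact)")
plot_scores(eval_scores_gpt4_ts, "gpt-4 (tree_summarize)")
plot_scores(eval_scores_anthropic, "claude-2 (compact)")
plt.xlabel("Needle position (fraction of document)")
plt.ylabel("Correctness score (1-5)")
plt.ylim(0, 5.5)
plt.legend()
plt.show()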