Anthropic Prompt Caching¶
In this notebook, we demonstrate how to use Anthropic Prompt Caching with LlamaIndex abstractions.

Prompt Caching is enabled by marking cache_control in the message request.
How Prompt Caching Works¶
When you send a request with Prompt Caching enabled:

- The system checks whether the prompt prefix is already cached from a recent query.
- If found, it uses the cached version, reducing processing time and costs.
- Otherwise, it processes the full prompt and caches the prefix for future use.
Note:

A. Prompt Caching is available for the Claude 3.5 Sonnet, Claude 3 Haiku, and Claude 3 Opus models.

B. The minimum cacheable prompt length is:

1. 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
2. 2048 tokens for Claude 3 Haiku

C. Shorter prompts cannot be cached, even if they are marked with cache_control, so it can help to sanity-check the prompt length first; see the sketch below.
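Since short prompts are silently left uncached, a quick length check before relying on caching can save confusion. The snippet below is a minimal sketch using a rough characters-per-token heuristic; the 4-characters-per-token ratio and the prompt_text placeholder are assumptions for illustration only, not exact counts for Anthropic's tokenizer.

# Rough pre-check that a prompt is likely long enough to be cacheable.
# NOTE: ~4 characters per token is only a heuristic assumption; actual
# counts depend on the model's tokenizer.
MIN_CACHEABLE_TOKENS = 1024  # use 2048 for Claude 3 Haiku

def estimate_tokens(text: str) -> int:
    return len(text) // 4

prompt_text = "..."  # placeholder for the text you intend to cache
if estimate_tokens(prompt_text) < MIN_CACHEABLE_TOKENS:
    print("Prompt is likely too short to be cached, even with cache_control.")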
Setup API Keys¶
In [ ]
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # replace with your Anthropic API key
Setup LLM¶
In [ ]
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-3-5-sonnet-20240620")
Download Data¶
In this demonstration, we will use the text from the Paul Graham Essay. We will cache the text and run some queries based on it.
In [ ]
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'
--2024-12-14 18:39:03--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./paul_graham_essay.txt’

./paul_graham_essay 100%[===================>]  73.28K  --.-KB/s    in 0.04s

2024-12-14 18:39:03 (1.62 MB/s) - ‘./paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
In [ ]
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./paul_graham_essay.txt"],
).load_data()

document_text = documents[0].text
Prompt Caching¶
To enable Prompt Caching:

- Include "cache_control": {"type": "ephemeral"} with the text prompt you want to cache.
- Add extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"} to the request.
We can verify whether the text was cached by checking the following parameters:

- cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
- cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
- input_tokens: Number of input tokens that were neither read from the cache nor used to create a cache entry.
In [ ]
from llama_index.core.llms import ChatMessage, TextBlock

messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=f"{document_text}",
                type="text",
            ),
            TextBlock(
                text="\n\nWhy did Paul Graham start YC?",
                type="text",
            ),
        ],
        additional_kwargs={"cache_control": {"type": "ephemeral"}},
    ),
]

resp = llm.chat(
    messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
Let's examine the raw response.
In [ ]
resp.raw
Out [ ]
{'id': 'msg_01PAaZDTjEqcZksFiiqYH42t', 'content': [TextBlock(text='Based on the essay, it seems Paul Graham started Y Combinator (YC) for a few key reasons:\n\n1. He had experience as a startup founder with Viaweb and wanted to help other founders avoid mistakes he had made.\n\n2. He had ideas about how venture capital could be improved, like making more smaller investments in younger technical founders.\n\n3. He was looking for something new to work on after selling Viaweb to Yahoo and trying painting for a while.\n\n4. He wanted to gain experience as an investor and thought funding a batch of startups at once would be a good way to do that.\n\n5. It started as a "Summer Founders Program" to give undergrads an alternative to summer internships, but quickly grew into something more serious.\n\n6. He saw an opportunity to scale startup funding by investing in batches of companies at once.\n\n7. He was excited by the potential to help create new startups and technologies.\n\n8. It allowed him to continue working with his friends/former colleagues Robert Morris and Trevor Blackwell.\n\n9. He had built an audience through his essays that provided deal flow for potential investments.\n\nSo in summary, it was a combination of wanting to help founders, improve venture capital, gain investing experience, work with friends, and leverage his existing audience/expertise in the startup world. The initial idea evolved quickly from a summer program into a new model for seed investing.', type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=4, output_tokens=305, cache_creation_input_tokens=9, cache_read_input_tokens=17467)}
As you can see, since I have already run this a few times, both cache_creation_input_tokens and cache_read_input_tokens are greater than zero, which indicates that the text was cached correctly.
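If you only want the cache-related counters rather than the full raw payload, a minimal sketch like the following works, assuming resp.raw exposes the usage object exactly as shown in the output above:

# Pull just the caching counters from the raw Anthropic response.
usage = resp.raw["usage"]
print("input_tokens:", usage.input_tokens)
print("cache_creation_input_tokens:", usage.cache_creation_input_tokens)
print("cache_read_input_tokens:", usage.cache_read_input_tokens)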
Now, let's run another query over the same document. It should retrieve the document text from the cache, which will be reflected in cache_read_input_tokens.
In [ ]
messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=f"{document_text}",
                type="text",
            ),
            TextBlock(
                text="\n\nWhat did Paul Graham do growing up?",
                type="text",
            ),
        ],
        additional_kwargs={"cache_control": {"type": "ephemeral"}},
    ),
]

resp = llm.chat(
    messages, extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)
In [ ]
resp.raw
Out [ ]
{'id': 'msg_011TQgbpBuBkZAJeatVVcqtp', 'content': [TextBlock(text='Based on the essay, here are some key things Paul Graham did growing up:\n\n1. As a teenager, he focused mainly on writing and programming outside of school. He tried writing short stories but says they were "awful".\n\n2. At age 13-14, he started programming on an IBM 1401 computer at his school district\'s data processing center. He used an early version of Fortran.\n\n3. In high school, he convinced his father to buy a TRS-80 microcomputer around 1980. He wrote simple games, a program to predict model rocket flight, and a word processor his father used.\n\n4. He went to college intending to study philosophy, but found it boring. He then decided to switch to studying artificial intelligence (AI).\n\n5. In college, he learned Lisp programming language, which expanded his concept of what programming could be. \n\n6. For his undergraduate thesis, he reverse-engineered SHRDLU, an early natural language processing program.\n\n7. He applied to grad schools for AI and ended up going to Harvard for graduate studies.\n\n8. In grad school, he realized AI as practiced then was not going to achieve true intelligence. He pivoted to focusing more on Lisp programming.\n\n9. He started writing a book about Lisp hacking while in grad school, which was eventually published in 1993 as "On Lisp".\n\nSo in summary, his early years were focused on writing, programming (especially Lisp), and studying AI, before he eventually moved on to other pursuits after grad school. The essay provides a detailed account of his intellectual development in these areas.', type='text')], 'model': 'claude-3-5-sonnet-20240620', 'role': 'assistant', 'stop_reason': 'end_turn', 'stop_sequence': None, 'type': 'message', 'usage': Usage(input_tokens=4, output_tokens=356, cache_creation_input_tokens=0, cache_read_input_tokens=17476)}
As you can see, the response was generated using the cached text, as indicated by cache_read_input_tokens.
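As a closing sanity check (an illustrative sketch, not part of the original notebook), you can assert that this second request reused the cache without writing a new entry, assuming the entry created by the first request is still warm (ephemeral cache entries expire after a short time):

# Verify the second request hit the cache instead of creating a new entry.
usage = resp.raw["usage"]
assert usage.cache_creation_input_tokens == 0, "a new cache entry was written"
assert usage.cache_read_input_tokens > 0, "the cached prefix was not reused"
print(f"Cache hit: {usage.cache_read_input_tokens} tokens read from cache")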