Local Llama2 + VectorStoreIndex¶
This notebook walks you through the right steps to set up llama-2 locally with LlamaIndex. Note that you need a decent GPU to run this notebook, ideally an A100 with at least 40GB of memory.
Specifically, we look at how to use a vector store index.
Setup¶
In [ ]
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
In [ ]
!pip install llama-index ipywidgets
Configuration¶
IMPORTANT: Sign in to the HF Hub with an account that has access to the llama2 models, using the command huggingface-cli login in your console. For more details, see: https://ai.meta.com/resources/models-and-libraries/llama-downloads/.
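If you would rather authenticate from inside the notebook than from the console, a minimal sketch using huggingface_hub is shown below (the token string is a placeholder, not a real value):

# Optional alternative to `huggingface-cli login`: authenticate from Python.
# The token below is a placeholder; use a token with access to the Llama 2 repos.
from huggingface_hub import login

login(token="hf_...")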
In [ ]
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from IPython.display import Markdown, display
In [ ]
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import PromptTemplate
# Model names (make sure you have access on HF)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"
selected_model = LLAMA2_13B_CHAT
SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""
query_wrapper_prompt = PromptTemplate(
"[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
llm = HuggingFaceLLM(
context_window=4096,
max_new_tokens=2048,
generate_kwargs={"temperature": 0.0, "do_sample": False},
query_wrapper_prompt=query_wrapper_prompt,
tokenizer_name=selected_model,
model_name=selected_model,
device_map="auto",
# change these settings below depending on your GPU
model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
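Before building an index, it can be useful to sanity-check that the model loads and generates; a minimal sketch using the standard complete call (the prompt is arbitrary):

# Optional sanity check: call the LLM directly, outside of any index.
completion = llm.complete("What is a vector store index?")
print(completion.text)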
In [ ]
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
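As a quick check that the embedding model loaded correctly, you can embed a short string and inspect the vector length:

# Optional: embed a sample string and check the dimensionality.
sample_embedding = embed_model.get_text_embedding("Hello world")
print(len(sample_embedding))  # expected: 384 for bge-small-en-v1.5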
In [ ]
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
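Settings also accepts node-parsing defaults; if you want to control how documents are chunked before embedding, a sketch with illustrative values (not required for this notebook) is:

# Optional, illustrative values: control chunking globally via Settings.
Settings.chunk_size = 512
Settings.chunk_overlap = 20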
Download Data¶
In [ ]
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
In [ ]
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
In [ ]
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
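Building the index re-embeds the documents every time the notebook runs. If you want to avoid that, a minimal sketch for persisting and reloading the index (the ./storage directory name is arbitrary):

# Optional: persist the index to disk and reload it later.
from llama_index.core import StorageContext, load_index_from_storage

index.storage_context.persist(persist_dir="./storage")
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)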
Querying¶
In [ ]
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
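By default the query engine retrieves the two most similar chunks per query; if answers seem to be missing context, you can widen retrieval. A sketch with illustrative values (the variable name is just for illustration):

# Optional, illustrative: retrieve more chunks and synthesize a compact answer.
wide_query_engine = index.as_query_engine(similarity_top_k=4, response_mode="compact")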
In [ ]
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
response = query_engine.query("What did the author do growing up?") display(Markdown(f"{response}"))
The author wrote short stories growing up, programmed on an IBM 1401, and eventually convinced his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He studied philosophy in college but eventually switched to AI. He wrote essays and published them online, and worked on spam filtering and painting. He also hosted a dinner for a group of friends every Thursday night and bought a building in Cambridge.
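To see which chunks of the essay were retrieved to ground this answer, you can inspect the response's source nodes, e.g.:

# Optional: show the retrieved source chunks and their similarity scores.
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])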
Streaming Support¶
In [ ]
import time
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("What happened at interleaf?")
start_time = time.time()
token_count = 0
for token in response.response_gen:
print(token, end="")
token_count += 1
time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed
print(f"\n\nStreamed output at {tokens_per_second} tokens/s")
At Interleaf, a group of people worked on projects for customers. One of the employees told the narrator about a new thing called HTML, which was a derivative of SGML. The narrator left Interleaf to pursue art school at RISD, but continued to do freelance work for the group. Eventually, the narrator and two of his friends, Robert and Trevor, started a new company called Viaweb to create a web app that allowed users to build stores through the browser. They opened for business in January 1996 with 6 stores. The software had three main parts: the editor, the shopping cart, and the manager. Streamed output at 26.923490295496002 tokens/s
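If you only want to print the streamed answer without measuring throughput, the streaming response object can also print the stream directly, e.g.:

# Optional: let the streaming response print tokens as they arrive.
streaming_response = query_engine.query("What happened at Interleaf?")
streaming_response.print_response_stream()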