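Xorbits Inference

In this demo notebook, we show how to use Xorbits Inference (Xinference for short) to deploy a local LLM in three steps.

We use a Llama 2 chat model in GGML format as the example, but the code should apply with little change to any LLM chat model that Xinference supports. Here are a few examples: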

Name	Type	Language	Format	Size (in billions)	Quantization
llama-2-chat	RLHF Model	en	ggmlv3	7, 13, 70	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
chatglm	SFT Model	en, zh	ggmlv3	6	'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'
chatglm2	SFT Model	en, zh	ggmlv3	6	'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'
wizardlm-v1.0	SFT Model	en	ggmlv3	7, 13, 33	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
wizardlm-v1.1	SFT Model	en	ggmlv3	13	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
vicuna-v1.3	SFT Model	en	ggmlv3	7, 13	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'

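The latest complete list of supported models can be found on the official Xorbits Inference GitHub page.

🤖 Install Xinference

i. Run pip install "xinference[all]" in a terminal window.

ii. After the installation is complete, restart this Jupyter notebook.

iii. Run xinference in a new terminal window.

iv. You should see output similar to the following: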

INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997
INFO:xinference.core.service:Worker 127.0.0.1:21561 has been added successfully
INFO:xinference.deploy.worker:Xinference worker successfully started.

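v. In the endpoint description, locate the port number after the final colon. In the example above it is 9997.

vi. Set the port number in the cell below: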

%pip install llama-index-llms-xinference

port = 9997  # replace with your endpoint port number
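
Instead of reading the port off the log by eye, it can be parsed from the endpoint line. The following is a minimal sketch using only the Python standard library. The helper name port_from_log_line is hypothetical (it is not part of Xinference), and the log line is copied from the sample output above; the exact log format may differ between Xinference versions.

```python
from urllib.parse import urlsplit


def port_from_log_line(line: str) -> int:
    """Return the port of the URL that follows 'Endpoint:' in a log line."""
    _, marker, endpoint = line.partition("Endpoint:")
    if not marker:
        raise ValueError("no 'Endpoint:' found in log line")
    parsed_port = urlsplit(endpoint.strip()).port
    if parsed_port is None:
        raise ValueError("endpoint URL has no explicit port")
    return parsed_port


# Log line copied from the sample startup output above.
log_line = (
    "INFO:xinference:Xinference successfully started. "
    "Endpoint: http://127.0.0.1:9997"
)
port = port_from_log_line(log_line)
print(port)
```

Parsing the URL rather than splitting on the last colon keeps the helper correct if the endpoint ever carries a path after the port.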

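🚀 Launch a Local Model

In this step, we begin by importing the relevant libraries from llama_index.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.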

!pip install llama-index

# If Xinference can not be imported, you may need to restart jupyter notebook
from llama_index.core import SummaryIndex
from llama_index.core import (
    TreeIndex,
    VectorStoreIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex,
    SimpleDirectoryReader,
)
from llama_index.llms.xinference import Xinference
from xinference.client import RESTfulClient
from IPython.display import Markdown, display
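
Then, we launch a model and use it. This lets us connect the model to documents and queries in later steps.

Feel free to change the parameters for better performance! For the best results, a model larger than 13B is recommended. That said, a 7B model is more than enough for this short demo.

Here are some more parameter options for the Llama 2 chat model in GGML format, listed from the most space-saving to the most resource-intensive but high-performing: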

Model size (in billions): 7, 13, 70

Quantization for 7B and 13B models: q2_K, q3_K_L, q3_K_M, q3_K_S, q4_0, q4_1, q4_K_M, q4_K_S, q5_0, q5_1, q5_K_M, q5_K_S, q6_K, q8_0

Quantization for 70B models: q4_0
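
Because the 70B model is listed with only one quantization, it can help to check a size and quantization pair before asking Xinference to download anything. The sketch below only encodes the options transcribed from the list above; the names SUPPORTED and is_listed are hypothetical, and the real set of supported combinations may change between Xinference releases.

```python
# Options transcribed from the Llama 2 chat (ggmlv3) list above.
QUANTS_7B_13B = {
    "q2_K", "q3_K_L", "q3_K_M", "q3_K_S", "q4_0", "q4_1", "q4_K_M",
    "q4_K_S", "q5_0", "q5_1", "q5_K_M", "q5_K_S", "q6_K", "q8_0",
}
SUPPORTED = {
    7: QUANTS_7B_13B,
    13: QUANTS_7B_13B,
    70: {"q4_0"},
}


def is_listed(size_in_billions: int, quantization: str) -> bool:
    """Return True if the pair appears in the option list above."""
    return quantization in SUPPORTED.get(size_in_billions, set())


print(is_listed(7, "q2_K"))   # pair used later in this notebook
print(is_listed(70, "q2_K"))  # 70B is listed with q4_0 only
```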

# Define a client to send commands to xinference
client = RESTfulClient(f"http://localhost:{port}")

# Download and Launch a model, this may take a while the first time
model_uid = client.launch_model(
    model_name="llama-2-chat",
    model_size_in_billions=7,
    model_format="ggmlv3",
    quantization="q2_K",
)

# Initiate Xinference object to use the LLM
llm = Xinference(
    endpoint=f"http://localhost:{port}",
    model_uid=model_uid,
    temperature=0.0,
    max_tokens=512,
)
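
If the client call above fails to connect, a quick check that something is actually listening on the port can narrow down the problem. This is a minimal sketch using only the standard library; endpoint_is_listening is a hypothetical helper, not an Xinference API, and it only tests that a TCP connection can be opened, not that the service behind it is Xinference.

```python
import socket


def endpoint_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 9997 is the port from the sample output; substitute your own value.
if not endpoint_is_listening("127.0.0.1", 9997):
    print("Nothing is listening on port 9997; is xinference running?")
```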
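
🕺 Index the Data... and Chat!

In this step, we combine the model and the data to create a query engine. The query engine can then be used as a chatbot that answers our queries based on the given data.

We use VectorStoreIndex because it is relatively fast. That said, feel free to change the index for a different experience. Here are some available indexes, already imported in the previous step:

SummaryIndex, TreeIndex, VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex

To change the index, simply replace VectorStoreIndex with another index in the following code.

The latest complete list of all available indexes can be found in the official LlamaIndex documentation.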

# create index from the data
documents = SimpleDirectoryReader("../data/paul_graham").load_data()

# change index name in the following line
index = VectorStoreIndex.from_documents(documents=documents)

# create the query engine
query_engine = index.as_query_engine(llm=llm)
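
Optionally, we can set the temperature and the maximum answer length (in tokens) directly on the Xinference object before asking a question. This lets us change the parameters for different questions without rebuilding the query engine each time.

temperature is a number between 0 and 1 that controls the randomness of responses. Higher values increase creativity but can also lead to off-topic replies. Setting it to zero guarantees the same response every time.

max_tokens is an integer that sets an upper bound on the response length. Increase it if an answer appears cut off, but be aware that an overly long response may exceed the context window and cause errors.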
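
To see why a lower temperature makes responses more repeatable, here is a toy illustration of the textbook temperature-scaled softmax. It is a sketch under stated assumptions: the notebook does not say how Xinference or the underlying model applies temperature internally, the function name next_token_probabilities is hypothetical, and the logits are made-up scores for three candidate tokens.

```python
import math


def next_token_probabilities(logits, temperature):
    """Softmax over logits divided by temperature; 0 means greedy (argmax)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy choice: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0):
    print(t, [round(p, 3) for p in next_token_probabilities(logits, t)])
```

As the temperature rises, probability mass spreads from the top-scoring token to the alternatives, which is the source of both the added variety and the occasional off-topic reply.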

# optionally, update the temperature and max answer length (in tokens)
llm.__dict__.update({"temperature": 0.0})
llm.__dict__.update({"max_tokens": 2048})

# ask a question and display the answer
question = "What did the author do after his time at Y Combinator?"

response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))