Redis Docstore+Index Store Demo¶
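This guide shows you how to directly use the Redis-backed DocumentStore abstraction and IndexStore abstraction. By putting nodes in the docstore, you can define multiple indexes over the same underlying docstore instead of duplicating data across indexes. The indexes themselves are also stored in Redis, through the IndexStore.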
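To make the "one docstore, many indexes" idea concrete, here is a minimal sketch that uses plain dictionaries in place of Redis. `ToyDocstore` and `ToyIndex` are hypothetical names invented for illustration. They are not LlamaIndex classes and do not reflect how the library stores data internally.

```python
# Toy illustration only: plain dicts stand in for the Redis-backed stores.
from dataclasses import dataclass, field


@dataclass
class ToyDocstore:
    """Holds each node's text once, keyed by node ID."""

    docs: dict = field(default_factory=dict)

    def add(self, node_id: str, text: str) -> None:
        self.docs[node_id] = text


@dataclass
class ToyIndex:
    """Stores only node IDs; the text stays in the shared docstore."""

    docstore: ToyDocstore
    node_ids: list = field(default_factory=list)

    def insert(self, node_id: str) -> None:
        self.node_ids.append(node_id)

    def texts(self) -> list:
        return [self.docstore.docs[node_id] for node_id in self.node_ids]


docstore = ToyDocstore()
docstore.add("n1", "I grew up writing short stories.")
docstore.add("n2", "Later I worked on Lisp.")

# Two indexes reference the same nodes without copying the text.
index_a = ToyIndex(docstore)
index_b = ToyIndex(docstore)
for node_id in docstore.docs:
    index_a.insert(node_id)
    index_b.insert(node_id)

print(len(docstore.docs))  # 2
```

Adding a second index grows only the list of IDs, so the docstore size stays the same however many indexes are built on top of it.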
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-storage-docstore-redis
%pip install llama-index-storage-index-store-redis
%pip install llama-index-llms-openai
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
import logging
import sys
import os
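# Send INFO-level logs to stdout
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# This attaches a second stdout handler, so each message is printed twice:
# once with the "LEVEL:logger:" prefix and once as the bare message
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))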
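The two logging calls above each attach a stdout handler to the root logger, which is why every message in the cell outputs in this notebook appears twice. The sketch below reproduces the effect in isolation. It uses a separate logger (the name `handler_demo` is made up for this example) and an in-memory buffer so it does not disturb the notebook's own logging setup.

```python
import io
import logging

# A dedicated logger and buffer keep the demo isolated from the root logger.
buffer = io.StringIO()
demo_logger = logging.getLogger("handler_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.handlers.clear()

# First handler: same format that logging.basicConfig uses by default.
formatted = logging.StreamHandler(buffer)
formatted.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# Second handler: no formatter, so it prints the message text only.
bare = logging.StreamHandler(buffer)

demo_logger.addHandler(formatted)
demo_logger.addHandler(bare)
demo_logger.info("hello")

lines = buffer.getvalue().splitlines()
print(lines)  # ['INFO:handler_demo:hello', 'hello']
```

If the duplicated output is distracting, dropping the second `addHandler` call is one way to get a single line per message.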
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.core import SummaryIndex
from llama_index.core import ComposableGraph
from llama_index.llms.openai import OpenAI
from llama_index.core.response.notebook_utils import display_response
from llama_index.core import Settings
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load Documents¶
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()
Parse into Nodes¶
from llama_index.core.node_parser import SentenceSplitter
nodes = SentenceSplitter().get_nodes_from_documents(documents)
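As a rough picture of what sentence-aware chunking does, the sketch below greedily packs whole sentences into chunks under a size limit. It is a simplified stand-in, not the `SentenceSplitter` algorithm: `toy_sentence_chunks` is a hypothetical helper, and it counts characters rather than tokens.

```python
import re


def toy_sentence_chunks(text: str, max_chars: int = 80) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "I wrote stories. I also programmed. Later I studied art."
chunks = toy_sentence_chunks(sample, max_chars=40)
print(chunks)
# ['I wrote stories. I also programmed.', 'Later I studied art.']
```

Each resulting chunk plays the role of a node: a self-contained piece of text that can be stored once and referenced by ID.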
Add to Docstore¶
REDIS_HOST = os.getenv("REDIS_HOST", "127.0.0.1")
# os.getenv returns a string when the variable is set, so cast to int
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.storage.index_store.redis import RedisIndexStore
storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        host=REDIS_HOST, port=REDIS_PORT, namespace="llama_index"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        host=REDIS_HOST, port=REDIS_PORT, namespace="llama_index"
    ),
)
storage_context.docstore.add_documents(nodes)
len(storage_context.docstore.docs)
20
Define Multiple Indexes¶
Each index uses the same underlying nodes.
summary_index = SummaryIndex(nodes, storage_context=storage_context)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17050 tokens
> [build_index_from_nodes] Total embedding token usage: 17050 tokens
keyword_table_index = SimpleKeywordTableIndex(
    nodes, storage_context=storage_context
)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
# NOTE: the docstore still has the same nodes
len(storage_context.docstore.docs)
20
Test out saving and loading¶
# NOTE: the docstore and index_store are persisted in Redis by default
# NOTE: here we only need to persist the simple vector store to disk
storage_context.persist(persist_dir="./storage")
# note down index IDs
list_id = summary_index.index_id
vector_id = vector_index.index_id
keyword_id = keyword_table_index.index_id
from llama_index.core import load_index_from_storage
# re-create storage context
storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        host=REDIS_HOST, port=REDIS_PORT, namespace="llama_index"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        host=REDIS_HOST, port=REDIS_PORT, namespace="llama_index"
    ),
)
# load indices
summary_index = load_index_from_storage(
    storage_context=storage_context, index_id=list_id
)
vector_index = load_index_from_storage(
    storage_context=storage_context, index_id=vector_id
)
keyword_table_index = load_index_from_storage(
    storage_context=storage_context, index_id=keyword_id
)
INFO:llama_index.indices.loading:Loading indices with ids: ['24e98f9b-9586-4fc6-8341-8dce895e5bcc']
Loading indices with ids: ['24e98f9b-9586-4fc6-8341-8dce895e5bcc']
INFO:llama_index.indices.loading:Loading indices with ids: ['f7b2aeb3-4dad-4750-8177-78d5ae706284']
Loading indices with ids: ['f7b2aeb3-4dad-4750-8177-78d5ae706284']
INFO:llama_index.indices.loading:Loading indices with ids: ['9a9198b4-7cb9-4c96-97a7-5f404f43b9cd']
Loading indices with ids: ['9a9198b4-7cb9-4c96-97a7-5f404f43b9cd']
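Because all three indexes live in the same index store, each one has to be loaded by its own ID, which is why the IDs were noted down before the storage context was re-created. The sketch below illustrates that lookup with a dictionary. `ToyIndexStore` is a hypothetical class for illustration only and is not how `RedisIndexStore` is implemented.

```python
import uuid


class ToyIndexStore:
    """Keeps index metadata keyed by a generated index ID."""

    def __init__(self):
        self._structs = {}

    def add(self, kind: str) -> str:
        index_id = str(uuid.uuid4())
        self._structs[index_id] = {"kind": kind}
        return index_id

    def load(self, index_id=None) -> dict:
        if index_id is None:
            # Without an ID the request is only unambiguous for one index.
            if len(self._structs) != 1:
                raise ValueError("Several indexes stored; pass an index_id.")
            return next(iter(self._structs.values()))
        return self._structs[index_id]


store = ToyIndexStore()
toy_list_id = store.add("summary")
toy_vector_id = store.add("vector")
toy_keyword_id = store.add("keyword_table")

print(store.load(toy_vector_id)["kind"])  # vector
```

Calling `store.load()` with no ID raises `ValueError` here, since three indexes share the store and the toy store cannot tell which one is wanted.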
Test out some Queries¶
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
Settings.llm = chatgpt
Settings.chunk_size = 1024
query_engine = summary_index.as_query_engine()
list_response = query_engine.query("What is a summary of this document?")
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 26111 tokens
> [get_response] Total LLM token usage: 26111 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
display_response(list_response)
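Final Response:
This document recounts the author's journey from writing and programming as a young person to pursuing a career in art. He describes his experiences in high school, college, and graduate school, and how he eventually decided to pursue art as a career. He applied to art schools and was eventually accepted to RISD and the Accademia di Belli Arti in Florence. He passed the entrance exam for the Accademia and began studying art there. He then moved to New York and worked freelance while writing a book on Lisp. He eventually started a company to put art galleries online, but it was unsuccessful. He then shifted to creating software to build online stores, which was eventually successful. He had the idea to run the software on the server and let users control it by clicking on links, which meant that users only needed a browser. This kind of software, known as an "internet storefront", was eventually successful. He and his team worked hard to make the software user-friendly and inexpensive, and the company was eventually acquired by Yahoo. After the sale, he left to pursue his dream of painting, and eventually found success in New York. He was able to afford luxuries such as taxis and restaurants, and he experimented with a new kind of still life. He also had the idea to create a web app for making web apps, which he eventually pursued and was successful with. He then started Y Combinator, an investment firm focused on helping startups, with his own money and the help of his friends Robert and Trevor. He wrote essays and books, invited undergrads to apply to the Summer Founders Program, and eventually married Jessica Livingston. After his mother's death, he decided to quit Y Combinator and pursue painting, but eventually ran out of steam and started writing essays and working on Lisp again. He wrote a new Lisp, called Bel, in Arc, and it took him four years to complete. During this time, he worked hard to make the language user-friendly and precise, and he also took time to enjoy life with his family. He encountered various obstacles along the way, such as customs that continued to constrain him even after the restrictions that caused them had disappeared, and he also had to deal with misinterpretations of his essays on forums. In the end, he was able to create Bel and pursue his dream of painting.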
query_engine = vector_index.as_query_engine()
vector_response = query_engine.query("What did the author do growing up?")
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 0 tokens
> [get_response] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
display_response(vector_response)
Final Response:
None
query_engine = keyword_table_index.as_query_engine()
keyword_response = query_engine.query(
    "What did the author do after his time at YC?"
)
INFO:llama_index.indices.keyword_table.retrievers:> Starting query: What did the author do after his time at YC?
> Starting query: What did the author do after his time at YC?
INFO:llama_index.indices.keyword_table.retrievers:query keywords: ['action', 'yc', 'after', 'time', 'author']
query keywords: ['action', 'yc', 'after', 'time', 'author']
INFO:llama_index.indices.keyword_table.retrievers:> Extracted keywords: ['yc', 'time']
> Extracted keywords: ['yc', 'time']
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 10216 tokens
> [get_response] Total LLM token usage: 10216 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
display_response(keyword_response)
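Final Response:
After his time at YC, the author decided to pursue painting and writing. He wanted to see how good he could get if he really focused on it, so he started painting the day he stopped working on YC. He spent most of 2014 painting and got better than he had been before. He also wrote essays and started working on Lisp again in March 2015. He then spent four years working on a new Lisp, called Bel, which he wrote himself in Arc. He had to ban himself from writing essays during most of this time, and he moved to England in the summer of 2016. He also wrote a book about Lisp hacking, called On Lisp, which was published in 1993. In the fall of 2019, Bel was finally finished. He also experimented with a new kind of still life, and tried to build a web app for making web apps, which he named Aspra. He eventually decided to build a subset of this app as an open source project, which was the new Lisp dialect he called Arc.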
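The retriever log for the keyword query shows keywords being extracted from the question and matched against the index. The sketch below shows the general idea of a keyword table: a mapping from keyword to the IDs of the nodes that contain it. It is a toy under stated assumptions, not the `SimpleKeywordTableIndex` implementation: the stopword list, the helper names, and the ranking by overlap count are all invented for this example.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "at", "did", "do", "his", "what", "after", "in", "i"}


def extract_keywords(text: str) -> set:
    """Lowercase, strip punctuation, and drop stopwords."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def build_keyword_table(nodes: dict) -> dict:
    """Map each keyword to the set of node IDs whose text contains it."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def retrieve(table: dict, query: str) -> list:
    """Return node IDs ordered by number of matching keywords, then by ID."""
    hits = defaultdict(int)
    for keyword in extract_keywords(query):
        for node_id in table.get(keyword, ()):
            hits[node_id] += 1
    return sorted(hits, key=lambda node_id: (-hits[node_id], node_id))


toy_nodes = {
    "n1": "I started painting after leaving YC.",
    "n2": "In 2015 I started working on Lisp again.",
    "n3": "YC funded many startups.",
}
table = build_keyword_table(toy_nodes)
results = retrieve(table, "What did the author do after his time at YC?")
print(results)  # ['n1', 'n3']
```

Only the nodes that mention "yc" are returned, and the node about Lisp is skipped because it shares no keyword with the question. That is the basic trade-off of keyword lookup compared with embedding-based retrieval.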