Discord Thread Management¶
This notebook walks through the process of managing documents that come from a constantly updating data source.
In this example, we have a directory where the #issues-and-help channel on the LlamaIndex Discord is dumped periodically. We want to ensure our index always has the latest data, without duplicating any messages.
Indexing Discord Data¶
Discord data is dumped as sequential messages. Every message has useful information such as timestamps, authors, and links to parent messages if the message is part of a thread.
The help channel on our Discord commonly uses threads when solving issues, so we will group all of the messages into threads, and index each thread as its own document.
First, let's take a look at the data we are working with.
import os
print(os.listdir("./discord_dumps"))
['help_channel_dump_06_02_23.json', 'help_channel_dump_05_25_23.json']
As you can see, we have two dumps from two different dates. Let's pretend we only have the older dump to start with, and we want to build an index from that data.
First, let's explore the data a bit
import json
with open("./discord_dumps/help_channel_dump_05_25_23.json", "r") as f:
data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys: dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])

Message Count: 5087

Sample Message Keys: dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])

First Message: If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.

Last Message: Hello there! How can I use llama_index with GPU?
For convenience, I've provided a script that groups these messages into threads; see the group_conversations.py script for more details. The output file is a json list, where each item in the list is a discord thread.
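The script itself isn't reproduced here, but a rough, hypothetical sketch of the grouping logic might look like the snippet below. It assumes that reply messages carry a reference to a parent message id and that the author field has a name entry; the real script may use a different heuristic for the dump format.

# hypothetical sketch of grouping sequential messages into threads;
# see group_conversations.py for the actual logic used in this notebook
def group_messages_into_threads(messages):
    threads = {}
    for msg in messages:
        # assumption: reply messages reference a parent message id
        parent_id = (msg.get("reference") or {}).get("messageId")
        root_id = parent_id if parent_id in threads else msg["id"]
        thread = threads.setdefault(
            root_id,
            {"metadata": {"id": root_id, "timestamp": msg["timestamp"]}, "thread": ""},
        )
        # assumption: the author field contains a 'name' entry
        thread["thread"] += f"{msg['author']['name']}:\n{msg['content']}\n"
    return list(threads.values())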
!python ./group_conversations.py ./discord_dumps/help_channel_dump_05_25_23.json
Done! Written to conversation_docs.json
with open("conversation_docs.json", "r") as f:
threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys: dict_keys(['thread', 'metadata'])

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}

arminta7: Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 Thank you for making this sort of project accessible to someone like me!

ragingWater_: I had a similar problem which I solved the following way in another world: - if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you. - for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
Now we have a list of threads that we can transform into documents and index!
Create the initial index¶
from llama_index.core import Document
# create document objects using doc_id's and dates from each thread
documents = []
for thread in threads:
thread_text = thread["thread"]
thread_id = thread["metadata"]["id"]
timestamp = thread["metadata"]["timestamp"]
documents.append(
Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
)
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Let's double check which documents the index has actually ingested
print("ref_docs ingested: ", len(index.ref_doc_info))
print("number of input documents: ", len(documents))
ref_docs ingested: 767
number of input documents: 767
So far so good. Let's also check one specific thread to make sure the metadata worked, and to see how many nodes it was broken into
thread_id = threads[0]["metadata"]["id"]
print(index.ref_doc_info[thread_id])
RefDocInfo(node_ids=['0c530273-b6c3-4848-a760-fe73f5f8136e'], metadata={'date': '2023-01-02T03:36:04.191+00:00'})
Great! Our thread is short, so it was chunked directly into a single node. Furthermore, we can see the date field was set correctly.
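As an optional sanity check, we can also pull the node itself out of the docstore and confirm the metadata made it onto the node, using the node id recorded in the RefDocInfo above:

# look up the single node created for this thread and inspect its metadata
node_id = index.ref_doc_info[thread_id].node_ids[0]
node = index.docstore.get_node(node_id)
print(node.metadata)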
Next, let's back up the index so that we don't have to waste tokens indexing it again.
# save the initial index
index.storage_context.persist(persist_dir="./storage")
# load it again to confirm it worked
from llama_index.core import StorageContext, load_index_from_storage
index = load_index_from_storage(
StorageContext.from_defaults(persist_dir="./storage")
)
print("Double check ref_docs ingested: ", len(index.ref_doc_info))
Double check ref_docs ingested: 767
Refresh the index with new data!¶
Now, we suddenly remember we have that new dump of discord messages! Rather than rebuilding the entire index from scratch, we can index only the new documents using the refresh() function.
Since we manually set the doc_id of each document, LlamaIndex can compare incoming documents with the same doc_id to confirm a) whether the doc_id has actually been ingested and b) whether the content has changed.
The refresh() function will return a boolean array, indicating which documents in the input were refreshed or inserted. We can use this to confirm that only the new discord threads are inserted!
When a document's content has changed, the update() function is called, which removes the document from the index and re-inserts it.
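Conceptually, the decision refresh() makes for each incoming document boils down to a doc_id lookup plus a hash comparison against what's already in the docstore. The sketch below is only an illustration of that idea, not the library's actual internals:

# illustrative sketch only: roughly how refresh() decides what to do per document
def needs_refresh(index, doc):
    existing_hash = index.docstore.get_document_hash(doc.doc_id)
    if existing_hash is None:
        return True  # doc_id was never ingested -> insert it
    return existing_hash != doc.hash  # content changed -> update it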
import json
with open("./discord_dumps/help_channel_dump_06_02_23.json", "r") as f:
data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys: dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])

Message Count: 5286

Sample Message Keys: dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])

First Message: If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.

Last Message: Started a thread.
As we can see, the first message is the same as in the original dump. But now we have about 200 more messages, and the last message is clearly new! refresh() will make updating our index easy.
First, let's create the new threads/documents
!python ./group_conversations.py ./discord_dumps/help_channel_dump_06_02_23.json
Done! Written to conversation_docs.json
with open("conversation_docs.json", "r") as f:
threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys: dict_keys(['thread', 'metadata'])

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}

arminta7: Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 Thank you for making this sort of project accessible to someone like me!

ragingWater_: I had a similar problem which I solved the following way in another world: - if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you. - for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
# create document objects using doc_id's and dates from each thread
new_documents = []
for thread in threads:
thread_text = thread["thread"]
thread_id = thread["metadata"]["id"]
timestamp = thread["metadata"]["timestamp"]
new_documents.append(
Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
)
print("Number of new documents: ", len(new_documents) - len(documents))
Number of new documents: 13
# now, refresh!
refreshed_docs = index.refresh(
new_documents,
update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)
If a document's content has changed and it gets updated, we can pass in an extra flag, delete_from_docstore. This flag is False by default, because indexes can share the docstore. But since we only have one index, removing from the docstore is fine here.
If we kept this option as False, the document information would still be removed from the index_struct, which effectively makes that document invisible to the index.
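For context, the shared-docstore situation that motivates the False default looks roughly like the sketch below (purely illustrative, using a second index type over the same storage context). Passing delete_from_docstore=True while updating one index there would also remove documents the other index still references:

# illustrative only: two indexes built over one shared docstore
from llama_index.core import StorageContext, SummaryIndex, VectorStoreIndex

shared_storage = StorageContext.from_defaults()
vector_index = VectorStoreIndex.from_documents(documents, storage_context=shared_storage)
summary_index = SummaryIndex.from_documents(documents, storage_context=shared_storage)
# deleting from the shared docstore via one index would pull nodes out from under the other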
print("Number of newly inserted/refreshed docs: ", sum(refreshed_docs))
Number of newly inserted/refreshed docs: 15
Interesting, we have 13 new documents, but 15 documents were refreshed. Did someone edit their message? Did a thread get more text added to it? Let's find out
print(refreshed_docs[-25:])
[False, True, False, False, True, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]
new_documents[-21]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='36d308d1d2d1aa5cbfdb2f7d64709644a68805ec22a6053943f985084eec340e', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\nSiddhant Saurabh:\nI think this happened because of the error mentioned by me here https://discord.com/channels/1059199217496772688/1106229492369850468/1108453477081948280\nI think we need to re-preprocessing for such nodes, right?\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
documents[-8]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c995c43873440a9d0263de70fff664269ec70d751c6e8245b290882ec5b656a1', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
Nice! The newer document contained a thread with more messages. As you can see, refresh() was able to detect this and automatically replaced the older thread with the updated text.
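Since the in-memory index now reflects the newer dump, it's a good idea to persist it again so the copy we saved earlier stays in sync (same call we used before):

# save the refreshed index, overwriting the earlier snapshot
index.storage_context.persist(persist_dir="./storage")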