Time-Weighted Rerank¶
Showcase capabilities of the time-weighted node postprocessor.
In [ ]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import TimeWeightedPostprocessor
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.response.notebook_utils import display_response
from datetime import datetime, timedelta
Parse Documents into Nodes, add to Docstore¶
In this example, there are 3 different versions of PG's essay. They are largely identical, except for one specific section that details the amount of funding raised for Viaweb.
V1: $50k, V2: $30k, V3: $10k
V1: -3 hrs, V2: -2 hrs, V3: -1 hr
The idea is to encourage the index to fetch the most recent info (i.e. V3).
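The versioned files are expected under ./test_versioned_data/. If you are reproducing this without the sample data, you would need three near-identical copies of the essay that differ only in that funding figure. A stand-in sketch (the file paths match the reader calls below, but the text is a placeholder; note that the hard-coded chunk index nodes1[14] etc. used later assumes the full-length essays, not these stubs):

# hypothetical stand-in files; the real demo data is three full-length essays
from pathlib import Path

base = Path("./test_versioned_data")
base.mkdir(exist_ok=True)
for version, amount in [("v1", "$50,000"), ("v2", "$30,000"), ("v3", "$10,000")]:
    text = f"Idelle's husband Julian helped us raise {amount} in seed funding for Viaweb."
    (base / f"paul_graham_essay_{version}.txt").write_text(text)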
In [ ]
# load documents
from llama_index.core import StorageContext
now = datetime.now()
key = "__last_accessed__"
doc1 = SimpleDirectoryReader(
input_files=["./test_versioned_data/paul_graham_essay_v1.txt"]
).load_data()[0]
doc2 = SimpleDirectoryReader(
input_files=["./test_versioned_data/paul_graham_essay_v2.txt"]
).load_data()[0]
doc3 = SimpleDirectoryReader(
input_files=["./test_versioned_data/paul_graham_essay_v3.txt"]
).load_data()[0]
# define settings
from llama_index.core import Settings
Settings.text_splitter = SentenceSplitter(chunk_size=512)
# use node parser from settings to parse docs into nodes
nodes1 = Settings.text_splitter.get_nodes_from_documents([doc1])
nodes2 = Settings.text_splitter.get_nodes_from_documents([doc2])
nodes3 = Settings.text_splitter.get_nodes_from_documents([doc3])
# fetch the modified chunk from each document, set metadata
# also exclude the date from being read by the LLM
nodes1[14].metadata[key] = (now - timedelta(hours=3)).timestamp()
nodes1[14].excluded_llm_metadata_keys = [key]
nodes2[14].metadata[key] = (now - timedelta(hours=2)).timestamp()
nodes2[14].excluded_llm_metadata_keys = [key]
nodes3[14].metadata[key] = (now - timedelta(hours=1)).timestamp()
nodes3[14].excluded_llm_metadata_keys = [key]
# add to docstore
docstore = SimpleDocumentStore()
nodes = [nodes1[14], nodes2[14], nodes3[14]]
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
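Before building the index, it is worth sanity-checking that each chunk carries the intended timestamp. A small inspection snippet (not part of the original flow):

# optional sanity check: print each versioned chunk's last-accessed time
for label, node in zip(["v1", "v2", "v3"], nodes):
    ts = datetime.fromtimestamp(node.metadata[key])
    print(label, ts.isoformat())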
Build Index¶
In [ ]
# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)
Define Recency Postprocessor¶
In [ ]
node_postprocessor = TimeWeightedPostprocessor(
time_decay=0.5, time_access_refresh=False, top_k=1
)
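With time_decay=0.5, each hour of age roughly halves a node's recency weight, assuming the postprocessor uses a generative-agents-style factor of (1 - time_decay) ** hours_passed (check your installed version's source for the exact scoring formula). A quick illustration of why the 1-hour-old V3 chunk wins:

# illustrative only: recency factor (1 - time_decay) ** hours_passed
time_decay = 0.5
for version, hours_passed in [("v1", 3), ("v2", 2), ("v3", 1)]:
    print(f"{version}: recency factor = {(1 - time_decay) ** hours_passed:.3f}")
# v1: 0.125, v2: 0.250, v3: 0.500 -> the most recent chunk dominates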
Query Index¶
In [ ]
# naive query
query_engine = index.as_query_engine(
similarity_top_k=3,
)
response = query_engine.query(
"How much did the author raise in seed funding from Idelle's husband"
" (Julian) for Viaweb?",
)
In [ ]
display_response(response)
Final Response:
$50,000
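The naive query surfaces the stale $50,000 figure from V1: plain similarity search has no notion of recency, and the three near-identical chunks score almost the same. To see what was retrieved, you can inspect the source nodes (an inspection sketch, not in the original notebook):

# inspect the retrieved chunks and their similarity scores
for n in response.source_nodes:
    ts = datetime.fromtimestamp(n.node.metadata[key])
    print(f"score={n.score:.3f}, last_accessed={ts.isoformat()}")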
In [ ]
# query using time weighted node postprocessor
query_engine = index.as_query_engine(
similarity_top_k=3, node_postprocessors=[node_postprocessor]
)
response = query_engine.query(
"How much did the author raise in seed funding from Idelle's husband"
" (Julian) for Viaweb?",
)
In [ ]
display_response(response)
Final Response:
The author raised $10,000 in seed funding from Idelle's husband (Julian) for Viaweb.
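Note that time_access_refresh was set to False above, so queries do not mutate the stored timestamps. If the flag behaves as its name suggests, flipping it makes retrieved nodes refresh their last-accessed time, so frequently queried chunks decay more slowly (a hypothetical variant of the same constructor):

# hypothetical variant: refresh last-accessed time on each retrieval
refreshing_postprocessor = TimeWeightedPostprocessor(
    time_decay=0.5, time_access_refresh=True, top_k=1
)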
Query Index (Lower-Level Usage)¶
In this example we first get the full set of nodes from a query call, then send them to the node postprocessor, and finally synthesize the response through a summary index.
In [ ]
from llama_index.core import SummaryIndex
In [ ]
query_str = (
"How much did the author raise in seed funding from Idelle's husband"
" (Julian) for Viaweb?"
)
In [ ]
query_engine = index.as_query_engine(
similarity_top_k=3, response_mode="no_text"
)
init_response = query_engine.query(
query_str,
)
resp_nodes = [n for n in init_response.source_nodes]
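At this point resp_nodes holds the full top-3 set from similarity search, before any recency weighting. A quick check (illustrative):

# all three versioned chunks should come back before reranking
print(len(resp_nodes))  # expected: 3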
In [ ]
# get the post-processed nodes -- which should be the top-1 sorted by date
new_resp_nodes = node_postprocessor.postprocess_nodes(resp_nodes)
summary_index = SummaryIndex([n.node for n in new_resp_nodes])
query_engine = summary_index.as_query_engine()
response = query_engine.query(query_str)
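Since the postprocessor was built with top_k=1, it should hand back exactly one node, the most recently timestamped V3 chunk. A quick check (illustrative):

# the recency reranker keeps only the single most recent chunk
assert len(new_resp_nodes) == 1
print(datetime.fromtimestamp(new_resp_nodes[0].node.metadata[key]).isoformat())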
In [ ]
display_response(response)
Final Response:
The author raised $10,000 in seed funding from Idelle's husband (Julian) for Viaweb.