Extracting Metadata for Better Document Indexing and Understanding¶
In many cases, especially with long documents, a chunk of text may lack the context necessary to disambiguate it from other, similar-looking chunks. One remedy is to manually label every chunk in the dataset or knowledge base, but this can be labor-intensive and time-consuming for large or continually updated document sets.
To combat this, we use an LLM to extract contextual information specific to each document, helping both retrieval and the language model disambiguate similar-looking passages.
We do this with our new Metadata Extractor modules.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-openai
%pip install llama-index-extractors-entity
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
import os
import openai
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
We create a node parser that extracts the document title and hypothetical questions relevant to each document chunk.
We also show how to instantiate the SummaryExtractor and KeywordExtractor, as well as how to create your own custom extractor based on the BaseExtractor base class.
from llama_index.core.extractors import (
SummaryExtractor,
QuestionsAnsweredExtractor,
TitleExtractor,
KeywordExtractor,
BaseExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import TokenTextSplitter
text_splitter = TokenTextSplitter(
separator=" ", chunk_size=512, chunk_overlap=128
)
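The `chunk_size`/`chunk_overlap` settings above produce overlapping windows over the token stream. A rough, pure-Python illustration of that sliding-window idea (whitespace-separated words stand in for real tokenizer tokens; this is not LlamaIndex's implementation):

```python
# Sliding-window chunking sketch: each chunk shares `chunk_overlap` "tokens"
# with its predecessor, so context at chunk boundaries is not lost.
def chunk_with_overlap(words, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
print(chunk_with_overlap(words, chunk_size=4, chunk_overlap=2))
# → ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

Note how every adjacent pair of chunks shares two words, mirroring how `TokenTextSplitter` keeps 128 tokens of overlap between 512-token chunks.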
class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes):
        # Return one metadata dict per node, combining fields produced by
        # the upstream title and keyword extractors.
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list
extractors = [
TitleExtractor(nodes=5, llm=llm),
QuestionsAnsweredExtractor(questions=3, llm=llm),
# EntityExtractor(prediction_threshold=0.5),
# SummaryExtractor(summaries=["prev", "self"], llm=llm),
# KeywordExtractor(keywords=10, llm=llm),
# CustomExtractor()
]
transformations = [text_splitter] + extractors
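The `transformations` list is applied in order: the splitter turns documents into chunks, then each extractor adds metadata keys to those chunks. A toy, pure-Python model of that chaining (plain dicts stand in for LlamaIndex node objects, and the "extractor" is a trivial stand-in for the LLM-backed ones):

```python
# Toy ingestion pipeline: each transformation maps a list of "nodes"
# (dicts with text + metadata) to a new list, just as IngestionPipeline
# chains its transformations.
def toy_splitter(nodes):
    out = []
    for node in nodes:
        words = node["text"].split()
        for i in range(0, len(words), 4):
            out.append(
                {"text": " ".join(words[i : i + 4]), "metadata": dict(node["metadata"])}
            )
    return out


def toy_title_extractor(nodes):
    # Stand-in for TitleExtractor: tag every chunk with a "document_title".
    for node in nodes:
        node["metadata"]["document_title"] = node["text"].split()[0].title()
    return nodes


def run_toy_pipeline(documents, transformations):
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [{"text": "uber reported strong gross bookings growth in 2019", "metadata": {}}]
nodes = run_toy_pipeline(docs, [toy_splitter, toy_title_extractor])
print(len(nodes), nodes[0]["metadata"])
```

The real pipeline works the same way, except the nodes are `TextNode` objects and the extractors call the LLM to produce their metadata values.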
from llama_index.core import SimpleDirectoryReader
We first load the 10-K annual SEC filings for Uber (2019) and Lyft (2020).
!mkdir -p data
!wget -O "data/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
!wget -O "data/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"
# Note the uninformative document file name, which may be a common scenario in a production setting
uber_docs = SimpleDirectoryReader(input_files=["data/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
uber_nodes = pipeline.run(documents=uber_docs)
uber_nodes[1].metadata
{'page_label': '2', 'file_name': '10k-132.pdf', 'document_title': 'Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.', 'questions_this_excerpt_can_answer': '1. How many countries does Uber operate in?\n2. What is the total gross bookings of Uber in 2019?\n3. How many trips did Uber facilitate in 2019?'}
# Note the uninformative document file name, which may be a common scenario in a production setting
lyft_docs = SimpleDirectoryReader(
input_files=["data/10k-vFinal.pdf"]
).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
lyft_nodes = pipeline.run(documents=lyft_docs)
lyft_nodes[2].metadata
{'page_label': '2', 'file_name': '10k-vFinal.pdf', 'document_title': 'Lyft, Inc. Annual Report on Form 10-K for the Fiscal Year Ended December 31, 2020', 'questions_this_excerpt_can_answer': "1. Has Lyft, Inc. filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act?\n2. Is Lyft, Inc. considered a shell company according to Rule 12b-2 of the Exchange Act?\n3. What was the aggregate market value of Lyft, Inc.'s common stock held by non-affiliates on June 30, 2020?"}
Since we are asking fairly sophisticated questions, we utilize a subquestion query engine for all QnA pipelines below, and prompt it to pay more attention to the relevance of the retrieved sources.
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import (
DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
question_gen = LLMQuestionGenerator.from_defaults(
llm=llm,
prompt_template_str="""
Follow the example, but instead of giving a question, always prefix the question
with: 'By first identifying and quoting the most relevant sources, '.
"""
+ DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
Querying an Index With No Extra Metadata¶
from copy import deepcopy
nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
node.metadata = {
k: node.metadata[k]
for k in node.metadata
if k in ["page_label", "file_name"]
}
print(
"LLM sees:\n",
(nodes_no_metadata)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
index_no_metadata = VectorStoreIndex(
nodes=nodes_no_metadata,
)
engine_no_metadata = index_no_metadata.as_query_engine(
similarity_top_k=10, llm=OpenAI(model="gpt-4")
)
final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine_no_metadata,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies",
),
)
],
question_gen=question_gen,
use_async=True,
)
response_no_metadata = final_engine_no_metadata.query(
"""
What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
Give your answer as a JSON.
"""
)
print(response_no_metadata.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
# "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
{
  "Uber": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
RESULT: As we can see, the QnA agent does not seem to know where to look for the right documents. As a result, it completely mixes up the Uber and Lyft figures.
Querying an Index With Extracted Metadata¶
print(
"LLM sees:\n",
(uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf document_title: Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc. Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
index = VectorStoreIndex(
nodes=uber_nodes + lyft_nodes,
)
engine = index.as_query_engine(similarity_top_k=10, llm=OpenAI(model="gpt-4"))
final_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies.",
),
)
],
question_gen=question_gen,
use_async=True,
)
response = final_engine.query(
"""
What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
Give your answer as a JSON.
"""
)
print(response.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
# "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $4,626 million.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $4,836 million.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
{
  "Uber": {
    "Research and Development": 4836,
    "Sales and Marketing": 4626
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
RESULT: As we can see, the LLM answers the question correctly.
Challenges Identified in the Problem Domain¶
In this example, we observed that the search quality provided by the vector embeddings was rather poor. This was likely due to the highly dense financial documents, which were probably not well represented in the model's training set.
In order to improve search quality, other methods of neural search that employ more keyword-based approaches, such as ColBERTv2/PLAID, may help. In particular, this would help with matching on specific keywords to identify high-relevance chunks.
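Why keyword sensitivity helps here can be seen with a toy lexical scorer. This is a drastic simplification of BM25 or ColBERT-style late interaction, but it captures the intuition: exact keyword matches are a strong relevance signal for dense filings where embeddings struggle.

```python
# Toy lexical scorer: rank chunks by the number of query terms they contain.
# Real systems (BM25, ColBERTv2/PLAID) add term weighting or token-level
# embeddings, but exact-match signals are the common core.
def keyword_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))


chunks = [
    "Research and development expenses were 4,836 million in 2019",
    "Adjusted EBITDA improved year over year",
]
query = "research and development cost for uber in 2019"
best = max(chunks, key=lambda chunk: keyword_score(query, chunk))
print(best)
```

The R&D chunk wins on the literal terms "research", "development", and "2019", even though a generic embedding model might place both financial-jargon chunks close together.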
Other valid steps may include utilizing models fine-tuned on financial datasets, such as Bloomberg GPT.
Finally, we can further enrich the metadata with additional contextual information about the surrounding context in which each chunk is located.
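A toy version of that enrichment idea: attach snippets of the previous and next chunks to each chunk's metadata, so retrieval and the LLM can see where the chunk sits in the document. `SummaryExtractor(summaries=["prev", "self"])` shown earlier is the LLM-backed analogue of this, using generated summaries rather than raw neighboring text.

```python
# Enrich each chunk with a window of text from its neighbors, so that
# boundary chunks carry surrounding context in their metadata.
def add_neighbor_context(chunks, window=40):
    enriched = []
    for i, text in enumerate(chunks):
        metadata = {}
        if i > 0:
            metadata["prev_context"] = chunks[i - 1][-window:]
        if i < len(chunks) - 1:
            metadata["next_context"] = chunks[i + 1][:window]
        enriched.append({"text": text, "metadata": metadata})
    return enriched


nodes = add_neighbor_context(["chunk one", "chunk two", "chunk three"])
print(nodes[1]["metadata"])
```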
Improvements to This Example¶
Generally, this example can be improved further with more rigorous evaluation of both the metadata extraction accuracy and the accuracy and recall of the QnA pipeline. Further, incorporating a larger set of documents, as well as full-length documents (which may contain confounding passages that are harder to disambiguate), would stress-test the system we have built and suggest further improvements.