Extracting Metadata for Better Document Indexing and Understanding¶
In many cases, especially with long documents, a chunk of text may lack the context necessary to disambiguate it from other, similar-looking chunks. One remedy is to manually label every chunk in the dataset or knowledge base, but this can be labor-intensive and time-consuming for large or continually updated document sets.
To combat this, we use an LLM to extract contextual information specific to each document, helping both retrieval and the language model disambiguate similar-looking passages.
We do this with our new Metadata Extractor modules.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-openai
%pip install llama-index-extractors-entity
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
import os
import openai
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
We create a node parser that extracts the document title and hypothetical questions relevant to each document chunk.
We also show how to instantiate the SummaryExtractor and KeywordExtractor, as well as how to create your own custom extractor based on the BaseExtractor base class.
from llama_index.core.extractors import (
SummaryExtractor,
QuestionsAnsweredExtractor,
TitleExtractor,
KeywordExtractor,
BaseExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import TokenTextSplitter
text_splitter = TokenTextSplitter(
separator=" ", chunk_size=512, chunk_overlap=128
)
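The `chunk_size`/`chunk_overlap` settings above produce overlapping windows over the token stream. A rough, pure-Python illustration of that sliding-window idea (whitespace-separated words stand in for real tokenizer tokens; this is not LlamaIndex's implementation):

```python
# Sliding-window chunking sketch: each chunk shares `chunk_overlap` "tokens"
# with its predecessor, so context at chunk boundaries is not lost.
def chunk_with_overlap(words, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
print(chunk_with_overlap(words, chunk_size=4, chunk_overlap=2))
# → ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

Note how every adjacent pair of chunks shares two words, mirroring how `TokenTextSplitter` keeps 128 tokens of overlap between 512-token chunks.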
class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes):
        # Return one metadata dict per node, combining fields produced by
        # the upstream title and keyword extractors.
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list
extractors = [
TitleExtractor(nodes=5, llm=llm),
QuestionsAnsweredExtractor(questions=3, llm=llm),
# EntityExtractor(prediction_threshold=0.5),
# SummaryExtractor(summaries=["prev", "self"], llm=llm),
# KeywordExtractor(keywords=10, llm=llm),
# CustomExtractor()
]
transformations = [text_splitter] + extractors
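The `transformations` list is applied in order: the splitter turns documents into chunks, then each extractor adds metadata keys to those chunks. A toy, pure-Python model of that chaining (plain dicts stand in for LlamaIndex node objects, and the "extractor" is a trivial stand-in for the LLM-backed ones):

```python
# Toy ingestion pipeline: each transformation maps a list of "nodes"
# (dicts with text + metadata) to a new list, just as IngestionPipeline
# chains its transformations.
def toy_splitter(nodes):
    out = []
    for node in nodes:
        words = node["text"].split()
        for i in range(0, len(words), 4):
            out.append(
                {"text": " ".join(words[i : i + 4]), "metadata": dict(node["metadata"])}
            )
    return out


def toy_title_extractor(nodes):
    # Stand-in for TitleExtractor: tag every chunk with a "document_title".
    for node in nodes:
        node["metadata"]["document_title"] = node["text"].split()[0].title()
    return nodes


def run_toy_pipeline(documents, transformations):
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [{"text": "uber reported strong gross bookings growth in 2019", "metadata": {}}]
nodes = run_toy_pipeline(docs, [toy_splitter, toy_title_extractor])
print(len(nodes), nodes[0]["metadata"])
```

The real pipeline works the same way, except the nodes are `TextNode` objects and the extractors call the LLM to produce their metadata values.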
from llama_index.core import SimpleDirectoryReader
We first load the 10-K annual SEC filings for Uber (2019) and Lyft (2020).
!mkdir -p data
!wget -O "data/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
!wget -O "data/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"
# Note the uninformative document file name, which may be a common scenario in a production setting
uber_docs = SimpleDirectoryReader(input_files=["data/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
uber_nodes = pipeline.run(documents=uber_docs)
uber_nodes[1].metadata
{'page_label': '2', 'file_name': '10k-132.pdf', 'document_title': 'Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.', 'questions_this_excerpt_can_answer': '1. How many countries does Uber operate in?\n2. What is the total gross bookings of Uber in 2019?\n3. How many trips did Uber facilitate in 2019?'}
# Note the uninformative document file name, which may be a common scenario in a production setting
lyft_docs = SimpleDirectoryReader(
input_files=["data/10k-vFinal.pdf"]
).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
lyft_nodes = pipeline.run(documents=lyft_docs)
lyft_nodes[2].metadata
{'page_label': '2', 'file_name': '10k-vFinal.pdf', 'document_title': 'Lyft, Inc. Annual Report on Form 10-K for the Fiscal Year Ended December 31, 2020', 'questions_this_excerpt_can_answer': "1. Has Lyft, Inc. filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act?\n2. Is Lyft, Inc. considered a shell company according to Rule 12b-2 of the Exchange Act?\n3. What was the aggregate market value of Lyft, Inc.'s common stock held by non-affiliates on June 30, 2020?"}
Since we are asking fairly sophisticated questions, we utilize a subquestion query engine for all QnA pipelines below, and prompt it to pay more attention to the relevance of the retrieved sources.
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import (
DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
question_gen = LLMQuestionGenerator.from_defaults(
llm=llm,
prompt_template_str="""
Follow the example, but instead of giving a question, always prefix the question
with: 'By first identifying and quoting the most relevant sources, '.
"""
+ DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
Querying an Index With No Extra Metadata¶
from copy import deepcopy
nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
node.metadata = {
k: node.metadata[k]
for k in node.metadata
if k in ["page_label", "file_name"]
}
print(
"LLM sees:\n",
(nodes_no_metadata)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
index_no_metadata = VectorStoreIndex(
nodes=nodes_no_metadata,
)
engine_no_metadata = index_no_metadata.as_query_engine(
similarity_top_k=10, llm=OpenAI(model="gpt-4")
)
final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine_no_metadata,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies",
),
)
],
question_gen=question_gen,
use_async=True,
)
response_no_metadata = final_engine_no_metadata.query(
"""
What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
Give your answer as a JSON.
"""
)
print(response_no_metadata.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
# "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
{
  "Uber": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
RESULT: As we can see, the QnA agent does not seem to know where to look for the right documents. As a result, it completely mixes up the Uber and Lyft figures.
Querying an Index With Extracted Metadata¶
print(
"LLM sees:\n",
(uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf document_title: Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc. Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
index = VectorStoreIndex(
nodes=uber_nodes + lyft_nodes,
)
engine = index.as_query_engine(similarity_top_k=10, llm=OpenAI(model="gpt-4"))
final_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies.",
),
)
],
question_gen=question_gen,
use_async=True,
)
response = final_engine.query(
"""
What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
Give your answer as a JSON.
"""
)
print(response.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
# "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $4,626 million.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $4,836 million.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
{
  "Uber": {
    "Research and Development": 4836,
    "Sales and Marketing": 4626
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
RESULT: As we can see, the LLM answers the question correctly.
Challenges Identified in the Problem Domain¶
In this example, we observed that the search quality provided by the vector embeddings was rather poor. This was likely due to the highly dense financial documents, which were probably not well represented in the model's training set.
In order to improve search quality, other methods of neural search that employ more keyword-based approaches, such as ColBERTv2/PLAID, may help. In particular, this would help with matching on specific keywords to identify high-relevance chunks.
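Why keyword sensitivity helps here can be seen with a toy lexical scorer. This is a drastic simplification of BM25 or ColBERT-style late interaction, but it captures the intuition: exact keyword matches are a strong relevance signal for dense filings where embeddings struggle.

```python
# Toy lexical scorer: rank chunks by the number of query terms they contain.
# Real systems (BM25, ColBERTv2/PLAID) add term weighting or token-level
# embeddings, but exact-match signals are the common core.
def keyword_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))


chunks = [
    "Research and development expenses were 4,836 million in 2019",
    "Adjusted EBITDA improved year over year",
]
query = "research and development cost for uber in 2019"
best = max(chunks, key=lambda chunk: keyword_score(query, chunk))
print(best)
```

The R&D chunk wins on the literal terms "research", "development", and "2019", even though a generic embedding model might place both financial-jargon chunks close together.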
Other valid steps may include utilizing models fine-tuned on financial datasets, such as Bloomberg GPT.
Finally, we can further enrich the metadata with additional contextual information about the surrounding context in which each chunk is located.
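A toy version of that enrichment idea: attach snippets of the previous and next chunks to each chunk's metadata, so retrieval and the LLM can see where the chunk sits in the document. `SummaryExtractor(summaries=["prev", "self"])` shown earlier is the LLM-backed analogue of this, using generated summaries rather than raw neighboring text.

```python
# Enrich each chunk with a window of text from its neighbors, so that
# boundary chunks carry surrounding context in their metadata.
def add_neighbor_context(chunks, window=40):
    enriched = []
    for i, text in enumerate(chunks):
        metadata = {}
        if i > 0:
            metadata["prev_context"] = chunks[i - 1][-window:]
        if i < len(chunks) - 1:
            metadata["next_context"] = chunks[i + 1][:window]
        enriched.append({"text": text, "metadata": metadata})
    return enriched


nodes = add_neighbor_context(["chunk one", "chunk two", "chunk three"])
print(nodes[1]["metadata"])
```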
Improvements to This Example¶
Generally, this example can be improved further with more rigorous evaluation of both the metadata extraction accuracy and the accuracy and recall of the QnA pipeline. Further, incorporating a larger set of documents, as well as full-length documents (which may contain confounding passages that are harder to disambiguate), would stress-test the system we have built and suggest further improvements.