Pydantic Extractor¶
Here we test out the capabilities of PydanticProgramExtractor - the ability
to extract an entire Pydantic object using an LLM (either a standard text-completion LLM or a function-calling LLM).
The advantage of this over using a "single" metadata extractor is that we can extract multiple entities in a single LLM call.
Setup¶
In [ ]
%pip install llama-index-readers-web
%pip install llama-index-program-openai
In [ ]
import nest_asyncio
nest_asyncio.apply()
import os
import openai
In [ ]
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
openai.api_key = os.getenv("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" openai.api_key = os.getenv("OPENAI_API_KEY")
Setup the Pydantic Model¶
Here we define a basic structured schema that we want to extract. It contains:
- entities: unique entities in the text chunk
- summary: a concise summary of the text chunk
- contains_number: whether the chunk contains any numbers
This is obviously a toy example schema. We encourage you to get creative about the types of metadata you'd like to extract!
In [ ]
from pydantic import BaseModel, Field
from typing import List
In [ ]
class NodeMetadata(BaseModel):
"""Node metadata."""
entities: List[str] = Field(
..., description="Unique entities in this text chunk."
)
summary: str = Field(
..., description="A concise summary of this text chunk."
)
contains_number: bool = Field(
...,
description=(
"Whether the text chunk contains any numbers (ints, floats, etc.)"
),
)
class NodeMetadata(BaseModel): """节点元数据""" entities: List[str] = Field( ..., description="此文本块中的唯一实体。" ) summary: str = Field( ..., description="此文本块的简洁摘要。" ) contains_number: bool = Field( ..., description=( "此文本块是否包含任何数字(整数、浮点数等)" ), )
Setup the Extractor¶
Here we set up the metadata extractor. Note that we provide the prompt template so you can see what's going on under the hood.
In [ ]
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""
openai_program = OpenAIPydanticProgram.from_defaults(
output_cls=NodeMetadata,
prompt_template_str="{input}",
# extract_template_str=EXTRACT_TEMPLATE_STR
)
program_extractor = PydanticProgramExtractor(
program=openai_program, input_key="input", show_progress=True
)
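The extractor fills the extract template with each node's text (`{context_str}`) and the name of the output class (`{class_name}`) before handing the result to the program. A rough standalone illustration of that string substitution (this mimics the substitution only, not the extractor's actual internals; the sample context is made up):

```python
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

# Illustrative substitution, mirroring what the extractor does per node.
prompt = EXTRACT_TEMPLATE_STR.format(
    context_str="LLM patterns: evals, RAG, fine-tuning ...",
    class_name="NodeMetadata",
)
print(prompt)
```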
Load in Data¶
We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using the SimpleWebPageReader from LlamaHub.
In [ ]
# load in blog
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
In [ ]
from llama_index.core.ingestion import IngestionPipeline
node_parser = SentenceSplitter(chunk_size=1024)
pipeline = IngestionPipeline(transformations=[node_parser, program_extractor])
orig_nodes = pipeline.run(documents=docs)
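Conceptually, the pipeline runs two stages: the SentenceSplitter chunks each document, and the extractor then runs once per chunk, attaching the extracted fields as node metadata. A toy stand-in for that flow in plain Python (the helper names and heuristics here are hypothetical, not llama-index's actual implementation):

```python
def split_into_chunks(text: str, chunk_size: int = 1024) -> list:
    # Crude stand-in for SentenceSplitter: fixed-size character windows.
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def extract_metadata(chunk: str) -> dict:
    # Stand-in for the LLM-backed extractor: trivial heuristics instead of a model.
    return {
        "entities": [],  # an LLM would list named entities here
        "summary": chunk[:60],  # an LLM would write a real summary
        "contains_number": any(ch.isdigit() for ch in chunk),
    }


doc = "Seven patterns for LLM systems. " * 100
nodes = [
    {"text": chunk, "metadata": extract_metadata(chunk)}
    for chunk in split_into_chunks(doc)
]
print(len(nodes))  # 4
```

The real pipeline issues one LLM call per chunk, which is why the progress bars below count up to the number of nodes.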
In [ ]
orig_nodes
In [ ]
sample_entry = program_extractor.extract(orig_nodes[0:1])[0]
Extracting Pydantic object: 0%| | 0/1 [00:00<?, ?it/s]
In [ ]
display(sample_entry)
{'entities': ['eugeneyan', 'HackerNews', 'Karpathy'], 'summary': 'This section discusses practical patterns for integrating large language models (LLMs) into systems & products. It introduces seven key patterns and provides information on evaluations and benchmarks in the field of language modeling.', 'contains_number': True}
In [ ]
new_nodes = program_extractor.process_nodes(orig_nodes)
Extracting Pydantic object: 0%| | 0/29 [00:00<?, ?it/s]
In [ ]
display(new_nodes[5:7])