GraphRAG (Graphs + Retrieval Augmented Generation) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to effectively handle complex queries over large text datasets. While RAG excels at fetching precise information, it struggles with broader queries that require thematic understanding, a challenge that QFS addresses but cannot scale well. GraphRAG integrates these approaches to offer responsive and thorough querying capabilities across extensive, diverse text corpora.
This notebook provides guidance on constructing the GraphRAG pipeline using the LlamaIndex PropertyGraph abstractions.
NOTE: This is an approximate implementation of GraphRAG. We are currently working on a series of cookbooks that will cover the exact implementation of GraphRAG.
GraphRAG Approach¶
GraphRAG involves two steps:

1. Graph Generation - Creates a graph and builds communities and their summaries over the given document.
2. Answer to the Query - Uses the community summaries created in step 1 to answer the query.

Graph Generation:

1. Source Documents to Text Chunks: Source documents are split into smaller text chunks for easier processing.
2. Text Chunks to Element Instances: Each text chunk is analyzed to identify and extract entities and relationships, producing a list of tuples that represent these elements.
3. Element Instances to Element Summaries: The extracted entities and relationships are summarized into descriptive text blocks for each element using the LLM.
4. Element Summaries to Graph Communities: These entities, relationships, and summaries form a graph, which is subsequently partitioned into communities using algorithms such as hierarchical Leiden to establish a hierarchical structure.
5. Graph Communities to Community Summaries: The LLM generates a summary for each community, providing insight into the overall topical structure and semantics of the dataset.
Answering the Query:

Community Summaries to Global Answers: The community summaries are used to respond to user queries. This involves generating intermediate answers, which are then consolidated into a comprehensive global answer.
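Conceptually, this answering step is a map-reduce over community summaries. Below is a minimal sketch of the idea only; the generate callable and the prompt wording are hypothetical, not part of the actual implementation later in this notebook.

from typing import Callable, List

def answer_query(
    query: str,
    community_summaries: List[str],
    generate: Callable[[str], str],  # any text-in/text-out LLM call
) -> str:
    # Map: answer the query independently against each community summary.
    intermediate = [
        generate(f"Given the community summary: {summary}, answer: {query}")
        for summary in community_summaries
    ]
    # Reduce: consolidate the intermediate answers into one global answer.
    return generate("Combine these intermediate answers:\n" + "\n".join(intermediate))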
GraphRAG Pipeline Components¶
Here are the different components we implemented to build all of the processes described above.
1. Source Documents to Text Chunks: Implemented using SentenceSplitter with a chunk size of 1024 and a chunk overlap of 20 tokens.
2. Text Chunks to Element Instances AND Element Instances to Element Summaries: Implemented using GraphRAGExtractor.
3. Element Summaries to Graph Communities AND Graph Communities to Community Summaries: Implemented using GraphRAGStore.
4. Community Summaries to Global Answers: Implemented using GraphRAGQueryEngine.

Let's check into each of these components one by one and build the GraphRAG pipeline.
Installation¶
We use hierarchical_leiden from graspologic to build communities.
!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0
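To get a feel for what hierarchical_leiden returns, here is a minimal sketch on a toy NetworkX graph (a built-in example, not our news data). The .node, .cluster, and .level fields come from graspologic's HierarchicalCluster entries, which the graph store defined below relies on.

import networkx as nx
from graspologic.partition import hierarchical_leiden

# One entry per (node, cluster) assignment across hierarchy levels,
# with cluster sizes capped by max_cluster_size.
toy_graph = nx.karate_club_graph()
clusters = hierarchical_leiden(toy_graph, max_cluster_size=5)
for item in clusters[:5]:
    print(item.node, item.cluster, item.level)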
We will use a sample dataset of news articles retrieved from Diffbot, which Tomaz has conveniently made available on GitHub for easy access.
The dataset contains 2,500 samples; for ease of experimentation, we will use 50 of these samples, which include the title and text of news articles.
import pandas as pd
from llama_index.core import Document
news = pd.read_csv(
"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]
news.head()
| | title | date | text |
|---|---|---|---|
| 0 | Chevron: Best Of Breed | 2031-04-06T01:36:32.000000000+00:00 | JHVEPhoto Like many companies in the O&G secto... |
| 1 | FirstEnergy (NYSE:FE) Posts Earnings Results | 2030-04-29T06:55:28.000000000+00:00 | FirstEnergy (NYSE:FE – Get Rating) posted its ... |
| 2 | Dáil almost suspended after Sinn Féin TD put p... | 2023-06-15T14:32:11.000000000+00:00 | The Dáil was almost suspended on Thursday afte... |
| 3 | Epic’s latest tool can animate hyperrealistic ... | 2023-06-15T14:00:00.000000000+00:00 | Today, Epic is releasing a new tool designed t... |
| 4 | EU to Ban Huawei, ZTE from Internal Commission... | 2023-06-15T13:50:00.000000000+00:00 | The European Commission is planning to ban equ... |

Prepare documents as required by LlamaIndex¶
documents = [
Document(text=f"{row['title']}: {row['text']}")
for i, row in news.iterrows()
]
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4")
The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text, and enriches them by adding descriptions for entities and relationships to their properties using an LLM.
This functionality is similar to SimpleLLMPathExtractor, but includes additional enhancements for handling entity and relationship descriptions. For guidance on the implementation, you can look at similar existing extractors.
Here's a breakdown of its functionality:
Key components:

- llm: The LLM used for extraction.
- extract_prompt: The prompt template used to guide the LLM in extracting information.
- parse_fn: A function to parse the LLM's output into structured data.
- max_paths_per_chunk: Limits the number of triples extracted per text chunk.
- num_workers: For parallel processing of multiple text nodes.

Main methods:

- __call__: The entry point for processing a list of text nodes.
- acall: An asynchronous version of __call__ for improved performance.
- _aextract: The core method that processes each individual node.

Extraction process:

For each input node (text chunk):

1. The text is sent to the LLM along with the extraction prompt.
2. The LLM's response is parsed to extract entities, relationships, and descriptions of both.
3. Entities are converted into EntityNode objects. The entity description is stored in the metadata.
4. Relationships are converted into Relation objects. The relationship description is stored in the metadata.
5. These are added to the node's metadata under KG_NODES_KEY and KG_RELATIONS_KEY.

NOTE: In the current implementation, we are using only relationship descriptions. In the next implementation, we will utilize entity descriptions during the retrieval stage.
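For reference, here is a minimal sketch, with hypothetical values, of the two lists parse_fn is expected to return, matching how _aextract unpacks them below.

# parse_fn must return (entity_name, entity_type, description) tuples for
# entities and (subject, object, relation, description) tuples for
# relationships. The values below are hypothetical.
entities = [
    ("Chevron", "Company", "Chevron is a company in the O&G sector."),
]
relationships = [
    ("Chevron", "O&G sector", "operates in", "Chevron operates in the O&G sector."),
]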
import asyncio
import nest_asyncio
nest_asyncio.apply()
from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display
from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
EntityNode,
KG_NODES_KEY,
KG_RELATIONS_KEY,
Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field
class GraphRAGExtractor(TransformComponent):
"""Extract triples from a graph.
Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.
Args:
llm (LLM):
The language model to use.
extract_prompt (Union[str, PromptTemplate]):
The prompt to use for extracting triples.
parse_fn (callable):
A function to parse the output of the language model.
num_workers (int):
The number of workers to use for parallel processing.
max_paths_per_chunk (int):
The maximum number of paths to extract per chunk.
"""
llm: LLM
extract_prompt: PromptTemplate
parse_fn: Callable
num_workers: int
max_paths_per_chunk: int
def __init__(
self,
llm: Optional[LLM] = None,
extract_prompt: Optional[Union[str, PromptTemplate]] = None,
parse_fn: Callable = default_parse_triplets_fn,
max_paths_per_chunk: int = 10,
num_workers: int = 4,
) -> None:
"""Init params."""
from llama_index.core import Settings
if isinstance(extract_prompt, str):
extract_prompt = PromptTemplate(extract_prompt)
super().__init__(
llm=llm or Settings.llm,
extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
parse_fn=parse_fn,
num_workers=num_workers,
max_paths_per_chunk=max_paths_per_chunk,
)
@classmethod
def class_name(cls) -> str:
return "GraphExtractor"
def __call__(
self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
) -> List[BaseNode]:
"""Extract triples from nodes."""
return asyncio.run(
self.acall(nodes, show_progress=show_progress, **kwargs)
)
async def _aextract(self, node: BaseNode) -> BaseNode:
"""Extract triples from a node."""
assert hasattr(node, "text")
text = node.get_content(metadata_mode="llm")
try:
llm_response = await self.llm.apredict(
self.extract_prompt,
text=text,
max_knowledge_triplets=self.max_paths_per_chunk,
)
entities, entities_relationship = self.parse_fn(llm_response)
except ValueError:
entities = []
entities_relationship = []
existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
metadata = node.metadata.copy()
for entity, entity_type, description in entities:
metadata[
"entity_description"
] = description # Not used in the current implementation. But will be useful in future work.
entity_node = EntityNode(
name=entity, label=entity_type, properties=metadata
)
existing_nodes.append(entity_node)
metadata = node.metadata.copy()
for triple in entities_relationship:
subj, obj, rel, description = triple
subj_node = EntityNode(name=subj, properties=metadata)
obj_node = EntityNode(name=obj, properties=metadata)
metadata["relationship_description"] = description
rel_node = Relation(
label=rel,
source_id=subj_node.id,
target_id=obj_node.id,
properties=metadata,
)
existing_nodes.extend([subj_node, obj_node])
existing_relations.append(rel_node)
node.metadata[KG_NODES_KEY] = existing_nodes
node.metadata[KG_RELATIONS_KEY] = existing_relations
return node
async def acall(
self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
) -> List[BaseNode]:
"""Extract triples from nodes async."""
jobs = []
for node in nodes:
jobs.append(self._aextract(node))
return await run_jobs(
jobs,
workers=self.num_workers,
show_progress=show_progress,
desc="Extracting paths from text",
)
The GraphRAGStore class is an extension of the SimplePropertyGraphStore class, designed to implement the GraphRAG pipeline. Here's a breakdown of its key components and functions:
The class uses community detection algorithms to group related nodes in the graph, and then uses an LLM to generate a summary for each community.
Key methods:

- build_communities(): Converts the internal graph representation to a NetworkX graph, applies the hierarchical Leiden algorithm for community detection, collects detailed information for each community, and generates summaries for each community.
- generate_community_summary(text): Uses the LLM to generate a summary of the relationships in a community. The summary includes the names of the entities involved and a synthesis of the relationship descriptions.
- _create_nx_graph(): Converts the internal graph representation to a NetworkX graph for community detection.
- _collect_community_info(nx_graph, clusters): Collects detailed information about each node based on its community, creating a string representation of each relationship within a community.
- _summarize_communities(community_info): Generates and stores a summary for each community using the LLM.
- get_community_summaries(): Returns the community summaries, building them first if they have not been built yet.
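For orientation, _collect_community_info produces a mapping from cluster ids to relationship detail strings, roughly as in this sketch (the values are hypothetical):

# Hypothetical shape of community_info: cluster id -> list of
# "node -> neighbor -> relation -> description" strings.
community_info = {
    0: [
        "Chevron -> O&G sector -> operates in -> Chevron operates in the O&G sector.",
    ],
    1: [
        "Uber -> Israel -> exited -> Uber exited the Israeli taxi market.",
    ],
}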
import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden
from llama_index.core.llms import ChatMessage
class GraphRAGStore(SimplePropertyGraphStore):
community_summary = {}
max_cluster_size = 5
def generate_community_summary(self, text):
"""Generate summary for a given text using an LLM."""
messages = [
ChatMessage(
role="system",
content=(
"You are provided with a set of relationships from a knowledge graph, each represented as "
"entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
"relationships. The summary should include the names of the entities involved and a concise synthesis "
"of the relationship descriptions. The goal is to capture the most critical and relevant details that "
"highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
"integrates the information in a way that emphasizes the key aspects of the relationships."
),
),
ChatMessage(role="user", content=text),
]
response = OpenAI().chat(messages)
clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
return clean_response
def build_communities(self):
"""Builds communities from the graph and summarizes them."""
nx_graph = self._create_nx_graph()
community_hierarchical_clusters = hierarchical_leiden(
nx_graph, max_cluster_size=self.max_cluster_size
)
community_info = self._collect_community_info(
nx_graph, community_hierarchical_clusters
)
self._summarize_communities(community_info)
def _create_nx_graph(self):
"""Converts internal graph representation to NetworkX graph."""
nx_graph = nx.Graph()
for node in self.graph.nodes.values():
nx_graph.add_node(str(node))
for relation in self.graph.relations.values():
nx_graph.add_edge(
relation.source_id,
relation.target_id,
relationship=relation.label,
description=relation.properties["relationship_description"],
)
return nx_graph
def _collect_community_info(self, nx_graph, clusters):
"""Collect detailed information for each node based on their community."""
community_mapping = {item.node: item.cluster for item in clusters}
community_info = {}
for item in clusters:
cluster_id = item.cluster
node = item.node
if cluster_id not in community_info:
community_info[cluster_id] = []
for neighbor in nx_graph.neighbors(node):
if community_mapping[neighbor] == cluster_id:
edge_data = nx_graph.get_edge_data(node, neighbor)
if edge_data:
detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
community_info[cluster_id].append(detail)
return community_info
def _summarize_communities(self, community_info):
"""Generate and store summaries for each community."""
for community_id, details in community_info.items():
details_text = (
"\n".join(details) + "."
) # Ensure it ends with a period
self.community_summary[
community_id
] = self.generate_community_summary(details_text)
def get_community_summaries(self):
"""Returns the community summaries, building them if not already done."""
if not self.community_summary:
self.build_communities()
return self.community_summary
/usr/local/lib/python3.10/dist-packages/graspologic/models/edge_swaps.py:215: NumbaDeprecationWarning: The keyword argument 'nopython=False' was supplied. From Numba 0.59.0 the default is being changed to True and use of 'nopython=False' will raise a warning as the argument will have no effect. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. _edge_swap_numba = nb.jit(_edge_swap, nopython=False)
The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:

Main components:

- graph_store: An instance of GraphRAGStore containing the community summaries.
- llm: The language model (LLM) used for generating and aggregating answers.
Key methods:

- custom_query(query_str: str): The main entry point for processing a query. It retrieves the community summaries, generates an answer from each summary, and then aggregates these answers into a final response.
- generate_answer_from_summary(community_summary, query): Generates an answer to the query based on a single community summary, using the LLM to interpret the community summary in the context of the query.
- aggregate_answers(community_answers): Combines the individual answers from different communities into a coherent final response, using the LLM to synthesize multiple perspectives into a single, concise answer.

Query processing flow:

1. Retrieve community summaries from the graph store.
2. For each community summary, generate a specific answer to the query.
3. Aggregate all community-specific answers into a final, coherent response.
Example usage:
query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm)
response = query_engine.query("query")
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM
class GraphRAGQueryEngine(CustomQueryEngine):
graph_store: GraphRAGStore
llm: LLM
def custom_query(self, query_str: str) -> str:
"""Process all community summaries to generate answers to a specific query."""
community_summaries = self.graph_store.get_community_summaries()
community_answers = [
self.generate_answer_from_summary(community_summary, query_str)
for _, community_summary in community_summaries.items()
]
final_answer = self.aggregate_answers(community_answers)
return final_answer
def generate_answer_from_summary(self, community_summary, query):
"""Generate an answer from a community summary based on a given query using LLM."""
prompt = (
f"Given the community summary: {community_summary}, "
f"how would you answer the following query? Query: {query}"
)
messages = [
ChatMessage(role="system", content=prompt),
ChatMessage(
role="user",
content="I need an answer based on the above information.",
),
]
response = self.llm.chat(messages)
cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
return cleaned_response
def aggregate_answers(self, community_answers):
"""Aggregate individual community answers into a final, coherent response."""
# intermediate_text = " ".join(community_answers)
prompt = "Combine the following intermediate answers into a final, concise response."
messages = [
ChatMessage(role="system", content=prompt),
ChatMessage(
role="user",
content=f"Intermediate answers: {community_answers}",
),
]
final_response = self.llm.chat(messages)
cleaned_final_response = re.sub(
r"^assistant:\s*", "", str(final_response)
).strip()
return cleaned_final_response
Build End-to-End GraphRAG Pipeline¶
Now that we have all the necessary components defined, let's build the GraphRAG pipeline:

1. Create nodes/chunks from the text.
2. Build a PropertyGraphIndex using GraphRAGExtractor and GraphRAGStore.
3. Build communities and generate a summary for each community using the graph built above.
4. Create a GraphRAGQueryEngine and start querying.
Create nodes/chunks from the text¶
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
len(nodes)
50
Build PropertyGraphIndex using GraphRAGExtractor and GraphRAGStore¶
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: {{ "entities": [], "relationships": [] }}.
-An Output Example-
{{
  "entities": [
    {{
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    }},
    {{
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    }},
    {{
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }}
  ],
  "relationships": [
    {{
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    }},
    {{
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }}
  ]
}}
-Real Data-
######################
text: {text}
######################
output:"""
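A quick, optional sanity check that the template formats cleanly: the doubled braces in the example JSON survive str.format as literal braces, while {max_knowledge_triplets} and {text} are substituted (the sample text here is hypothetical).

from llama_index.core.prompts import PromptTemplate

# Fill the two template variables and print the start of the rendered prompt.
tmpl = PromptTemplate(KG_TRIPLET_EXTRACT_TMPL)
print(tmpl.format(max_knowledge_triplets=2, text="Chevron operates in the O&G sector.")[:300])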
import json
def parse_fn(response_str: str) -> Any:
json_pattern = r"\{.*\}"
match = re.search(json_pattern, response_str, re.DOTALL)
entities = []
relationships = []
if not match:
return entities, relationships
json_str = match.group(0)
try:
data = json.loads(json_str)
entities = [
(
entity["entity_name"],
entity["entity_type"],
entity["entity_description"],
)
for entity in data.get("entities", [])
]
relationships = [
(
relation["source_entity"],
relation["target_entity"],
relation["relation"],
relation["relationship_description"],
)
for relation in data.get("relationships", [])
]
return entities, relationships
except json.JSONDecodeError as e:
print("Error parsing JSON:", e)
return entities, relationships
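A quick, optional sanity check of parse_fn on a hand-written response (the values are hypothetical):

# Expected: ([('Chevron', 'Company', 'An O&G company.')], [])
sample_response = (
    '{"entities": [{"entity_name": "Chevron", "entity_type": "Company",'
    ' "entity_description": "An O&G company."}], "relationships": []}'
)
entities, relationships = parse_fn(sample_response)
print(entities)
print(relationships)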
kg_extractor = GraphRAGExtractor(
llm=llm,
extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
max_paths_per_chunk=2,
parse_fn=parse_fn,
)
from llama_index.core import PropertyGraphIndex
index = PropertyGraphIndex(
nodes=nodes,
property_graph_store=GraphRAGStore(),
kg_extractors=[kg_extractor],
show_progress=True,
)
Extracting paths from text: 100%|██████████| 50/50 [04:30<00:00, 5.41s/it] Generating embeddings: 100%|██████████| 1/1 [00:01<00:00, 1.24s/it] Generating embeddings: 100%|██████████| 4/4 [00:00<00:00, 4.22it/s]
list(index.property_graph_store.graph.nodes.values())[-1]
EntityNode(label='entity', embedding=None, properties={'relationship_description': 'Gett Taxi is a competitor of Uber in the Israeli taxi market.', 'triplet_source_id': 'e4f765e3-fdfd-48d0-92a9-36f75b5865aa'}, name='Competition')
list(index.property_graph_store.graph.relations.values())[0]
Relation(label='O&G sector', source_id='Chevron', target_id='Operates in', properties={'relationship_description': 'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.', 'triplet_source_id': '6a28dc67-0dc0-486f-8dd6-70a3502f1c8e'})
list(index.property_graph_store.graph.relations.values())[0].properties[
"relationship_description"
]
'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.'
Build communities¶
This will create communities and a summary for each community.
index.property_graph_store.build_communities()
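Optionally, you can peek at a couple of the generated summaries (the keys are graspologic cluster ids):

summaries = index.property_graph_store.get_community_summaries()
for community_id, summary in list(summaries.items())[:2]:
    print(community_id, summary[:200])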
Create QueryEngine¶
query_engine = GraphRAGQueryEngine(
graph_store=index.property_graph_store, llm=llm
)
Querying¶
response = query_engine.query(
"What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))
The document discusses various news topics across different sectors. In business, it mentions FirstEnergy, a public company listed on the New York Stock Exchange, and State Street Corporation, also listed on the NYSE. The document also covers Coinbase Global Inc.'s repurchase of $64.5 million worth of 0.50% convertible senior notes and the shutdown of the startup Protonn. In politics, it highlights the dramatic act of Sinn Féin TD John Brady during a debate on retained firefighters. In the tech industry, it discusses the European Commission's actions against ZTE Corp. and TikTok Inc. over security concerns. In sports, the document mentions Manchester United's interest in Harry Kane, Jude Bellingham's transfer from Dortmund to Real Madrid, and Maliek Collins' contract negotiation process with the Houston Texans. In the music industry, it covers BMG's acquisition of The Hollies' catalog and the distribution agreement between ADA Worldwide and Rostrum Records. In hospitality, it mentions the partnership between Supplier.io and Hyatt Hotels. In the energy sector, it discusses the partnership between GE Vernova and Amplus Solar. In gaming, it covers Square Enix's production of the unannounced game "Star Ocean: The Second Story R". In the automotive industry, it mentions the upcoming launch of the Hyundai Exter in India and Stellantis' plan to close the Belvidere Assembly Plant. In aviation, it discusses Deutsche Bank's decision to upgrade Allegiant Travel's rating from "Hold" to "Buy". In football, it covers Arsenal's rejected bid for Rice and the rejection of offers Chelsea received for Mason Mount. In the space industry, it mentions MDA Ltd.'s participation in the Jefferies Virtual Space Summit. In transportation, it discusses Uber's strategic decision to exit the Israeli market and the emergence of Yango as a major player in the Israeli taxi market.
response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))
Recent news related to the financial sector includes: Morgan Stanley's hiring of Thomas Christl to co-lead its coverage of consumer and retail clients in Europe. KeyBank expanded its presence in the western United States by opening a new branch in American Fork and donated $10,000 to the Five.12 Foundation. BMG acquired The Hollies' catalog, and Matt Pincus led a $15 million investment in Soundtrack Your Brand. Hyatt Hotels and Supplier.io received the 2023 Top Supply Chain Projects award from Supply & Demand Chain Executive magazine. Bank of America reported a decline in uninsured deposits, while JPMorgan reported a 1.9% increase in uninsured deposits. Coinbase Global Inc. repurchased $64.5 million worth of its 0.50% convertible senior notes and decided to buy back approximately $45.5 million of its 0.50% convertible senior notes due 2026. Deutsche Bank upgraded Allegiant Travel's rating from "Hold" to "Buy" and raised its price target to $145. Finally, Ihor Dusaniwsky, managing director of S3 Partners, analyzed the stock performance of Tesla Inc., which has established a significant partnership with General Motors in the electric vehicle industry.
Future Work:¶
This notebook is an approximate implementation of GraphRAG. In future notebooks, we plan to extend it as follows:
- Implement retrieval using entity description embeddings.
- Integrate with Neo4JPropertyGraphStore.
- Compute a helpfulness score for each answer generated from the community summaries, and filter out answers with a helpfulness score of zero.
- Perform entity disambiguation to remove duplicate entities.
- Implement claims or covariate information extraction, as well as Local Search and Global Search techniques.