GraphRAG（图 + 检索增强生成）结合了检索增强生成（RAG）和查询聚焦摘要（QFS）的优势，能够有效处理大型文本数据集上的复杂查询。RAG 在获取精确信息方面表现出色，但在处理需要主题理解的更广泛查询时会遇到困难，而 QFS 虽然能解决这一挑战但难以很好地扩展。GraphRAG 集成了这些方法，以在广泛、多样化的文本语料库上提供响应迅速且全面的查询能力。

本 notebook 提供了使用 LlamaIndex PropertyGraph 抽象构建 GraphRAG 流水线的指南。

注意： 这是 GraphRAG 的一个近似实现。我们目前正在开发一系列实用指南 (cookbooks)，它们将详细介绍 GraphRAG 的确切实现。

GraphRAG 方法¶

GraphRAG 包括两个步骤

图生成 - 在给定文档上创建图、构建社区及其摘要。

回答查询 - 使用步骤 1 创建的社区摘要来回答查询。
图生成

源文档到文本块： 源文档被分成更小的文本块以便于处理。

文本块到元素实例： 分析每个文本块以识别和提取实体和关系，生成代表这些元素的元组列表。
元素实例到元素摘要： 使用 LLM 将提取的实体和关系汇总为每个元素的描述性文本块。
元素摘要到图社区： 这些实体、关系和摘要形成一个图，随后使用分层 Leiden 等算法将其划分为社区，以建立分层结构。
图社区到社区摘要： LLM 为每个社区生成摘要，提供对数据集整体主题结构和语义的洞察。
回答查询

社区摘要到全局答案： 利用社区的摘要来响应用户查询。这包括生成中间答案，然后将其整合为全面的全局答案。

GraphRAG 流水线组件¶

以下是我们实现构建上述所有过程的不同组件。

源文档到文本块： 使用 SentenceSplitter 实现，分块大小为 1024，块重叠为 20 个 token。

文本块到元素实例 AND 元素实例到元素摘要： 使用 GraphRAGExtractor 实现。
元素摘要到图社区 AND 图社区到社区摘要： 使用 GraphRAGStore 实现。
社区摘要到全局答案： 使用 GraphQueryEngine 实现。
让我们逐一查看这些组件，并构建 GraphRAG 流水线。

安装¶

使用 `graspologic` 中的 hierarchical_leiden 来构建社区。

In [ ]

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0
加载数据¶

我们将使用从 Diffbot 检索到的新闻文章样本数据集，Tomaz 已将其方便地发布在 GitHub 上以便于访问。

该数据集包含 2,500 个样本；为了便于实验，我们将使用其中 50 个样本，其中包括新闻文章的 title 和 text。

import pandas as pd from llama_index.core import Document news = pd.read_csv( "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv" )[:50] news.head()

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]

news.head()
Out[ ]

标题

	日期	文本	Chevron: Best Of Breed
0	2031-04-06T01:36:32.000000000+00:00	JHVEPhoto Like many companies in the O&G secto...	FirstEnergy (NYSE:FE) Posts Earnings Results
1	2030-04-29T06:55:28.000000000+00:00	FirstEnergy (NYSE:FE – Get Rating) posted its ...	Dáil almost suspended after Sinn Féin TD put p...
2	2023-06-15T14:32:11.000000000+00:00	The Dáil was almost suspended on Thursday afte...	Epic’s latest tool can animate hyperrealistic ...
3	2023-06-15T14:00:00.000000000+00:00	Today, Epic is releasing a new tool designed t...	EU to Ban Huawei, ZTE from Internal Commission...
4	2023-06-15T13:50:00:000000000+00:00	The European Commission is planning to ban equ...	按 LlamaIndex 要求准备文档

documents = [ Document(text=f"{row['title']}: {row['text']}") for i, row in news.iterrows() ]

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for i, row in news.iterrows()
]
Setup API Key and LLM¶

import os os.environ["OPENAI_API_KEY"] = "sk-..." from llama_index.llms.openai import OpenAI llm = OpenAI(model="gpt-4")

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

import os

os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")
GraphRAGExtractor¶

`GraphRAGExtractor` 类旨在使用 LLM 从文本中提取三元组（主语-关系-宾语），并通过为实体和关系添加描述来丰富它们的属性。

此功能类似于 SimpleLLMPathExtractor，但包含用于处理实体、关系描述的其他增强功能。有关实现指南，您可以查看类似的现有提取器。

以下是其功能的细分：

关键组件

llm: 用于提取的语言模型。

extract_prompt: 用于引导 LLM 提取信息的提示模板。
parse_fn: 一个将 LLM 输出解析为结构化数据的函数。
max_paths_per_chunk: 限制每个文本块提取的三元组数量。
num_workers: 用于并行处理多个文本节点。
主要方法

__call__: 处理文本节点列表的入口点。

acall: __call__ 的异步版本，用于提高性能。
_aextract: 处理每个单独节点的核心方法。
提取过程

对于每个输入节点（文本块）

它将文本与提取提示一起发送到 LLM。

解析 LLM 的响应以提取实体、关系以及实体和关系的描述。
实体被转换为 EntityNode 对象。实体描述存储在元数据中。
关系被转换为 Relation 对象。关系描述存储在元数据中。
这些被添加到节点的元数据中，分别在 KG_NODES_KEY 和 KG_RELATIONS_KEY 下。
注意： 在当前实现中，我们仅使用关系描述。在下一次实现中，我们将在检索阶段利用实体描述。

import asyncio import nest_asyncio nest_asyncio.apply() from typing import Any, List, Callable, Optional, Union, Dict from IPython.display import Markdown, display from llama_index.core.async_utils import run_jobs from llama_index.core.indices.property_graph.utils import ( default_parse_triplets_fn, ) from llama_index.core.graph_stores.types import ( EntityNode, KG_NODES_KEY, KG_RELATIONS_KEY, Relation, ) from llama_index.core.llms.llm import LLM from llama_index.core.prompts import PromptTemplate from llama_index.core.prompts.default_prompts import ( DEFAULT_KG_TRIPLET_EXTRACT_PROMPT, ) from llama_index.core.schema import TransformComponent, BaseNode from llama_index.core.bridge.pydantic import BaseModel, Field class GraphRAGExtractor(TransformComponent): """从图中提取三元组。使用 LLM 和简单的提示 + 输出解析从文本中提取路径（即三元组）以及实体、关系描述。参数：llm (LLM)：要使用的语言模型。extract_prompt (Union[str, PromptTemplate])：用于提取三元组的提示。parse_fn (callable)：一个用于解析语言模型输出的函数。num_workers (int)：用于并行处理的工作线程数。max_paths_per_chunk (int)：每个块提取的最大路径数。""" llm: LLM extract_prompt: PromptTemplate parse_fn: Callable num_workers: int max_paths_per_chunk: int def __init__( self, llm: Optional[LLM] = None, extract_prompt: Optional[Union[str, PromptTemplate]] = None, parse_fn: Callable = default_parse_triplets_fn, max_paths_per_chunk: int = 10, num_workers: int = 4, ) -> None: """初始化参数""" from llama_index.core import Settings if isinstance(extract_prompt, str): extract_prompt = PromptTemplate(extract_prompt) super().__init__( llm=llm or Settings.llm, extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT, parse_fn=parse_fn, num_workers=num_workers, max_paths_per_chunk=max_paths_per_chunk, ) @classmethod def class_name(cls) -> str: return "GraphExtractor" def __call__( self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any ) -> List[BaseNode]: """从节点提取三元组""" return asyncio.run( self.acall(nodes, show_progress=show_progress, **kwargs) ) async def _aextract(self, node: BaseNode) -> BaseNode: """从节点提取三元组""" assert hasattr(node, "text") text = node.get_content(metadata_mode="llm") try: llm_response = await self.llm.apredict( self.extract_prompt, text=text, max_knowledge_triplets=self.max_paths_per_chunk, ) entities, entities_relationship = self.parse_fn(llm_response) except ValueError: entities = [] entities_relationship = [] existing_nodes = node.metadata.pop(KG_NODES_KEY, []) existing_relations = node.metadata.pop(KG_RELATIONS_KEY, []) metadata = node.metadata.copy() for entity, entity_type, description in entities: metadata[ "entity_description" ] = description # 在当前实现中未使用。但在未来工作中将很有用。 entity_node = EntityNode( name=entity, label=entity_type, properties=metadata ) existing_nodes.append(entity_node) metadata = node.metadata.copy() for triple in entities_relationship: subj, obj, rel, description = triple subj_node = EntityNode(name=subj, properties=metadata) obj_node = EntityNode(name=obj, properties=metadata) metadata["relationship_description"] = description rel_node = Relation( label=rel, source_id=subj_node.id, target_id=obj_node.id, properties=metadata, ) existing_nodes.extend([subj_node, obj_node]) existing_relations.append(rel_node) node.metadata[KG_NODES_KEY] = existing_nodes node.metadata[KG_RELATIONS_KEY] = existing_relations return node async def acall( self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any ) -> List[BaseNode]: """异步从节点提取三元组""" jobs = [] for node in nodes: jobs.append(self._aextract(node)) return await run_jobs( jobs, workers=self.num_workers, show_progress=show_progress, desc="从文本提取路径", )

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)

        metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )

            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )
GraphRAGStore¶

`GraphRAGStore` 类是 `SimplePropertyGraphStore` 类的扩展，旨在实现 GraphRAG 流水线。以下是其关键组件和功能的细分：

该类使用社区检测算法对图中的相关节点进行分组，然后使用 LLM 为每个社区生成摘要。

关键方法

build_communities()

将内部图表示转换为 NetworkX 图。

应用分层 Leiden 算法进行社区检测。
收集每个社区的详细信息。
为每个社区生成摘要。
generate_community_summary(text)

使用 LLM 生成社区中关系的摘要。

摘要包含实体名称和关系描述的综合信息。
_create_nx_graph()

将内部图表示转换为用于社区检测的 NetworkX 图。

_collect_community_info(nx_graph, clusters)

根据其社区收集每个节点的详细信息。

创建社区内每个关系的字符串表示形式。
_summarize_communities(community_info)

使用 LLM 为每个社区生成并存储摘要。

get_community_summaries()

返回社区摘要，如果尚未构建则进行构建。

import re from llama_index.core.graph_stores import SimplePropertyGraphStore import networkx as nx from graspologic.partition import hierarchical_leiden from llama_index.core.llms import ChatMessage class GraphRAGStore(SimplePropertyGraphStore): community_summary = {} max_cluster_size = 5 def generate_community_summary(self, text): """使用 LLM 为给定文本生成摘要""" messages = [ ChatMessage( role="system", content=( "您将获得一组来自知识图谱的关系，每个关系表示为 " "entity1->entity2->relation->relationship_description。您的任务是为这些关系创建一个摘要。摘要应包括涉及的实体名称和关系描述的简洁综合。目标是捕捉最关键和相关的细节，以突出每个关系的性质和重要性。确保摘要连贯，并以强调关系关键方面的方式整合信息。" ), ), ChatMessage(role="user", content=text), ] response = OpenAI().chat(messages) clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip() return clean_response def build_communities(self): """从图中构建社区并对其进行摘要""" nx_graph = self._create_nx_graph() community_hierarchical_clusters = hierarchical_leiden( nx_graph, max_cluster_size=self.max_cluster_size ) community_info = self._collect_community_info( nx_graph, community_hierarchical_clusters ) self._summarize_communities(community_info) def _create_nx_graph(self): """将内部图表示转换为 NetworkX 图""" nx_graph = nx.Graph() for node in self.graph.nodes.values(): nx_graph.add_node(str(node)) for relation in self.graph.relations.values(): nx_graph.add_edge( relation.source_id, relation.target_id, relationship=relation.label, description=relation.properties["relationship_description"], ) return nx_graph def _collect_community_info(self, nx_graph, clusters): """根据社区收集每个节点的详细信息""" community_mapping = {item.node: item.cluster for item in clusters} community_info = {} for item in clusters: cluster_id = item.cluster node = item.node if cluster_id not in community_info: community_info[cluster_id] = [] for neighbor in nx_graph.neighbors(node): if community_mapping[neighbor] == cluster_id: edge_data = nx_graph.get_edge_data(node, neighbor) if edge_data: detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}" community_info[cluster_id].append(detail) return community_info def _summarize_communities(self, community_info): """为每个社区生成并存储摘要""" for community_id, details in community_info.items(): details_text = ( "\n".join(details) + "." ) # 确保以句点结束 self.community_summary[ community_id ] = self.generate_community_summary(details_text) def get_community_summaries(self): """返回社区摘要，如果尚未构建则构建""" if not self.community_summary: self.build_communities() return self.community_summary

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden

from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []

            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary
GraphRAGQueryEngine¶

/usr/local/lib/python3.10/dist-packages/graspologic/models/edge_swaps.py:215: NumbaDeprecationWarning: The keyword argument 'nopython=False' was supplied. From Numba 0.59.0 the default is being changed to True and use of 'nopython=False' will raise a warning as the argument will have no effect. See https://numba.readthedocs.cn/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  _edge_swap_numba = nb.jit(_edge_swap, nopython=False)

`GraphRAGQueryEngine` 类是一个自定义查询引擎，旨在采用 GraphRAG 方法处理查询。它利用 `GraphRAGStore` 生成的社区摘要来回答用户查询。以下是其功能的细分：

主要组件

graph_store: GraphRAGStore 的一个实例，包含社区摘要。llm: 用于生成和聚合答案的语言模型 (LLM)。

custom_query(query_str: str)

build_communities()

这是处理查询的主要入口点。它检索社区摘要，从每个摘要生成答案，然后将这些答案聚合成最终响应。

generate_answer_from_summary(community_summary, query)

基于单个社区摘要为查询生成答案。使用 LLM 在查询上下文中解释社区摘要。

aggregate_answers(community_answers)

将来自不同社区的单个答案组合成连贯的最终响应。

使用 LLM 将多个视角合成为一个简洁的答案。
查询处理流程

从图存储中检索社区摘要。

对于每个社区摘要，生成针对查询的特定答案。
将所有社区特定答案聚合成最终的、连贯的响应。
使用示例

示例用法

query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm)

response = query_engine.query("query")

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for _, community_summary in community_summaries.items()
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
from llama_index.core.query_engine import CustomQueryEngine from llama_index.core.llms import LLM class GraphRAGQueryEngine(CustomQueryEngine): graph_store: GraphRAGStore llm: LLM def custom_query(self, query_str: str) -> str: """处理所有社区摘要，生成对特定查询的回答。""" community_summaries = self.graph_store.get_community_summaries() community_answers = [ self.generate_answer_from_summary(community_summary, query_str) for _, community_summary in community_summaries.items() ] final_answer = self.aggregate_answers(community_answers) return final_answer def generate_answer_from_summary(self, community_summary, query): """使用 LLM 根据给定查询从社区摘要中生成答案。""" prompt = ( f"给定社区摘要：{community_summary}，" f"您会如何回答以下查询？查询：{query}" ) messages = [ ChatMessage(role="system", content=prompt), ChatMessage( role="user", content="我需要基于上述信息的回答。", ), ] response = self.llm.chat(messages) cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip() return cleaned_response def aggregate_answers(self, community_answers): """将单个社区答案聚合成最终的、连贯的回答。""" # intermediate_text = " ".join(community_answers) prompt = "将以下中间答案合并成最终的、简洁的回答。" messages = [ ChatMessage(role="system", content=prompt), ChatMessage( role="user", content=f"中间答案：{community_answers}", ), ] final_response = self.llm.chat(messages) cleaned_final_response = re.sub( r"^assistant:\s*", "", str(final_response) ).strip() return cleaned_final_response

构建端到端 GraphRAG 管道¶

现在我们已经定义了所有必要的组件，接下来构建 GraphRAG 管道。

从文本创建节点/块。
使用 GraphRAGExtractor 和 GraphRAGStore 构建 PropertyGraphIndex。
构建社区，并使用上面构建的图为每个社区生成摘要。
创建一个 GraphRAGQueryEngine 并开始查询。

从文本创建节点/块。¶

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
from llama_index.core.node_parser import SentenceSplitter splitter = SentenceSplitter( chunk_size=1024, chunk_overlap=20, ) nodes = splitter.get_nodes_from_documents(documents)

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

len(nodes)
len(nodes)

标题

使用 `GraphRAGExtractor` 和 `GraphRAGStore` 构建 ProperGraphIndex¶

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Albert Einstein",
      "entity_type": "Person",
      "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
    },
    {
      "entity_name": "Theory of Relativity",
      "entity_type": "Scientific Theory",
      "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
    },
    {
      "entity_name": "Nobel Prize in Physics",
      "entity_type": "Award",
      "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
    }
  ],
  "relationships": [
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Theory of Relativity",
      "relation": "developed",
      "relationship_description": "Albert Einstein is the developer of the theory of relativity."
    },
    {
      "source_entity": "Albert Einstein",
      "target_entity": "Nobel Prize in Physics",
      "relation": "won",
      "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
    }
  ]
}

-Real Data-
######################
text: {text}
######################
output:"""
KG_TRIPLET_EXTRACT_TMPL = """ -目标- 给定文本文档，从文本中识别所有实体及其实体类型，以及所识别实体之间的所有关系。根据文本，最多提取 {max_knowledge_triplets} 个实体-关系三元组。 -步骤- 1. 识别所有实体。对于每个识别出的实体，提取以下信息： - entity_name：实体名称，首字母大写 - entity_type：实体类型 - entity_description：实体属性和活动的详细描述 2. 从步骤 1 中识别出的实体中，识别所有 *明确相关* 的 (source_entity, target_entity) 对。对于每对相关实体，提取以下信息： - source_entity：源实体名称，与步骤 1 中识别的名称相同 - target_entity：目标实体名称，与步骤 1 中识别的名称相同 - relation：source_entity 和 target_entity 之间的关系 - relationship_description：解释为什么认为源实体和目标实体相互关联 3. 输出格式： - 以有效的 JSON 格式返回结果，包含两个键：'entities'（实体对象列表）和 'relationships'（关系对象列表）。 - 不包含 JSON 结构以外的任何文本（例如，不包含解释或注释）。 - 如果未识别出实体或关系，则返回空列表：{ "entities": [], "relationships": [] }。 -输出示例- { "entities": [ { "entity_name": "Albert Einstein", "entity_type": "Person", "entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics." }, { "entity_name": "Theory of Relativity", "entity_type": "Scientific Theory", "entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference." }, { "entity_name": "Nobel Prize in Physics", "entity_type": "Award", "entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences." } ], "relationships": [ { "source_entity": "Albert Einstein", "target_entity": "Theory of Relativity", "relation": "developed", "relationship_description": "Albert Einstein is the developer of the theory of relativity." }, { "source_entity": "Albert Einstein", "target_entity": "Nobel Prize in Physics", "relation": "won", "relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921." } ] } -实际数据- ###################### text: {text} ###################### output:"""

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





import json


def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)
import json def parse_fn(response_str: str) -> Any: json_pattern = r"\{.*\}" match = re.search(json_pattern, response_str, re.DOTALL) entities = [] relationships = [] if not match: return entities, relationships json_str = match.group(0) try: data = json.loads(json_str) entities = [ ( entity["entity_name"], entity["entity_type"], entity["entity_description"], ) for entity in data.get("entities", []) ] relationships = [ ( relation["source_entity"], relation["target_entity"], relation["relation"], relation["relationship_description"], ) for relation in data.get("relationships", []) ] return entities, relationships except json.JSONDecodeError as e: print("解析 JSON 时出错：", e) return entities, relationships kg_extractor = GraphRAGExtractor( llm=llm, extract_prompt=KG_TRIPLET_EXTRACT_TMPL, max_paths_per_chunk=2, parse_fn=parse_fn, )

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0





from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)
from llama_index.core import PropertyGraphIndex index = PropertyGraphIndex( nodes=nodes, property_graph_store=GraphRAGStore(), kg_extractors=[kg_extractor], show_progress=True, )

Extracting paths from text: 100%|██████████| 50/50 [04:30<00:00,  5.41s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it]
Generating embeddings: 100%|██████████| 4/4 [00:00<00:00,  4.22it/s]

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

list(index.property_graph_store.graph.nodes.values())[-1]
list(index.property_graph_store.graph.nodes.values())[-1]

标题

EntityNode(label='entity', embedding=None, properties={'relationship_description': 'Gett Taxi is a competitor of Uber in the Israeli taxi market.', 'triplet_source_id': 'e4f765e3-fdfd-48d0-92a9-36f75b5865aa'}, name='Competition')

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

list(index.property_graph_store.graph.relations.values())[0]
list(index.property_graph_store.graph.relations.values())[0]

标题

Relation(label='O&G sector', source_id='Chevron', target_id='Operates in', properties={'relationship_description': 'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.', 'triplet_source_id': '6a28dc67-0dc0-486f-8dd6-70a3502f1c8e'})

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

list(index.property_graph_store.graph.relations.values())[0].properties[
    "relationship_description"
]
list(index.property_graph_store.graph.relations.values())[0].properties[ "relationship_description" ]

标题

'Chevron operates in the O&G sector, as evidenced by the text mentioning that it is a company in this industry.'

构建社区¶

这将为每个社区创建社区和摘要。

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

index.property_graph_store.build_communities()
index.property_graph_store.build_communities()

创建 QueryEngine¶

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store, llm=llm
)
query_engine = GraphRAGQueryEngine( graph_store=index.property_graph_store, llm=llm )

查询¶

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

response = query_engine.query(
    "What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))
response = query_engine.query( "文档中讨论的主要新闻是什么？" ) display(Markdown(f"{response.response}"))

文档讨论了不同领域的各种新闻话题。在商业领域，提到 FirstEnergy 是一家在纽约证券交易所上市的上市公司，State Street Corporation 也在纽交所上市。文档还讨论了 Coinbase Global Inc. 回购价值 6450 万美元的 0.50% 可转换优先票据以及初创公司 Protonn 的关闭。在政治领域，文档重点介绍了新芬党议员 John Brady 在关于预备消防员的辩论中表演的戏剧性行为。在科技行业，文档讨论了欧盟委员会因安全问题对中兴通讯（ZTE Corp.）和 TikTok Inc. 采取的行动。在体育领域，文档提到了曼联对哈里·凯恩的兴趣、 Jude Bellingham 从多特蒙德转会至皇家马德里，以及 Maliek Collins 与休斯顿德克萨斯人队续约合同的谈判过程。在音乐行业，文档讨论了 BMG 收购 The Hollies 的唱片目录以及 ADA Worldwide 与 Rostrum Records 之间的分销协议。在酒店业，文档提到了 Supplier.io 和 Hyatt Hotels 之间的合作关系。在能源领域，文档讨论了 GE Vernova 和 Amplus Solar 之间的合作关系。在游戏行业，文档讨论了 Square Enix 正在制作未公布的游戏“星之海洋：第二个故事 R”。在汽车行业，文档提到了现代 Exter 即将在印度推出以及 Stellantis 关闭 Belvidere 装配厂的计划。在航空业，文档讨论了德意志银行决定将 Allegiant Travel 的评级从“持有”上调至“买入”。在足球领域，文档讨论了阿森纳对赖斯的报价被拒绝以及切尔西收到的关于 Mason Mount 的报价被拒绝。在航天业，文档提到 MDA Ltd. 参加了 Jefferies 虚拟航天峰会。在交通运输业，文档讨论了 Uber 退出以色列市场的战略决定以及 Yango 成为以色列出租车市场主要参与者的出现。

已复制！

!pip install llama-index graspologic numpy==1.24.4 scipy==1.12.0

response = query_engine.query("What are news related to financial sector?")
display(Markdown(f"{response.response}"))
response = query_engine.query("有哪些与金融领域相关的新闻？") display(Markdown(f"{response.response}"))

最近与金融领域相关的新闻包括：摩根士丹利聘请 Thomas Christl 共同领导其在欧洲的消费和零售客户业务。KeyBank 在美国西部通过在 American Fork 开设新分支机构扩大了业务，并向 Five.12 Foundation 捐赠了 10,000 美元。BMG 收购了 The Hollies 的唱片目录，Matt Pincus 为 Soundtrack Your Brand 领投了 1500 万美元的增长前投资。Hyatt Hotels 和 Supplier.io 荣获了《Supply & Demand Chain Executive》杂志颁发的 2023 年顶级供应链项目奖。美国银行报告了未保险存款下降，而摩根大通报告未保险存款增加了 1.9%。Coinbase Global Inc. 回购了价值 6450 万美元的 0.50% 可转换优先票据，并决定以约 4550 万美元回购其 2026 年到期的 0.50% 可转换优先票据。德意志银行将 Allegiant Travel 的评级从“持有”上调至“买入”，并将目标价提高至 145 美元。最后，S3 Partners 的董事总经理 Ihor Dusaniwsky 分析了特斯拉公司的股票表现，该公司在电动汽车行业与通用汽车公司建立了重要的合作伙伴关系。

未来工作：¶

本教程是 GraphRAG 的一个近似实现。在未来的教程中，我们计划按以下方式扩展：

实现使用实体描述嵌入的检索。
集成 Neo4JPropertyGraphStore。
计算从社区摘要生成的每个答案的有用性得分，并过滤掉有用性得分为零的答案。
执行实体消歧以去除重复实体。
实现主张或协变量信息提取、局部搜索和全局搜索技术。

GraphRAG 包括两个步骤

以下是我们实现构建上述所有过程的不同组件。

使用 graspologic 中的 hierarchical_leiden 来构建社区。

我们将使用从 Diffbot 检索到的新闻文章样本数据集，Tomaz 已将其方便地发布在 GitHub 上以便于访问。

import os os.environ["OPENAI_API_KEY"] = "sk-..." from llama_index.llms.openai import OpenAI llm = OpenAI(model="gpt-4")

GraphRAGExtractor 类旨在使用 LLM 从文本中提取三元组（主语-关系-宾语），并通过为实体和关系添加描述来丰富它们的属性。

GraphRAGStore 类是 SimplePropertyGraphStore 类的扩展，旨在实现 GraphRAG 流水线。以下是其关键组件和功能的细分：

GraphRAGQueryEngine 类是一个自定义查询引擎，旨在采用 GraphRAG 方法处理查询。它利用 GraphRAGStore 生成的社区摘要来回答用户查询。以下是其功能的细分：

构建端到端 GraphRAG 管道¶

从文本创建节点/块。¶

使用 GraphRAGExtractor 和 GraphRAGStore 构建 ProperGraphIndex¶

构建社区¶

创建 QueryEngine¶

查询¶

未来工作：¶

使用 `graspologic` 中的 hierarchical_leiden 来构建社区。

`GraphRAGExtractor` 类旨在使用 LLM 从文本中提取三元组（主语-关系-宾语），并通过为实体和关系添加描述来丰富它们的属性。

`GraphRAGStore` 类是 `SimplePropertyGraphStore` 类的扩展，旨在实现 GraphRAG 流水线。以下是其关键组件和功能的细分：

`GraphRAGQueryEngine` 类是一个自定义查询引擎，旨在采用 GraphRAG 方法处理查询。它利用 `GraphRAGStore` 生成的社区摘要来回答用户查询。以下是其功能的细分：

使用 `GraphRAGExtractor` 和 `GraphRAGStore` 构建 ProperGraphIndex¶