GraphRAG Implementation with LlamaIndex - V2¶
GraphRAG (Graphs + Retrieval Augmented Generation) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to effectively handle complex queries over large text datasets. While RAG excels at fetching precise information, it struggles with broader queries that require thematic understanding, a challenge that QFS addresses but cannot scale well. GraphRAG integrates these approaches to offer responsive and thorough querying capabilities across extensive, diverse text corpora.
This notebook provides guidance on building the GraphRAG pipeline using Neo4j and the LlamaIndex PropertyGraph abstractions.
This notebook updates the GraphRAG pipeline to v2. If you haven't yet checked out v1, you can find it here. The following are the updates to the existing implementation:
- Integration with the Neo4j graph database.
- Embedding-based retrieval.
Installation¶
We use `graspologic`'s hierarchical Leiden algorithm for building communities.
!pip install llama-index llama-index-graph-stores-neo4j graspologic numpy==1.24.4 scipy==1.12.0 future
Load Data¶
We will use a sample dataset of news articles retrieved from Diffbot, which Tomaz has conveniently made available on GitHub for easy access.
The dataset contains 2,500 samples; for ease of experimentation, we will use 50 of these samples, which include the `title` and `text` of news articles.
import pandas as pd
from llama_index.core import Document
news = pd.read_csv(
"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)[:50]
news.head()
| | title | date | text |
|---|---|---|---|
| 0 | Chevron: Best Of Breed | 2031-04-06T01:36:32.000000000+00:00 | JHVEPhoto Like many companies in the O&G secto... |
| 1 | FirstEnergy (NYSE:FE) Posts Earnings Results | 2030-04-29T06:55:28.000000000+00:00 | FirstEnergy (NYSE:FE – Get Rating) posted its ... |
| 2 | Dáil almost suspended after Sinn Féin TD remarks... | 2023-06-15T14:32:11.000000000+00:00 | The Dáil was almost suspended on Thursday afte... |
| 3 | Epic's latest tool can animate hyperrealistic... | 2023-06-15T14:00:00.000000000+00:00 | Today, Epic is releasing a new tool designed t... |
| 4 | EU to Ban Huawei, ZTE from Internal Commission... | 2023-06-15T13:50:00.000000000+00:00 | The European Commission is planning to ban equ... |
Prepare documents as required by LlamaIndex.
documents = [
Document(text=f"{row['title']}: {row['text']}")
for i, row in news.iterrows()
]
Setup API Key and LLM¶
import os
os.environ["OPENAI_API_KEY"] = "sk-.."
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4")
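Since v2 adds embedding-based retrieval, building the index below will also generate embeddings. If no embedding model is configured, LlamaIndex falls back to its default OpenAI embedding model; you can set one explicitly if you prefer (an optional sketch, and the model name here is just one possible choice):
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Optional: pin the embedding model used when building the PropertyGraphIndex.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")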
GraphRAGExtractor¶
The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text and enrich them by adding descriptions for entities and relationships using an LLM.
Its functionality is similar to that of `SimpleLLMPathExtractor`, but includes additional enhancements to handle entity and relationship descriptions. For guidance on the implementation, you can examine similar existing extractors.
Here's a breakdown of its functionality:
Key components:
- `llm`: The LLM used for extraction.
- `extract_prompt`: The prompt template used to guide the LLM in extracting information.
- `parse_fn`: A function to parse the LLM's output into structured data.
- `max_paths_per_chunk`: Limits the number of triples extracted per text chunk.
- `num_workers`: For parallel processing of multiple text nodes.
Main methods:
- `__call__`: The entry point for processing a list of text nodes.
- `acall`: An asynchronous version of `__call__` for improved performance.
- `_aextract`: The core method that processes each individual node.
Extraction process:
For each input node (text chunk):
- It sends the text to the LLM along with the extraction prompt.
- The LLM's response is parsed to extract entities, relationships, and descriptions of both.
- Entities are converted into `EntityNode` objects, with the entity description stored in the metadata.
- Relationships are converted into `Relation` objects, with the relationship description stored in the metadata.
- Both are added to the node's metadata, under `KG_NODES_KEY` and `KG_RELATIONS_KEY` respectively.
NOTE: In the current implementation, we use only the relationship descriptions. In the next implementation, we will make use of the entity descriptions at the retrieval stage.
import asyncio
import nest_asyncio
nest_asyncio.apply()
from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display
from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
EntityNode,
KG_NODES_KEY,
KG_RELATIONS_KEY,
Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field
class GraphRAGExtractor(TransformComponent):
"""Extract triples from a graph.
Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.
Args:
llm (LLM):
The language model to use.
extract_prompt (Union[str, PromptTemplate]):
The prompt to use for extracting triples.
parse_fn (callable):
A function to parse the output of the language model.
num_workers (int):
The number of workers to use for parallel processing.
max_paths_per_chunk (int):
The maximum number of paths to extract per chunk.
"""
llm: LLM
extract_prompt: PromptTemplate
parse_fn: Callable
num_workers: int
max_paths_per_chunk: int
def __init__(
self,
llm: Optional[LLM] = None,
extract_prompt: Optional[Union[str, PromptTemplate]] = None,
parse_fn: Callable = default_parse_triplets_fn,
max_paths_per_chunk: int = 10,
num_workers: int = 4,
) -> None:
"""Init params."""
from llama_index.core import Settings
if isinstance(extract_prompt, str):
extract_prompt = PromptTemplate(extract_prompt)
super().__init__(
llm=llm or Settings.llm,
extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
parse_fn=parse_fn,
num_workers=num_workers,
max_paths_per_chunk=max_paths_per_chunk,
)
@classmethod
def class_name(cls) -> str:
return "GraphExtractor"
def __call__(
self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
) -> List[BaseNode]:
"""Extract triples from nodes."""
return asyncio.run(
self.acall(nodes, show_progress=show_progress, **kwargs)
)
async def _aextract(self, node: BaseNode) -> BaseNode:
"""Extract triples from a node."""
assert hasattr(node, "text")
text = node.get_content(metadata_mode="llm")
try:
llm_response = await self.llm.apredict(
self.extract_prompt,
text=text,
max_knowledge_triplets=self.max_paths_per_chunk,
)
entities, entities_relationship = self.parse_fn(llm_response)
except ValueError:
entities = []
entities_relationship = []
existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
entity_metadata = node.metadata.copy()
for entity, entity_type, description in entities:
entity_metadata["entity_description"] = description
entity_node = EntityNode(
name=entity, label=entity_type, properties=entity_metadata
)
existing_nodes.append(entity_node)
relation_metadata = node.metadata.copy()
for triple in entities_relationship:
subj, obj, rel, description = triple
relation_metadata["relationship_description"] = description
rel_node = Relation(
label=rel,
source_id=subj,
target_id=obj,
properties=relation_metadata,
)
existing_relations.append(rel_node)
node.metadata[KG_NODES_KEY] = existing_nodes
node.metadata[KG_RELATIONS_KEY] = existing_relations
return node
async def acall(
self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
) -> List[BaseNode]:
"""Extract triples from nodes async."""
jobs = []
for node in nodes:
jobs.append(self._aextract(node))
return await run_jobs(
jobs,
workers=self.num_workers,
show_progress=show_progress,
desc="Extracting paths from text",
)
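Before running the full pipeline, here is a minimal smoke test of the extractor wiring. The `dummy_parse_fn` and sample text below are placeholders for illustration (the real prompt and parser are defined later in this notebook); the LLM is still called with the default extraction prompt, so an API key must be set.
from llama_index.core.schema import TextNode

def dummy_parse_fn(response_str):
    # Placeholder parser: ignore the LLM response and return one hand-written
    # entity and one relation in the shapes `_aextract` expects.
    entities = [("Chevron", "Company", "An energy company.")]
    relations = [
        ("Chevron", "O&G sector", "operates in", "Chevron operates in the O&G sector.")
    ]
    return entities, relations

toy_extractor = GraphRAGExtractor(llm=llm, parse_fn=dummy_parse_fn)
toy_nodes = toy_extractor([TextNode(text="Chevron: Best Of Breed ...")])
print(toy_nodes[0].metadata[KG_NODES_KEY])      # [EntityNode(name='Chevron', ...)]
print(toy_nodes[0].metadata[KG_RELATIONS_KEY])  # [Relation(label='operates in', ...)]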
GraphRAGStore¶
The `GraphRAGStore` class is an extension of the `Neo4jPropertyGraphStore` class, designed to implement the GraphRAG pipeline. Here's a breakdown of its key components and functions:
The class uses community detection algorithms to group related nodes in the graph, and then uses an LLM to generate a summary for each community.
Main methods:
- `build_communities()`: Converts the internal graph representation to a NetworkX graph, applies the hierarchical Leiden algorithm for community detection, collects detailed information for each community, and generates a summary for each community.
- `generate_community_summary(text)`: Uses an LLM to generate a summary of the relationships within a community. The summary includes entity names and a synthesis of the relationship descriptions.
- `_create_nx_graph()`: Converts the internal graph representation to a NetworkX graph for community detection.
- `_collect_community_info(nx_graph, clusters)`: Collects detailed information for each node based on its community, creating a string representation of each relationship within the community.
- `_summarize_communities(community_info)`: Generates and stores a summary for each community using an LLM.
- `get_community_summaries()`: Returns the community summaries, building them first if they have not yet been generated.
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict
from llama_index.core.llms import ChatMessage
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
class GraphRAGStore(Neo4jPropertyGraphStore):
community_summary = {}
entity_info = None
max_cluster_size = 5
def generate_community_summary(self, text):
"""Generate summary for a given text using an LLM."""
messages = [
ChatMessage(
role="system",
content=(
"You are provided with a set of relationships from a knowledge graph, each represented as "
"entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
"relationships. The summary should include the names of the entities involved and a concise synthesis "
"of the relationship descriptions. The goal is to capture the most critical and relevant details that "
"highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
"integrates the information in a way that emphasizes the key aspects of the relationships."
),
),
ChatMessage(role="user", content=text),
]
response = OpenAI().chat(messages)
clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
return clean_response
def build_communities(self):
"""Builds communities from the graph and summarizes them."""
nx_graph = self._create_nx_graph()
community_hierarchical_clusters = hierarchical_leiden(
nx_graph, max_cluster_size=self.max_cluster_size
)
self.entity_info, community_info = self._collect_community_info(
nx_graph, community_hierarchical_clusters
)
self._summarize_communities(community_info)
def _create_nx_graph(self):
"""Converts internal graph representation to NetworkX graph."""
nx_graph = nx.Graph()
triplets = self.get_triplets()
for entity1, relation, entity2 in triplets:
nx_graph.add_node(entity1.name)
nx_graph.add_node(entity2.name)
nx_graph.add_edge(
relation.source_id,
relation.target_id,
relationship=relation.label,
description=relation.properties["relationship_description"],
)
return nx_graph
def _collect_community_info(self, nx_graph, clusters):
"""
Collect information for each node based on their community,
allowing entities to belong to multiple clusters.
"""
entity_info = defaultdict(set)
community_info = defaultdict(list)
for item in clusters:
node = item.node
cluster_id = item.cluster
# Update entity_info
entity_info[node].add(cluster_id)
for neighbor in nx_graph.neighbors(node):
edge_data = nx_graph.get_edge_data(node, neighbor)
if edge_data:
detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
community_info[cluster_id].append(detail)
# Convert sets to lists for easier serialization if needed
entity_info = {k: list(v) for k, v in entity_info.items()}
return dict(entity_info), dict(community_info)
def _summarize_communities(self, community_info):
"""Generate and store summaries for each community."""
for community_id, details in community_info.items():
details_text = (
"\n".join(details) + "."
) # Ensure it ends with a period
self.community_summary[
community_id
] = self.generate_community_summary(details_text)
def get_community_summaries(self):
"""Returns the community summaries, building them if not already done."""
if not self.community_summary:
self.build_communities()
return self.community_summary
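If you are curious what `hierarchical_leiden` returns before running it on the full graph, here is a small standalone sketch on a made-up NetworkX graph. Each returned item carries a `node`, a `cluster` id, and a `level` in the hierarchy, which is exactly what `_collect_community_info` iterates over.
import networkx as nx
from graspologic.partition import hierarchical_leiden

# Toy graph: two triangles joined by a single bridge edge.
g = nx.Graph()
g.add_edges_from(
    [
        ("a", "b"), ("b", "c"), ("a", "c"),
        ("x", "y"), ("y", "z"), ("x", "z"),
        ("c", "x"),
    ]
)
for item in hierarchical_leiden(g, max_cluster_size=3):
    print(item.node, item.cluster, item.level)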
GraphRAGQueryEngine¶
The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:
Main components:
- `graph_store`: An instance of GraphRAGStore containing the community summaries and the entity-to-community mapping.
- `index`: The PropertyGraphIndex used to retrieve entities relevant to the query.
- `llm`: The language model (LLM) used for generating and aggregating answers.
- `similarity_top_k`: The number of top similar nodes to retrieve for entity extraction.
Main methods:
- `custom_query(query_str: str)`: The main entry point for processing a query. It retrieves the entities relevant to the query, maps them to their communities, generates an answer from each relevant community summary, and then aggregates these answers into a final response.
- `generate_answer_from_summary(community_summary, query)`: Generates an answer to the query based on a single community summary, using the LLM to interpret the community summary in the context of the query.
- `aggregate_answers(community_answers)`: Combines the individual answers from the different communities into a coherent final response, using the LLM to synthesize multiple perspectives into a single concise answer.
Query processing flow:
- Retrieve entities relevant to the query and look up the communities they belong to.
- For each relevant community summary, generate a specific answer to the query.
- Aggregate all community-specific answers into a final, coherent response.
Example usage:
query_engine = GraphRAGQueryEngine(graph_store=graph_store, llm=llm)
response = query_engine.query("query")
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM
from llama_index.core import PropertyGraphIndex
import re
class GraphRAGQueryEngine(CustomQueryEngine):
graph_store: GraphRAGStore
index: PropertyGraphIndex
llm: LLM
similarity_top_k: int = 20
def custom_query(self, query_str: str) -> str:
"""Process all community summaries to generate answers to a specific query."""
entities = self.get_entities(query_str, self.similarity_top_k)
community_ids = self.retrieve_entity_communities(
self.graph_store.entity_info, entities
)
community_summaries = self.graph_store.get_community_summaries()
community_answers = [
self.generate_answer_from_summary(community_summary, query_str)
for id, community_summary in community_summaries.items()
if id in community_ids
]
final_answer = self.aggregate_answers(community_answers)
return final_answer
def get_entities(self, query_str, similarity_top_k):
nodes_retrieved = self.index.as_retriever(
similarity_top_k=similarity_top_k
).retrieve(query_str)
entities = set()
pattern = (
r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
)
for node in nodes_retrieved:
matches = re.findall(
pattern, node.text, re.MULTILINE | re.IGNORECASE
)
for match in matches:
subject = match[0]
obj = match[2]
entities.add(subject)
entities.add(obj)
return list(entities)
def retrieve_entity_communities(self, entity_info, entities):
"""
Retrieve cluster information for given entities, allowing for multiple clusters per entity.
Args:
entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
entities (list): List of entity names to retrieve information for.
Returns:
List of community or cluster IDs to which an entity belongs.
"""
community_ids = []
for entity in entities:
if entity in entity_info:
community_ids.extend(entity_info[entity])
return list(set(community_ids))
def generate_answer_from_summary(self, community_summary, query):
"""Generate an answer from a community summary based on a given query using LLM."""
prompt = (
f"Given the community summary: {community_summary}, "
f"how would you answer the following query? Query: {query}"
)
messages = [
ChatMessage(role="system", content=prompt),
ChatMessage(
role="user",
content="I need an answer based on the above information.",
),
]
response = self.llm.chat(messages)
cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
return cleaned_response
def aggregate_answers(self, community_answers):
"""Aggregate individual community answers into a final, coherent response."""
# intermediate_text = " ".join(community_answers)
prompt = "Combine the following intermediate answers into a final, concise response."
messages = [
ChatMessage(role="system", content=prompt),
ChatMessage(
role="user",
content=f"Intermediate answers: {community_answers}",
),
]
final_response = self.llm.chat(messages)
cleaned_final_response = re.sub(
r"^assistant:\s*", "", str(final_response)
).strip()
return cleaned_final_response
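As a quick illustration of how `get_entities` pulls entity names out of retrieved node text, here is the same regex applied to a hypothetical triple string of the form stored alongside the chunks:
import re

pattern = r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
sample = "Albert Einstein -> developed -> Theory of Relativity"  # hypothetical
print(re.findall(pattern, sample, re.MULTILINE | re.IGNORECASE))
# [('Albert Einstein', 'developed', 'Theory of Relativity')]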
Build the End-to-End GraphRAG Pipeline¶
Now that we have defined all the necessary components, let's construct the GraphRAG pipeline:
- Create nodes/chunks from the text.
- Build a PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`.
- Build communities from the graph constructed above and generate a summary for each community.
- Create a `GraphRAGQueryEngine` and start querying.
Create nodes/chunks from the text¶
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
len(nodes)
50
Build PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`¶
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
3. Output Formatting:
- Return the result in valid JSON format with two keys: 'entities' (list of entity objects) and 'relationships' (list of relationship objects).
- Exclude any text outside the JSON structure (e.g., no explanations or comments).
- If no entities or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.
-An Output Example-
{
"entities": [
{
"entity_name": "Albert Einstein",
"entity_type": "Person",
"entity_description": "Albert Einstein was a theoretical physicist who developed the theory of relativity and made significant contributions to physics."
},
{
"entity_name": "Theory of Relativity",
"entity_type": "Scientific Theory",
"entity_description": "A scientific theory developed by Albert Einstein, describing the laws of physics in relation to observers in different frames of reference."
},
{
"entity_name": "Nobel Prize in Physics",
"entity_type": "Award",
"entity_description": "A prestigious international award in the field of physics, awarded annually by the Royal Swedish Academy of Sciences."
}
],
"relationships": [
{
"source_entity": "Albert Einstein",
"target_entity": "Theory of Relativity",
"relation": "developed",
"relationship_description": "Albert Einstein is the developer of the theory of relativity."
},
{
"source_entity": "Albert Einstein",
"target_entity": "Nobel Prize in Physics",
"relation": "won",
"relationship_description": "Albert Einstein won the Nobel Prize in Physics in 1921."
}
]
}
-Real Data-
######################
text: {text}
######################
output:"""
import json
def parse_fn(response_str: str) -> Any:
json_pattern = r"\{.*\}"
match = re.search(json_pattern, response_str, re.DOTALL)
entities = []
relationships = []
if not match:
return entities, relationships
json_str = match.group(0)
try:
data = json.loads(json_str)
entities = [
(
entity["entity_name"],
entity["entity_type"],
entity["entity_description"],
)
for entity in data.get("entities", [])
]
relationships = [
(
relation["source_entity"],
relation["target_entity"],
relation["relation"],
relation["relationship_description"],
)
for relation in data.get("relationships", [])
]
return entities, relationships
except json.JSONDecodeError as e:
print("Error parsing JSON:", e)
return entities, relationships
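A quick check of `parse_fn` on a hand-written response (the JSON below is a made-up example, not real LLM output):
sample_response = """{
    "entities": [
        {"entity_name": "Chevron", "entity_type": "Company", "entity_description": "An energy company."}
    ],
    "relationships": [
        {"source_entity": "Chevron", "target_entity": "O&G sector", "relation": "operates in", "relationship_description": "Chevron operates in the O&G sector."}
    ]
}"""
entities, relationships = parse_fn(sample_response)
print(entities)       # [('Chevron', 'Company', 'An energy company.')]
print(relationships)  # [('Chevron', 'O&G sector', 'operates in', 'Chevron operates in the O&G sector.')]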
kg_extractor = GraphRAGExtractor(
llm=llm,
extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
max_paths_per_chunk=2,
parse_fn=parse_fn,
)
Docker and Neo4j Setup¶
To launch Neo4j locally, first ensure you have Docker installed. Then, you can start the database with the following Docker command.
docker run \
-p 7474:7474 -p 7687:7687 \
-v $PWD/data:/data -v $PWD/plugins:/plugins \
--name neo4j-apoc \
-e NEO4J_apoc_export_file_enabled=true \
-e NEO4J_apoc_import_file_enabled=true \
-e NEO4J_apoc_import_file_use__neo4j__config=true \
-e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
neo4j:latest
From here, you can open the database at http://localhost:7474/. On this page, you will be asked to sign in. Use the default username/password of neo4j and neo4j.
Once you log in for the first time, you will be asked to change the password.
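Before constructing the store, you can optionally verify that the database is reachable with the official `neo4j` Python driver (installed as a dependency of the Neo4j graph store integration). Replace `<PASSWORD>` with the password you set above:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "<PASSWORD>")
)
driver.verify_connectivity()  # raises if the container is down or credentials are wrong
driver.close()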
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
username="neo4j", password="<PASSWORD>", url="bolt://localhost:7687"
)
Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: "CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships"
from llama_index.core import PropertyGraphIndex
index = PropertyGraphIndex(
nodes=nodes,
kg_extractors=[kg_extractor],
property_graph_store=graph_store,
show_progress=True,
)
Extracting paths from text: 100%|██████████| 50/50 [05:45<00:00, 6.90s/it] Generating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.59s/it] Generating embeddings: 100%|██████████| 2/2 [00:03<00:00, 1.90s/it] Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: "CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships"
index.property_graph_store.get_triplets()[10]
[EntityNode(label='Software', embedding=None, properties={'id': 'Unreal Engine', 'entity_description': "Unreal Engine is a game engine developed by Epic. It is used in conjunction with Epic's MetaHuman Animator tool to animate hyperrealistic MetaHumans.", 'triplet_source_id': 'b6fbbdc0-cc13-4342-a70e-b0d86f3fd2ad'}, name='MetaHuman Animator'), Relation(label='Integrated', source_id='MetaHuman Animator', target_id='Unreal Engine', properties={'relationship_description': 'The MetaHuman Animator tool developed by Epic is integrated with the Unreal Engine. It applies the captured actor’s facial performance to a hyperrealistic “MetaHuman” in the Unreal Engine.', 'triplet_source_id': 'a6f5c123-65a8-4278-8e24-e103e767b82f'}), EntityNode(label='Software', embedding=None, properties={'id': 'MetaHuman Animator', 'entity_description': 'MetaHuman Animator is a tool developed by Epic that captures an actor’s facial performance using a device as simple as an iPhone and applies it to a MetaHuman in the Unreal Engine. It is designed to produce results quickly and efficiently.', 'triplet_source_id': 'b6fbbdc0-cc13-4342-a70e-b0d86f3fd2ad'}, name='Unreal Engine')]
index.property_graph_store.get_triplets()[10][0].properties
{'id': 'Unreal Engine', 'entity_description': "Unreal Engine is a game engine developed by Epic. It is used in conjunction with Epic's MetaHuman Animator tool to animate hyperrealistic MetaHumans.", 'triplet_source_id': 'b6fbbdc0-cc13-4342-a70e-b0d86f3fd2ad'}
index.property_graph_store.get_triplets()[10][1].properties
{'relationship_description': 'The MetaHuman Animator tool developed by Epic is integrated with the Unreal Engine. It applies the captured actor’s facial performance to a hyperrealistic “MetaHuman” in the Unreal Engine.', 'triplet_source_id': 'a6f5c123-65a8-4278-8e24-e103e767b82f'}
Build Communities¶
This will create communities and generate a summary for each community.
index.property_graph_store.build_communities()
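Optionally, you can peek at what was produced (community ids are assigned by `hierarchical_leiden`):
summaries = index.property_graph_store.get_community_summaries()
print(len(summaries))  # number of detected communities
first_id = next(iter(summaries))
print(first_id, summaries[first_id][:200])  # preview of one summary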
Create the QueryEngine¶
query_engine = GraphRAGQueryEngine(
graph_store=index.property_graph_store,
llm=llm,
index=index,
similarity_top_k=10,
)
Querying¶
response = query_engine.query(
"What are the main news discussed in the document?"
)
display(Markdown(f"{response.response}"))
The document discusses several significant business news items: FirstEnergy's earnings results, the appointment of Tram Nguyen as global head of strategic and sustainable investments at Bank of America, Morgan Stanley hiring Thomas Christl to co-lead its coverage of consumer and retail clients in Europe alongside Imran Ansari, and the significant impact of the COVID-19 pandemic on Delta Air Lines and Southwest Airlines, including the suspension and resumption of their dividend payments.
response = query_engine.query("What are the main news in energy sector?")
display(Markdown(f"{response.response}"))
The main news in the energy sector is that GE Vernova and Amplus Solar have established a supplier-customer relationship. Amplus Solar has chosen GE Vernova to supply and install 40 of its 2.7-132 onshore wind turbines for a 108 MW wind project. This means GE Vernova will provide the necessary equipment and services for the successful execution of the project.