Property Graph Index
In this notebook, we demonstrate some basic usage of the PropertyGraphIndex in LlamaIndex.
The property graph index here will take unstructured documents, extract a property graph from them, and provide various methods to query that graph.
%pip install llama-index
Setup
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
import nest_asyncio
nest_asyncio.apply()
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
Construction
from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
index = PropertyGraphIndex.from_documents(
documents,
llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
show_progress=True,
)
/Users/loganmarkewich/Library/Caches/pypoetry/virtualenvs/llama-index-bXUwlEfH-py3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 25.46it/s]
Extracting paths from text: 100%|██████████| 22/22 [00:12<00:00, 1.72it/s]
Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 36186.15it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 1.14it/s]
Generating embeddings: 100%|██████████| 5/5 [00:00<00:00, 5.43it/s]
Let's recap what just happened:
- PropertyGraphIndex.from_documents() - we loaded documents into an index
- Parsing nodes - the index parsed the documents into nodes
- Extracting paths from text - the nodes were passed to an LLM, which was prompted to generate knowledge graph triples (i.e. paths)
- Extracting implicit paths - each node.relationships property was used to infer implicit paths
- Generating embeddings - embeddings were generated for each text node and each graph node (hence this step happens twice)
Let's explore what we created! For debugging purposes, the default SimplePropertyGraphStore includes a helper function to save a networkx representation of the graph to an html file.
index.property_graph_store.save_networkx_graph(name="./kg.html")
Opening the html file in a browser, we can see our graph!
If you zoom in, each "dense" node with many connections is actually a source chunk, with the extracted entities and relations branching off from it.
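Besides the html export, you can inspect the extracted triples programmatically. A minimal sketch, assuming the get_triplets() filter method of the PropertyGraphStore interface (present in recent llama-index-core versions), and assuming an entity named "Interleaf" was extracted from the essay:
# "Interleaf" is a hypothetical lookup - use an entity your run actually extracted
triplets = index.property_graph_store.get_triplets(entity_names=["Interleaf"])
for source, relation, target in triplets[:5]:
    print(source.name, "->", relation.label, "->", target.name)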
Customizing Low-Level Construction
If we wanted, we could do the same ingestion using the lower-level API, leveraging kg_extractors.
from llama_index.core.indices.property_graph import (
ImplicitPathExtractor,
SimpleLLMPathExtractor,
)
index = PropertyGraphIndex.from_documents(
documents,
embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
kg_extractors=[
ImplicitPathExtractor(),
SimpleLLMPathExtractor(
llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
num_workers=4,
max_paths_per_chunk=10,
),
],
show_progress=True,
)
For a full guide on all extractors, see the detailed usage page.
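As one more example of a kg_extractor, extraction can also be constrained to a schema with SchemaLLMPathExtractor. A minimal sketch; the entity labels, relation labels, and validation schema below are illustrative assumptions, not values taken from this notebook:
from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

# illustrative schema - swap in labels that fit your own domain
entities = Literal["PERSON", "ORGANIZATION", "PRODUCT"]
relations = Literal["FOUNDED", "WORKED_AT", "BUILT"]
schema = {
    "PERSON": ["FOUNDED", "WORKED_AT", "BUILT"],
    "ORGANIZATION": ["BUILT"],
}

schema_extractor = SchemaLLMPathExtractor(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=schema,
    strict=True,  # drop any triple that falls outside the schema
)
This extractor would then be passed in the kg_extractors list just like the ones above.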
Querying
Querying a property graph index typically consists of using one or more sub-retrievers and combining their results.
Graph retrieval can be thought of as:
- selecting node(s)
- traversing from those nodes
By default, two types of retrieval are used in unison (both can also be constructed explicitly, as sketched a little further below):
- synonym/keyword expansion - use the LLM to generate synonyms and keywords from the query
- vector retrieval - use embeddings to find nodes in your graph
Once nodes are found, you can either:
- return the paths adjacent to the selected nodes (i.e. triples)
- return the paths plus the original source text of the chunk (if available)
retriever = index.as_retriever(
include_text=False, # include source text, default True
)
nodes = retriever.retrieve("What happened at Interleaf and Viaweb?")
for node in nodes:
print(node.text)
Interleaf -> Was -> On the way down
Viaweb -> Had -> Code editor
Interleaf -> Built -> Impressive technology
Interleaf -> Added -> Scripting language
Interleaf -> Made -> Scripting language
Viaweb -> Suggested -> Take to hospital
Interleaf -> Had done -> Something bold
Viaweb -> Called -> After
Interleaf -> Made -> Dialect of lisp
Interleaf -> Got crushed by -> Moore's law
Dan giffin -> Worked for -> Viaweb
Interleaf -> Had -> Smart people
Interleaf -> Had -> Few years to live
Interleaf -> Made -> Software
Interleaf -> Made -> Software for creating documents
Paul graham -> Started -> Viaweb
Scripting language -> Was -> Dialect of lisp
Scripting language -> Is -> Dialect of lisp
Software -> Will be affected by -> Rapid change
Code editor -> Was -> In viaweb
Software -> Worked via -> Web
Programs -> Typed on -> Punch cards
Computers -> Skipped -> Step
Idea -> Was clear from -> Experience
Apartment -> Wasn't -> Rent-controlled
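The retriever above combined both default sub-retrievers under the hood. If you want to configure them yourself, here is a minimal sketch, assuming LLMSynonymRetriever, VectorContextRetriever, and the sub_retrievers argument of as_retriever work as in recent llama-index-core versions:
from llama_index.core.indices.property_graph import (
    LLMSynonymRetriever,
    VectorContextRetriever,
)

sub_retrievers = [
    # expand the query into synonyms/keywords and match entity names
    LLMSynonymRetriever(
        index.property_graph_store,
        llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
        include_text=False,
    ),
    # embed the query and retrieve similar graph nodes
    VectorContextRetriever(
        index.property_graph_store,
        embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
        include_text=False,
    ),
]

retriever = index.as_retriever(sub_retrievers=sub_retrievers)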
query_engine = index.as_query_engine(
include_text=True,
)
response = query_engine.query("What happened at Interleaf and Viaweb?")
print(str(response))
Interleaf had smart people and built impressive technology, including adding a scripting language that was a dialect of Lisp. However, despite their efforts, they were eventually impacted by Moore's Law and faced challenges. Viaweb, on the other hand, was started by Paul Graham and had a code editor where users could define their own page styles using Lisp expressions. Viaweb also suggested taking someone to the hospital and called something "After."
For full details on customizing retrieval and querying, see the docs page.
Storage
By default, storage happens using our simple in-memory abstractions - SimpleVectorStore for embeddings and SimplePropertyGraphStore for the property graph.
We can save and load these to/from disk.
index.storage_context.persist(persist_dir="./storage")
from llama_index.core import StorageContext, load_index_from_storage
index = load_index_from_storage(
StorageContext.from_defaults(persist_dir="./storage")
)
index.storage_context.persist(persist_dir="./storage") from llama_index.core import StorageContext, load_index_from_storage index = load_index_from_storage( StorageContext.from_defaults(persist_dir="./storage") )
Vector Stores
While some graph databases support vectors (like Neo4j), you can still specify the vector store to use on top of your graph for cases where vectors are not supported, or where you want to override the default.
Below we will combine ChromaVectorStore with the default SimplePropertyGraphStore.
%pip install llama-index-vector-stores-chroma
from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
client = chromadb.PersistentClient("./chroma_db")
collection = client.get_or_create_collection("my_graph_vector_db")
index = PropertyGraphIndex.from_documents(
documents,
embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
graph_store=SimplePropertyGraphStore(),
    vector_store=ChromaVectorStore(chroma_collection=collection),
show_progress=True,
)
index.storage_context.persist(persist_dir="./storage")
Then to load:
index = PropertyGraphIndex.from_existing(
SimplePropertyGraphStore.from_persist_dir("./storage"),
vector_store=ChromaVectorStore(chroma_collection=collection),
llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
)
This is slightly different than purely using the storage context, but the syntax is cleaner now that we are starting to mix things together.
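As a quick sanity check, the reloaded index can be queried exactly like the one built from documents; a minimal sketch:
# the reloaded index behaves like the original
retriever = index.as_retriever(include_text=False)
nodes = retriever.retrieve("What happened at Interleaf and Viaweb?")
print(len(nodes))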