Gel Vector Store¶

Gel is an open-source PostgreSQL data layer optimized for fast development-to-production cycles. It provides a high-level, strictly typed, graph-like data model, a composable hierarchical query language, full SQL support, migrations, and Auth and AI modules.

If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
! pip install gel llama-index-vector-stores-gel
! pip install llama-index
# import logging
# import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.gel import GelVectorStore
import textwrap
import openai
Setup OpenAI¶

The first step is to configure the openai key. It will be used to create embeddings for the documents loaded into the index.
import os
os.environ["OPENAI_API_KEY"] = "<your key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
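If you'd rather not hardcode the key in the notebook, a small helper can fail fast with a clear message when the variable is missing. This is an optional convenience pattern, not part of LlamaIndex; `require_env` is a name made up here for illustration:

```python
import os


def require_env(name: str) -> str:
    # Return the environment variable's value, or raise with an
    # actionable message instead of failing later inside the embedding call.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before running this notebook")
    return value
```

Then `openai.api_key = require_env("OPENAI_API_KEY")` replaces the hardcoded assignment above.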
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Loading documents¶

Load the documents stored in data/paul_graham/ using the SimpleDirectoryReader.
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
! gel project init --non-interactive
If you are using Gel Cloud (recommended!), add one more argument to that command:

gel project init --server-instance <org-name>/<instance-name>

For a complete list of ways to run Gel, see the Running Gel section of the reference docs.
Set up the schema¶

Gel schema is an explicit high-level description of your application's data model. Aside from enabling you to define the layout of your data precisely, it drives many of Gel's powerful features, such as links, access policies, functions, triggers, constraints, indexes, and more.

LlamaIndex's GelVectorStore expects the schema to have the following layout:
schema_content = """
using extension pgvector;

module default {
    scalar type EmbeddingVector extending ext::pgvector::vector<1536>;

    type Record {
        required collection: str;
        text: str;
        embedding: EmbeddingVector;
        external_id: str {
            constraint exclusive;
        };
        metadata: json;

        index ext::pgvector::hnsw_cosine(m := 16, ef_construction := 128)
            on (.embedding)
    }
}
""".strip()

with open("dbschema/default.gel", "w") as f:
    f.write(schema_content)
To apply the schema changes to the database, run a migration using Gel's migration tool:
! gel migration create --non-interactive
! gel migrate
From this point onward, GelVectorStore can be used as a drop-in replacement for any other vector store in LlamaIndex.
Create the index¶
vector_store = GelVectorStore()

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
Query the index¶

We can now ask questions using our index.
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
response = query_engine.query("What happened in the mid 1980s?")
print(textwrap.fill(str(response), 100))
Metadata filters¶

GelVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.

Download git commits dataset¶
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
import csv

with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))

print(commits[0])
print(len(commits))
Add nodes with custom metadata¶
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re

nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
                "fixes": fixes,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)

print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
vector_store = GelVectorStore()
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
Apply metadata filters¶

Now we can filter by commit author or by date when retrieving nodes.
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="[email protected]"),
        MetadataFilter(key="author", value="[email protected]"),
    ],
    condition="or",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
Apply nested filters¶

In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.
filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="[email protected]"),
                MetadataFilter(key="author", value="[email protected]"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
The above can be simplified by using the IN operator. GelVectorStore supports in, nin, and contains for comparing an element with a list.
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="in",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
# Same thing, with NOT IN
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="nin",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
# CONTAINS
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")

for node in retrieved_nodes:
    print(node.node.metadata)