Postgres Vector Store¶
In this notebook we show how to use PostgreSQL and pgvector to perform vector searches in LlamaIndex.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-postgres
!pip install llama-index
Running the following cell will install Postgres with PGVector in Colab.
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
# import logging
# import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
import openai
Setup OpenAI¶
The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.
import os
os.environ["OPENAI_API_KEY"] = "<your key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
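If you'd rather not hard-code the key in a notebook you may share, a small alternative sketch using Python's standard getpass module prompts for it at runtime instead:
import getpass

# Prompt for the key interactively rather than embedding it in the notebook
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
openai.api_key = os.environ["OPENAI_API_KEY"]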
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-07-29 15:29:26--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s

2024-07-29 15:29:26 (1.75 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Loading documents¶
Load the documents stored in data/paul_graham/ using the SimpleDirectoryReader.
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
Document ID: e9a28a97-73af-42dd-8b40-9c585e222e69
Create the Database¶
Using an existing Postgres instance running on localhost, create the database we'll be using.
import psycopg2
connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
Create the index¶
Here we create an index backed by Postgres using the documents loaded previously. PGVectorStore takes a few arguments. The example below constructs a PGVectorStore with an HNSW index, with m = 16, ef_construction = 64, ef_search = 40, and the vector_cosine_ops distance method.
from sqlalchemy import make_url
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
Query the index¶
We can now ask questions using our index.
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
response = query_engine.query("What happened in the mid 1980s?")
print(textwrap.fill(str(response), 100))
In the mid-1980s, the context mentions the presence of a famous fund manager who was not much older than the author and was super rich. This sparked a thought in the author's mind about becoming rich as well to have the freedom to work on whatever he wanted.
Querying an existing index¶
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
Hybrid Search¶
To enable hybrid search, you need to:
- pass in hybrid_search=True when constructing the PGVectorStore (and optionally configure text_search_config with the desired language)
- pass in vector_store_query_mode="hybrid" when constructing the query engine (this config is passed to the retriever under the hood). You can also optionally set sparse_top_k to configure how many results we should obtain from sparse text search (default is using the same value as similarity_top_k).
from sqlalchemy import make_url
url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay_hybrid_search",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
storage_context = StorageContext.from_defaults(
    vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
hybrid_query_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
    "Who does Paul Graham think of with the word schtick"
)
print(hybrid_response)
Roy Lichtenstein
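It can be instructive to see what the sparse (full-text search) half of hybrid retrieval returns on its own. The sketch below queries the same index with vector_store_query_mode="sparse", the mode used internally for the text-search side:
# Retrieve with text search only, to inspect the sparse side of hybrid search
sparse_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    sparse_top_k=2,
)
nodes = sparse_retriever.retrieve(
    "Who does Paul Graham think of with the word schtick"
)
for node in nodes:
    print(node.score, node.node.text[:100])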
Improving hybrid search with QueryFusionRetriever¶
Since the scores for text search and vector search are calculated differently, the nodes that were found only by text search will have a much lower score. You can often improve hybrid search performance by using QueryFusionRetriever, which makes better use of the mutual information to rank the nodes.
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
vector_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="default",
    similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    similarity_top_k=5,  # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
    [vector_retriever, text_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
)
response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query(
    "Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
Paul Graham associates the word "schtick" with artists who have a distinctive signature style in their work. This signature style serves as a visual identifier unique to the artist, making their work easily recognizable and attributed to them.
Metadata filters¶
PGVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.
Download git commits dataset¶
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
import csv
with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))
print(commits[0])
print(len(commits))
{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<[email protected]>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}
4167
Add nodes with custom metadata¶
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re
nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
                "fixes": fixes,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)
print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
Node ID: 50fdca05-b1ce-41d8-b771-b13aa3ad7df0
Text: Fix segfault in set_integer_now_func When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037
2023-03-22 to 2023-09-05
{'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.
Apply metadata filters¶
Now we can filter by commit author or by date when retrieving nodes.
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="[email protected]"),
        MetadataFilter(key="author", value="[email protected]"),
    ],
    condition="or",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-27', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-07-13', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-30', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-07-25', 'author': '[email protected]', 'fixes': ['5892']}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-17', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-24', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-20', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
Apply nested filters¶
In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.
For example, in SQL:
WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = '[email protected]' OR author = '[email protected]')
filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="[email protected]"),
                MetadataFilter(key="author", value="[email protected]"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
The above can be simplified by using the IN operator. PGVectorStore supports in, nin, and contains for comparing an element with a list.
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="in",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
# Same thing, with NOT IN
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="nin",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5805']}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-11', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-03', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
{'commit_date': '2023-08-03', 'author': '[email protected]', 'fixes': ['5908']}
{'commit_date': '2023-08-01', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
# CONTAINS
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
PgVector Query Options¶
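Both retrievers below pass pgvector-specific search parameters through vector_store_kwargs: ivfflat_probes sets the number of lists probed when an IVFFlat index serves the query (higher values trade speed for recall), and hnsw_ef_search sets the size of the dynamic candidate list used during HNSW search.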
# IVFFlat probes: number of lists probed when an IVFFlat index is used
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"ivfflat_probes": 10},
)
# HNSW ef_search: size of the dynamic candidate list for HNSW search
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"hnsw_ef_search": 300},
)