Postgres Vector Store¶
In this notebook we show how to use PostgreSQL and pgvector to perform vector searches in LlamaIndex.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-postgres
!pip install llama-index
Running the following cell will install Postgres with PGVector in Colab.
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
# import logging
# import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
import openai
Setup OpenAI¶
The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.
import os
os.environ["OPENAI_API_KEY"] = "<your key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
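If you'd rather not hard-code the key in a notebook you may share, a small alternative sketch using Python's standard getpass module prompts for it at runtime instead:
import getpass

# Prompt for the key interactively rather than embedding it in the notebook
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
openai.api_key = os.environ["OPENAI_API_KEY"]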
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-07-29 15:29:26--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.04s

2024-07-29 15:29:26 (1.75 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Loading documents¶
Load the documents stored in data/paul_graham/ using the SimpleDirectoryReader.
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
Document ID: e9a28a97-73af-42dd-8b40-9c585e222e69
Create the Database¶
Using an existing Postgres instance running on localhost, create the database we'll be using.
import psycopg2
connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
Create the index¶
Here we create an index backed by Postgres using the documents loaded previously. PGVectorStore takes a few arguments. The example below constructs a PGVectorStore with an HNSW index, with m = 16, ef_construction = 64, ef_search = 40, and the vector_cosine_ops distance method.
from sqlalchemy import make_url
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
Query the index¶
We can now ask questions using our index.
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
response = query_engine.query("What happened in the mid 1980s?")
print(textwrap.fill(str(response), 100))
In the mid-1980s, the context mentions the presence of a famous fund manager who was not much older than the author and was super rich. This sparked a thought in the author's mind about becoming rich as well to have the freedom to work on whatever he wanted.
Querying an existing index¶
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The author worked on writing essays, programming, developing software, giving talks, and starting a startup.
Hybrid Search¶
To enable hybrid search, you need to:
- pass in hybrid_search=True when constructing the PGVectorStore (and optionally configure text_search_config with the desired language)
- pass in vector_store_query_mode="hybrid" when constructing the query engine (this config is passed to the retriever under the hood). You can also optionally set sparse_top_k to configure how many results we should obtain from sparse text search (default is using the same value as similarity_top_k).
from sqlalchemy import make_url
url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay_hybrid_search",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
storage_context = StorageContext.from_defaults(
    vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
hybrid_query_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
    "Who does Paul Graham think of with the word schtick"
)
print(hybrid_response)
Roy Lichtenstein
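It can be instructive to see what the sparse (full-text search) half of hybrid retrieval returns on its own. The sketch below queries the same index with vector_store_query_mode="sparse", the mode used internally for the text-search side:
# Retrieve with text search only, to inspect the sparse side of hybrid search
sparse_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    sparse_top_k=2,
)
nodes = sparse_retriever.retrieve(
    "Who does Paul Graham think of with the word schtick"
)
for node in nodes:
    print(node.score, node.node.text[:100])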
Improving hybrid search with QueryFusionRetriever¶
Since the scores for text search and vector search are calculated differently, the nodes that were found only by text search will have a much lower score. You can often improve hybrid search performance by using QueryFusionRetriever, which makes better use of the mutual information to rank the nodes.
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
vector_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="default",
    similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    similarity_top_k=5,  # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
    [vector_retriever, text_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
)
response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query(
    "Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
Paul Graham associates the word "schtick" with artists who have a distinctive signature style in their work. This signature style serves as a visual identifier unique to the artist, making their work easily recognizable and attributed to them.
Metadata filters¶
PGVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.
Download git commits dataset¶
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
import csv
with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))
print(commits[0])
print(len(commits))
{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<[email protected]>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}
4167
Add nodes with custom metadata¶
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re
nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
                "fixes": fixes,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)
print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
Node ID: 50fdca05-b1ce-41d8-b771-b13aa3ad7df0
Text: Fix segfault in set_integer_now_func When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037
2023-03-22 to 2023-09-05
{'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.
Apply metadata filters¶
Now we can filter by commit author or by date when retrieving nodes.
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="[email protected]"),
        MetadataFilter(key="author", value="[email protected]"),
    ],
    condition="or",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-27', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-07-13', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-30', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-07-25', 'author': '[email protected]', 'fixes': ['5892']}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-17', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-24', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-23', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-20', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-21', 'author': '[email protected]', 'fixes': []}
Apply nested filters¶
In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.
For example, in SQL:
WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = '[email protected]' OR author = '[email protected]')
filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="[email protected]"),
                MetadataFilter(key="author", value="[email protected]"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
The above can be simplified by using the IN operator. PGVectorStore supports in, nin, and contains for comparing an element with a list.
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="in",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-07', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
# Same thing, with NOT IN
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["[email protected]", "[email protected]"],
            operator="nin",
        ),
    ],
    condition="and",
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("What is this software project about?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5805']}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-11', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-03', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
{'commit_date': '2023-08-03', 'author': '[email protected]', 'fixes': ['5908']}
{'commit_date': '2023-08-01', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '[email protected]', 'fixes': []}
# CONTAINS
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)
retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
{'commit_date': '2023-08-09', 'author': '[email protected]', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
PgVector Query Options¶
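Both retrievers below pass pgvector-specific search parameters through vector_store_kwargs: ivfflat_probes sets the number of lists probed when an IVFFlat index serves the query (higher values trade speed for recall), and hnsw_ef_search sets the size of the dynamic candidate list used during HNSW search.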
# IVFFlat probes: number of lists probed when an IVFFlat index is used
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"ivfflat_probes": 10},
)
# HNSW ef_search: size of the dynamic candidate list for HNSW search
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"hnsw_ef_search": 300},
)