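Auto-Retrieval from a Vector Database

This guide shows how to perform auto-retrieval in LlamaIndex.

Many popular vector databases support a set of metadata filters in addition to a query string for semantic search. Given a natural language query, we first use the LLM to infer a set of metadata filters as well as the right query string to pass to the vector database (either can be blank). This overall query bundle is then executed against the vector database.

This allows for more dynamic, expressive forms of retrieval beyond top-k semantic search. The relevant context for a given query may require only filtering on a metadata tag, or a joint combination of filtering and semantic search within the filtered set, or just raw semantic search.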
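
To make the three retrieval modes concrete, here is a minimal, library-free sketch. It is not how LlamaIndex or Chroma implement retrieval: the `search` helper, the bag-of-words "embedding", and the shortened corpus are all assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt

# Toy corpus: (text, metadata) pairs standing in for vector-store entries.
DOCS = [
    (
        "Michael Jordan is a retired professional basketball player",
        {"category": "Sports", "country": "United States"},
    ),
    (
        "Rihanna is a Barbadian singer, actress, and businesswoman",
        {"category": "Music", "country": "Barbados"},
    ),
    (
        "Cristiano Ronaldo is a Portuguese professional footballer",
        {"category": "Sports", "country": "Portugal"},
    ),
]


def bow(text):
    """Bag-of-words vector; a crude stand-in for a real embedding."""
    return Counter(text.lower().replace(",", " ").split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query_str="", filters=None, top_k=2):
    """Apply exact-match metadata filters, then rank survivors by similarity."""
    filters = filters or {}
    hits = [
        (text, meta)
        for text, meta in DOCS
        if all(meta.get(k) == v for k, v in filters.items())
    ]
    if query_str:  # a blank query string means "filter only"
        q = bow(query_str)
        hits.sort(key=lambda d: cosine(q, bow(d[0])), reverse=True)
    return [text for text, _ in hits[:top_k]]


# 1. Metadata filtering only (blank query string).
filter_only = search(filters={"country": "Barbados"})
# 2. Filtering plus semantic search within the filtered set.
filter_and_semantic = search("footballer", filters={"category": "Sports"}, top_k=1)
# 3. Raw semantic search (no filters).
semantic_only = search("singer", top_k=1)
```

In auto-retrieval, the LLM decides which of these three shapes a query takes by choosing the query string and filters; the vector database then does the work that `search` only imitates here.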

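We demonstrate an example with Chroma, but auto-retrieval is also implemented for many other vector databases (e.g. Pinecone, Weaviate, and more).

Setup

We first define imports and create an empty Chroma collection.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.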

%pip install llama-index-vector-stores-chroma

!pip install llama-index

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# set up OpenAI
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

import chromadb

chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.

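Defining Some Sample Data

We insert some sample nodes containing text chunks into the vector database. Note that each TextNode contains not only the text but also metadata, e.g. category and country. These metadata fields are converted and stored as such in the underlying vector database.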

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
        },
    ),
]

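Build Vector Index with Chroma Vector Store

Here we load the data into the vector store. As mentioned above, both the text and the metadata of each node are converted into corresponding representations in Chroma. We can now run semantic queries, as well as metadata filtering, on this data from Chroma.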

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(nodes, storage_context=storage_context)

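Define `VectorIndexAutoRetriever`

We define our core VectorIndexAutoRetriever module. The module takes in a VectorStoreInfo, which contains a structured description of the vector store collection and the metadata filters it supports. This information is then used in the auto-retrieval prompt, where the LLM infers the metadata filters.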

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index, vector_store_info=vector_store_info
)
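
To give a rough idea of how a structured description like this could end up in a prompt, the sketch below serializes an equivalent description to JSON and embeds it in a prompt string. `FieldInfo`, `StoreInfo`, and the prompt wording are hypothetical stand-ins invented for this illustration; they are not the library's classes or its actual auto-retrieval prompt.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldInfo:  # hypothetical stand-in for MetadataInfo
    name: str
    type: str
    description: str


@dataclass
class StoreInfo:  # hypothetical stand-in for VectorStoreInfo
    content_info: str
    metadata_info: list = field(default_factory=list)


info = StoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        FieldInfo(
            "category",
            "str",
            "Category of the celebrity, one of"
            " [Sports, Entertainment, Business, Music]",
        ),
        FieldInfo(
            "country",
            "str",
            "Country of the celebrity, one of"
            " [United States, Barbados, Portugal]",
        ),
    ],
)

# asdict recurses into nested dataclasses, so the result is JSON-serializable.
info_str = json.dumps(asdict(info), indent=2)

# Hypothetical prompt template; the wording is illustrative only.
prompt = (
    "Data source description:\n"
    f"{info_str}\n\n"
    "User query: Tell me about two celebrities from United States\n"
    "Return a JSON object with keys 'query', 'filters', and 'top_k'."
)
print(prompt)
```

The takeaway is that the quality of each `description` matters: listing the allowed values (as done for category and country) gives the LLM what it needs to emit filter values that actually match the stored metadata.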

Running over Some Sample Data

We try running over some sample data. Note how the metadata filters are inferred: this helps with more precise retrieval!

retriever.retrieve("Tell me about two celebrities from United States")

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: celebrities
Using query str: celebrities
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'country': 'United States'}
Using filters: {'country': 'United States'}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2
Using top_k: 2

Out[ ]

[NodeWithScore(node=TextNode(id_='b2ab3b1a-5731-41ec-b884-405016de5a34', embedding=None, metadata={'category': 'Entertainment', 'country': 'United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='28e1d0d600908a5e9f0c388f0d49b0cd58920dc13e4f2743becd135ac0f18799', text='Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.32621567877748514),
 NodeWithScore(node=TextNode(id_='e0104b6a-676a-4c83-95b7-b018cb8b39b2', embedding=None, metadata={'category': 'Sports', 'country': 'United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='7456e8d70b089c3830424e49b2a03c8d6d3f5cd0de42b0669a8ee518eca01012', text='Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3734030955060519)]

retriever.retrieve("Tell me about Sports celebrities from United States")
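
The output of this second query is not shown here. As a rough sanity check of what exact-match filtering would leave, the sketch below assumes the LLM infers both a category and a country filter. The JSON string is a hypothetical LLM response (a real run may infer something different), and `SAMPLE` simply restates the metadata of the five nodes defined earlier.

```python
import json

# Metadata of the five sample nodes defined earlier, keyed by person.
SAMPLE = {
    "Michael Jordan": {"category": "Sports", "country": "United States"},
    "Angelina Jolie": {"category": "Entertainment", "country": "United States"},
    "Elon Musk": {"category": "Business", "country": "United States"},
    "Rihanna": {"category": "Music", "country": "Barbados"},
    "Cristiano Ronaldo": {"category": "Sports", "country": "Portugal"},
}

# Assumed (hypothetical) LLM output for the query above; a real run may differ.
llm_output = (
    '{"query": "celebrities", '
    '"filters": {"category": "Sports", "country": "United States"}, '
    '"top_k": null}'
)

spec = json.loads(llm_output)

# Keep only the entries whose metadata matches every inferred filter exactly.
matches = [
    name
    for name, meta in SAMPLE.items()
    if all(meta.get(key) == value for key, value in spec["filters"].items())
]
print(matches)  # prints ['Michael Jordan']
```

Under that assumption only one node survives the filter, so semantic ranking has nothing left to reorder. Plain top-k semantic search over all five nodes has no such hard constraint on category or country.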