Setup¶
We first define our imports and create an empty Weaviate collection.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]
%pip install llama-index-vector-stores-weaviate
In [ ]
!pip install llama-index weaviate-client
In [ ]
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
We'll use GPT-4's reasoning capabilities to infer the metadata filters. Depending on your use case, "gpt-3.5-turbo" may also work.
In [ ]
# set up OpenAI
import os
import getpass
import openai
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
openai.api_key = os.environ["OPENAI_API_KEY"]
In [ ]
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding()
This Notebook uses Weaviate in Embedded mode, which is supported on Linux and macOS.
If you prefer to try out Weaviate's fully managed service, Weaviate Cloud Services (WCS), you can uncomment the code block below instead.
In [ ]
import weaviate
# Connect to Weaviate client in embedded mode
client = weaviate.connect_to_embedded()
# Enable this code if you want to use Weaviate Cloud Services instead of Embedded mode.
"""
import weaviate
# cloud
cluster_url = ""
api_key = ""
client = weaviate.connect_to_wcs(cluster_url=cluster_url,
auth_credentials=weaviate.auth.AuthApiKey(api_key),
)
# local
# client = weaviate.connect_to_local()
"""
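One cleanup note: the Weaviate v4 client keeps a connection open (and, in Embedded mode, a local Weaviate process running). A minimal cleanup sketch, to be run once at the very end of your session:

# Release the connection; in Embedded mode this should also
# shut down the local Weaviate instance.
client.close()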
Defining Some Sample Data¶
We insert some sample nodes containing text chunks into the vector database. Note that each TextNode not only contains the text, but also metadata, e.g. category and country. These metadata fields will be converted/stored as such in the underlying vector database.
In [ ]
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text=(
"Michael Jordan is a retired professional basketball player,"
" widely regarded as one of the greatest basketball players of all"
" time."
),
metadata={
"category": "Sports",
"country": "United States",
},
),
TextNode(
text=(
"Angelina Jolie is an American actress, filmmaker, and"
" humanitarian. She has received numerous awards for her acting"
" and is known for her philanthropic work."
),
metadata={
"category": "Entertainment",
"country": "United States",
},
),
TextNode(
text=(
"Elon Musk is a business magnate, industrial designer, and"
" engineer. He is the founder, CEO, and lead designer of SpaceX,"
" Tesla, Inc., Neuralink, and The Boring Company."
),
metadata={
"category": "Business",
"country": "United States",
},
),
TextNode(
text=(
"Rihanna is a Barbadian singer, actress, and businesswoman. She"
" has achieved significant success in the music industry and is"
" known for her versatile musical style."
),
metadata={
"category": "Music",
"country": "Barbados",
},
),
TextNode(
text=(
"Cristiano Ronaldo is a Portuguese professional footballer who is"
" considered one of the greatest football players of all time. He"
" has won numerous awards and set multiple records during his"
" career."
),
metadata={
"category": "Sports",
"country": "Portugal",
},
),
]
Build Vector Index with Weaviate Vector Store¶
Here we load the data into the vector store. As mentioned above, both the text and the metadata for each node will be converted into corresponding representations in Weaviate. We can now run semantic queries as well as metadata filtering over this data in Weaviate.
In [ ]
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore
vector_store = WeaviateVectorStore(
weaviate_client=client, index_name="LlamaIndex_filter"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
In [ ]
index = VectorStoreIndex(nodes, storage_context=storage_context)
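Before moving on to auto-retrieval, it can be useful to sanity-check the stored metadata with an explicit, hand-written filter. Below is a minimal sketch using LlamaIndex's standard filter classes; the retriever name and query string are our own, and the filter value matches the metadata defined above:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only retrieve nodes whose `category` metadata is exactly "Sports"
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="category", value="Sports")]
)
manual_retriever = index.as_retriever(filters=filters)
for result in manual_retriever.retrieve("famous athletes"):
    print(result.node.metadata)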
Define VectorIndexAutoRetriever¶
We define our core module, the VectorIndexAutoRetriever. The module takes in a VectorStoreInfo, which contains a structured description of the vector store collection and the metadata filters it supports. This information is then used in the auto-retrieval prompt, where the LLM infers the metadata filters.
In [ ]
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
content_info="brief biography of celebrities",
metadata_info=[
MetadataInfo(
name="category",
type="str",
description=(
"Category of the celebrity, one of [Sports, Entertainment,"
" Business, Music]"
),
),
MetadataInfo(
name="country",
type="str",
description=(
"Country of the celebrity, one of [United States, Barbados,"
" Portugal]"
),
),
],
)
retriever = VectorIndexAutoRetriever(
index, vector_store_info=vector_store_info
)
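If you want synthesized answers rather than raw nodes, the auto-retriever can also be plugged into a standard query engine. A minimal sketch (the query_engine name and query string are our own):

from llama_index.core.query_engine import RetrieverQueryEngine

# Runs auto-retrieval first, then synthesizes an answer from the
# retrieved nodes using the LLM configured in Settings.
query_engine = RetrieverQueryEngine.from_args(retriever)
print(query_engine.query("Which athletes are from the United States?"))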
Running over Some Sample Data¶
We try running over some sample data. Note how the metadata filters are inferred - this helps with more precise retrieval!
In [ ]
response = retriever.retrieve("Tell me about celebrities from United States")
In [ ]
print(response[0])
In [ ]
response = retriever.retrieve(
"Tell me about Sports celebrities from United States"
)
In [ ]
print(response[0])
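Each result returned by retrieve is a NodeWithScore, so beyond printing the first entry you can also inspect the metadata and similarity score of every retrieved node:

# Print each retrieved node's metadata and its similarity score
for node_with_score in response:
    print(node_with_score.node.metadata, node_with_score.score)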