A Simple to Advanced Guide with Auto-Retrieval (with Pinecone + Arize Phoenix)¶
In this notebook we show you how to perform auto-retrieval against Pinecone, which lets you execute a variety of semi-structured queries beyond standard top-k semantic search.
We show both how to set up basic auto-retrieval, as well as how to extend it (with custom prompts and dynamic metadata retrieval).
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-pinecone
# !pip install llama-index>=0.9.31 scikit-learn==1.2.2 arize-phoenix==2.4.1 pinecone-client>=3.0.0
Part 1: Setting Up Auto-Retrieval¶
To set up auto-retrieval, do the following:
- We'll do some setup, load data, and build a Pinecone vector index.
- We'll define our autoretriever and run some sample queries.
- We'll use Phoenix to observe each trace and visualize the prompt inputs/outputs.
- We'll show you how to customize the auto-retrieval prompt.
1.a Set Up Pinecone/Phoenix, Load Data, and Build the Vector Index¶
In this section, we set up Pinecone and ingest some toy data on books/movies (with both text data and metadata).
We also set up Phoenix so that we can capture downstream traces.
# setup Phoenix
import phoenix as px
import llama_index.core
px.launch_app()
llama_index.core.set_global_handler("arize_phoenix")
🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
import os
os.environ[
    "PINECONE_API_KEY"
] = "<Your Pinecone API key, from app.pinecone.io>"
# os.environ["OPENAI_API_KEY"] = "sk-..."
from pinecone import Pinecone
from pinecone import ServerlessSpec
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)
# delete if needed
# pc.delete_index("quickstart-index")
# Dimensions are for text-embedding-ada-002
try:
    pc.create_index(
        "quickstart-index",
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )
except Exception as e:
    # Most likely index already exists
    print(e)
    pass
pinecone_index = pc.Index("quickstart-index")
Load documents, build the PineconeVectorStore and VectorStoreIndex¶
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.schema import TextNode
nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
            "theme": "Fiction",
            "year": 2010,
        },
    ),
    TextNode(
        text="To Kill a Mockingbird",
        metadata={
            "author": "Harper Lee",
            "theme": "Fiction",
            "year": 1960,
        },
    ),
    TextNode(
        text="1984",
        metadata={
            "author": "George Orwell",
            "theme": "Totalitarianism",
            "year": 1949,
        },
    ),
    TextNode(
        text="The Great Gatsby",
        metadata={
            "author": "F. Scott Fitzgerald",
            "theme": "The American Dream",
            "year": 1925,
        },
    ),
    TextNode(
        text="Harry Potter and the Sorcerer's Stone",
        metadata={
            "author": "J.K. Rowling",
            "theme": "Fiction",
            "year": 1997,
        },
    ),
]
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="test",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
Upserted vectors: 0%| | 0/7 [00:00<?, ?it/s]
1.b Define the Autoretriever and Run Some Sample Queries¶
Setting up the VectorIndexAutoRetriever¶
One of the inputs is a schema describing what the vector store collection contains. This is similar to a table schema describing a table in a SQL database. This schema information is then injected into the prompt, which is passed to the LLM to infer what the full query should be (including metadata filters).
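Conceptually, the auto-retriever has the LLM turn a natural-language query plus this schema into a structured request: a free-text query string for semantic search and a set of metadata filters. A pure-Python sketch of the kind of JSON object the LLM emits (field names mirror the examples below; this is illustrative, not the library's internal representation):

```python
import json

# Hypothetical structured request the LLM might infer for:
# "Tell me about some movies after 2005 directed by Christopher Nolan"
structured_request = {
    "query": "movies",  # free-text portion used for semantic search
    "filters": [
        {"key": "year", "operator": ">", "value": 2005},
        {"key": "director", "operator": "==", "value": "Christopher Nolan"},
    ],
    "top_k": None,  # only set when the user asks for a specific number of results
}

print(json.dumps(structured_request, indent=2))
```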
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
    content_info="famous books and movies",
    metadata_info=[
        MetadataInfo(
            name="director",
            type="str",
            description=("Name of the director"),
        ),
        MetadataInfo(
            name="theme",
            type="str",
            description=("Theme of the book/movie"),
        ),
        MetadataInfo(
            name="year",
            type="int",
            description=("Year of the book/movie"),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    verbose=True,
)
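A note on the `default_empty_query_vector` hack above: a query like "books after 2000" may reduce to pure metadata filtering with no semantic text, but Pinecone still requires a query vector, so a zero vector matching the embedding dimension (1536 for `text-embedding-ada-002`) is supplied. A quick sanity check of that shape:

```python
# Zero vector used for blank queries; must match the index's embedding dimension
dim = 1536  # text-embedding-ada-002
default_empty_query_vector = [0] * dim

assert len(default_empty_query_vector) == dim
assert all(v == 0 for v in default_empty_query_vector)
```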
Run some queries¶
Let's run some example queries that take advantage of the structured information.
nodes = retriever.retrieve(
    "Tell me about some books/movies after the year 2000"
)
Using query str:
Using filters: [('year', '>', 2000)]
for node in nodes:
    print(node.text)
    print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
nodes = retriever.retrieve("Tell me about some books that are Fiction")
Using query str: Fiction
Using filters: [('theme', '==', 'Fiction')]
for node in nodes:
    print(node.text)
    print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}
Pass in Additional Metadata Filters¶
If you have additional metadata filters you'd like to pass in that aren't inferred automatically, do the following.
from llama_index.core.vector_stores import MetadataFilters
filter_dicts = [{"key": "year", "operator": "==", "value": 1997}]
filters = MetadataFilters.from_dicts(filter_dicts)
retriever2 = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    extra_filters=filters,
)
nodes = retriever2.retrieve("Tell me about some books that are Fiction")
for node in nodes:
    print(node.text)
    print(node.metadata)
Harry Potter and the Sorcerer's Stone
{'author': 'J.K. Rowling', 'theme': 'Fiction', 'year': 1997}
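Conceptually, `extra_filters` is combined (ANDed) with whatever filters the LLM infers. A toy pure-Python sketch of that combination over the metadata above (illustrative only, not the library's implementation):

```python
# Toy metadata rows mirroring a few of the nodes above
rows = [
    {"text": "To Kill a Mockingbird", "theme": "Fiction", "year": 1960},
    {"text": "Harry Potter and the Sorcerer's Stone", "theme": "Fiction", "year": 1997},
    {"text": "Inception", "theme": "Fiction", "year": 2010},
]

inferred = [("theme", "==", "Fiction")]  # what the LLM inferred from the query
extra = [("year", "==", 1997)]  # the fixed extra_filters

def matches(row, filters):
    ops = {"==": lambda a, b: a == b, ">": lambda a, b: a > b}
    return all(ops[op](row.get(key), value) for key, op, value in filters)

# A row must satisfy both the inferred and the extra filter sets
hits = [r["text"] for r in rows if matches(r, inferred + extra)]
print(hits)  # only the 1997 Fiction entry survives
```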
Example of a failing query¶
Note that no results are retrieved! We'll fix this later on.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
Using query str: books
Using filters: [('theme', '==', 'mafia')]
for node in nodes:
    print(node.text)
    print(node.metadata)
Part 2: Extending Auto-Retrieval (with Dynamic Metadata Retrieval)¶
We now extend auto-retrieval by customizing the prompt. In the first part, we explicitly add some rules.
In the second part, we implement dynamic metadata retrieval, which performs a first-stage retrieval pass to fetch relevant metadata from the vector db and insert it as few-shot examples into the auto-retrieval prompt. (The second-stage retrieval pass then retrieves the actual items from the vector db.)
2.a Improve the Auto-retrieval Prompt¶
Our auto-retrieval prompt works, but it can be improved in various ways. Some examples include the fact that it contains 2 hardcoded few-shot examples (how can you include your own?), and that the auto-retrieval doesn't "always" infer the right metadata filters.
For instance, all the `theme` fields are capitalized. How do we tell the LLM that, so it doesn't erroneously infer a "theme" that's in lower-case?
Let's take a stab at modifying the prompt!
from llama_index.core.prompts import display_prompt_dict
from llama_index.core import PromptTemplate
prompts_dict = retriever.get_prompts()
display_prompt_dict(prompts_dict)
# look at required template variables.
prompts_dict["prompt"].template_vars
['schema_str', 'info_str', 'query_str']
Customize the Prompt¶
Let's customize the prompt a little bit. We do the following:
- Take out the first few-shot example to save tokens
- Add a message to always capitalize the first letter if inferring "theme".
Note that the prompt template expects `schema_str`, `info_str`, and `query_str` to be defined.
# write prompt template, and modify it.
prompt_tmpl_str = """\
Your goal is to structure the user's query to match the request schema provided below.
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:
{schema_str}
The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be applied return [] for the filter value.
If the user's query explicitly mentions number of documents to retrieve, set top_k to that number, otherwise do not set top_k.
Do NOT EVER infer a null value for a filter. This will break the downstream program. Instead, don't include the filter.
<< Example 1. >>
Data Source:
```json
{{
"metadata_info": [
{{
"name": "author",
"type": "str",
"description": "Author name"
}},
{{
"name": "book_title",
"type": "str",
"description": "Book title"
}},
{{
"name": "year",
"type": "int",
"description": "Year Published"
}},
{{
"name": "pages",
"type": "int",
"description": "Number of pages"
}},
{{
"name": "summary",
"type": "str",
"description": "A short summary of the book"
}}
],
"content_info": "Classic literature"
}}
```
User Query:
What are some books by Jane Austen published after 1813 that explore the theme of marriage for social standing?
Additional Instructions:
None
Structured Request:
```json
{{"query": "Books related to theme of marriage for social standing", "filters": [{{"key": "year", "value": "1813", "operator": ">"}}, {{"key": "author", "value": "Jane Austen", "operator": "=="}}], "top_k": null}}
```
<< Example 2. >>
Data Source:
```json
{info_str}
```
User Query:
{query_str}
Additional Instructions:
{additional_instructions}
Structured Request:
"""
prompt_tmpl = PromptTemplate(prompt_tmpl_str)
You'll notice that we added an `additional_instructions` template variable. This allows us to insert instructions specific to the vector collection.
We'll use `partial_format` to add the instructions.
add_instrs = """\
If one of the filters is 'theme', please make sure that the first letter of the inferred value is capitalized. Only words that are capitalized are valid values for "theme". \
"""
prompt_tmpl = prompt_tmpl.partial_format(additional_instructions=add_instrs)
retriever.update_prompts({"prompt": prompt_tmpl})
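`partial_format` fills in some template variables now while leaving the rest to be supplied at query time. The idea can be sketched in plain Python with a hypothetical helper (not LlamaIndex's implementation):

```python
def partial_format(template: str, **known) -> str:
    """Fill in known variables, leaving unknown {placeholders} intact."""

    class KeepMissing(dict):
        def __missing__(self, key):
            return "{" + key + "}"

    return template.format_map(KeepMissing(**known))


tmpl = "Schema: {schema_str}\nInstructions: {additional_instructions}"
partially = partial_format(
    tmpl, additional_instructions="Capitalize inferred 'theme' values."
)
print(partially)  # {schema_str} remains an open placeholder
```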
Re-run some queries¶
Now let's try rerunning some queries, and you'll see that the value is auto-inferred.
nodes = retriever.retrieve(
    "Tell me about some books that are friendship-themed"
)
for node in nodes:
    print(node.text)
    print(node.metadata)
2.b Implement Dynamic Metadata Retrieval¶
An option besides hardcoding rules in the prompt is to retrieve relevant few-shot examples of metadata, to help the LLM better infer the correct metadata filters.
This will better prevent the LLM from making mistakes when inferring "where" clauses, especially around aspects like spelling / correct formatting of the value.
We can do this via vector retrieval. The existing vector db collection stores the raw text + metadata; we could query this collection directly, or separately index only the metadata and retrieve from that. In this section we choose the former, but in practice you may want the latter.
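The two-stage idea can be sketched in plain Python: a first pass scores stored metadata against the query (here by naive word overlap rather than embeddings), and the top entries become few-shot context for the prompt. This is illustrative only; the cells below use the actual vector index for the first stage.

```python
import re

# Toy metadata mirroring a few of the stored nodes
metadata_rows = [
    {"director": "Francis Ford Coppola", "theme": "Mafia", "year": 1972},
    {"author": "Harper Lee", "theme": "Fiction", "year": 1960},
    {"author": "George Orwell", "theme": "Totalitarianism", "year": 1949},
]

def score(query: str, row: dict) -> int:
    # Naive relevance: count query words appearing in the metadata values
    words = set(re.findall(r"\w+", query.lower()))
    values = " ".join(str(v) for v in row.values()).lower()
    return sum(1 for w in words if w in values)

query = "Tell me about some books that are mafia-themed"
top = sorted(metadata_rows, key=lambda r: score(query, r), reverse=True)[:2]
context_str = "Relevant metadata from the collection:\n" + "\n".join(
    str(r) for r in top
)
print(context_str)
```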
# define retriever that fetches the top 2 examples.
metadata_retriever = index.as_retriever(similarity_top_k=2)
We use the same `prompt_tmpl_str` defined in the previous section.
from typing import List, Any


def format_additional_instrs(**kwargs: Any) -> str:
    """Format examples into a string."""
    nodes = metadata_retriever.retrieve(kwargs["query_str"])
    context_str = (
        "Here is the metadata of relevant entries from the database collection. "
        "This should help you infer the right filters: \n"
    )
    for node in nodes:
        context_str += str(node.node.metadata) + "\n"
    return context_str


ext_prompt_tmpl = PromptTemplate(
    prompt_tmpl_str,
    function_mappings={"additional_instructions": format_additional_instrs},
)
retriever.update_prompts({"prompt": ext_prompt_tmpl})
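`function_mappings` makes a template variable lazy: instead of a fixed string, it is computed by calling a function at format time with the other template kwargs. A minimal pure-Python sketch of the mechanism (hypothetical, not LlamaIndex internals):

```python
def render(template: str, function_mappings: dict, **kwargs) -> str:
    # Resolve mapped variables by calling their functions with all kwargs
    resolved = {name: fn(**kwargs) for name, fn in function_mappings.items()}
    return template.format(**kwargs, **resolved)


tmpl = "Query: {query_str}\nInstructions: {additional_instructions}"
out = render(
    tmpl,
    {"additional_instructions": lambda **kw: "Context for: " + kw["query_str"]},
    query_str="mafia books",
)
print(out)
```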
Re-run some queries¶
Now let's try rerunning some queries, and you'll see that the value is auto-inferred.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
for node in nodes:
    print(node.text)
    print(node.metadata)
Using query str: books
Using filters: [('theme', '==', 'Mafia')]
The Godfather
{'director': 'Francis Ford Coppola', 'theme': 'Mafia', 'year': 1972}
nodes = retriever.retrieve("Tell me some books authored by HARPER LEE")
for node in nodes:
    print(node.text)
    print(node.metadata)
Using query str: Books authored by Harper Lee
Using filters: [('author', '==', 'Harper Lee')]
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}