A Simple to Advanced Guide with Auto-Retrieval (with Pinecone + Arize Phoenix)¶
In this notebook we show you how to perform auto-retrieval against Pinecone, which lets you execute a variety of semi-structured queries beyond standard top-k semantic search.
We show both how to set up basic auto-retrieval, as well as how to extend it (with custom prompts and dynamic metadata retrieval).
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-pinecone
# !pip install llama-index>=0.9.31 scikit-learn==1.2.2 arize-phoenix==2.4.1 pinecone-client>=3.0.0
Part 1: Setting Up Auto-Retrieval¶
To set up auto-retrieval, do the following:
- We'll do some setup, load data, and build a Pinecone vector index.
- We'll define our autoretriever and run some sample queries.
- We'll use Phoenix to observe each trace and visualize the prompt inputs/outputs.
- We'll show you how to customize the auto-retrieval prompt.
1.a Set Up Pinecone/Phoenix, Load Data, and Build the Vector Index¶
In this section, we set up Pinecone and ingest some toy data on books/movies (with both text data and metadata).
We also set up Phoenix so that we can capture downstream traces.
# setup Phoenix
import phoenix as px
import llama_index.core
px.launch_app()
llama_index.core.set_global_handler("arize_phoenix")
🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
import os
os.environ[
    "PINECONE_API_KEY"
] = "<Your Pinecone API key, from app.pinecone.io>"
# os.environ["OPENAI_API_KEY"] = "sk-..."
from pinecone import Pinecone
from pinecone import ServerlessSpec
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)
# delete if needed
# pc.delete_index("quickstart-index")
# Dimensions are for text-embedding-ada-002
try:
    pc.create_index(
        "quickstart-index",
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )
except Exception as e:
    # Most likely index already exists
    print(e)
    pass
pinecone_index = pc.Index("quickstart-index")
Load documents, build the PineconeVectorStore and VectorStoreIndex¶
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core.schema import TextNode
nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
            "theme": "Fiction",
            "year": 2010,
        },
    ),
    TextNode(
        text="To Kill a Mockingbird",
        metadata={
            "author": "Harper Lee",
            "theme": "Fiction",
            "year": 1960,
        },
    ),
    TextNode(
        text="1984",
        metadata={
            "author": "George Orwell",
            "theme": "Totalitarianism",
            "year": 1949,
        },
    ),
    TextNode(
        text="The Great Gatsby",
        metadata={
            "author": "F. Scott Fitzgerald",
            "theme": "The American Dream",
            "year": 1925,
        },
    ),
    TextNode(
        text="Harry Potter and the Sorcerer's Stone",
        metadata={
            "author": "J.K. Rowling",
            "theme": "Fiction",
            "year": 1997,
        },
    ),
]
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="test",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
Upserted vectors: 0%| | 0/7 [00:00<?, ?it/s]
1.b Define the Autoretriever and Run Some Sample Queries¶
Setting up the VectorIndexAutoRetriever¶
One of the inputs is a schema describing what the vector store collection contains. This is similar to a table schema describing a table in a SQL database. This schema information is then injected into the prompt, which is passed to the LLM to infer what the full query should be (including metadata filters).
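Conceptually, the auto-retriever has the LLM turn a natural-language query plus this schema into a structured request: a free-text query string for semantic search and a set of metadata filters. A pure-Python sketch of the kind of JSON object the LLM emits (field names mirror the examples below; this is illustrative, not the library's internal representation):

```python
import json

# Hypothetical structured request the LLM might infer for:
# "Tell me about some movies after 2005 directed by Christopher Nolan"
structured_request = {
    "query": "movies",  # free-text portion used for semantic search
    "filters": [
        {"key": "year", "operator": ">", "value": 2005},
        {"key": "director", "operator": "==", "value": "Christopher Nolan"},
    ],
    "top_k": None,  # only set when the user asks for a specific number of results
}

print(json.dumps(structured_request, indent=2))
```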
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
    content_info="famous books and movies",
    metadata_info=[
        MetadataInfo(
            name="director",
            type="str",
            description=("Name of the director"),
        ),
        MetadataInfo(
            name="theme",
            type="str",
            description=("Theme of the book/movie"),
        ),
        MetadataInfo(
            name="year",
            type="int",
            description=("Year of the book/movie"),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    verbose=True,
)
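A note on the `default_empty_query_vector` hack above: a query like "books after 2000" may reduce to pure metadata filtering with no semantic text, but Pinecone still requires a query vector, so a zero vector matching the embedding dimension (1536 for `text-embedding-ada-002`) is supplied. A quick sanity check of that shape:

```python
# Zero vector used for blank queries; must match the index's embedding dimension
dim = 1536  # text-embedding-ada-002
default_empty_query_vector = [0] * dim

assert len(default_empty_query_vector) == dim
assert all(v == 0 for v in default_empty_query_vector)
```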
Run some queries¶
Let's run some example queries that take advantage of the structured information.
nodes = retriever.retrieve(
    "Tell me about some books/movies after the year 2000"
)
Using query str:
Using filters: [('year', '>', 2000)]
for node in nodes:
    print(node.text)
    print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
nodes = retriever.retrieve("Tell me about some books that are Fiction")
Using query str: Fiction
Using filters: [('theme', '==', 'Fiction')]
for node in nodes:
    print(node.text)
    print(node.metadata)
Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}
Pass in Additional Metadata Filters¶
If you have additional metadata filters you'd like to pass in that aren't inferred automatically, do the following.
from llama_index.core.vector_stores import MetadataFilters
filter_dicts = [{"key": "year", "operator": "==", "value": 1997}]
filters = MetadataFilters.from_dicts(filter_dicts)
retriever2 = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    extra_filters=filters,
)
nodes = retriever2.retrieve("Tell me about some books that are Fiction")
for node in nodes:
    print(node.text)
    print(node.metadata)
Harry Potter and the Sorcerer's Stone
{'author': 'J.K. Rowling', 'theme': 'Fiction', 'year': 1997}
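Conceptually, `extra_filters` is combined (ANDed) with whatever filters the LLM infers. A toy pure-Python sketch of that combination over the metadata above (illustrative only, not the library's implementation):

```python
# Toy metadata rows mirroring a few of the nodes above
rows = [
    {"text": "To Kill a Mockingbird", "theme": "Fiction", "year": 1960},
    {"text": "Harry Potter and the Sorcerer's Stone", "theme": "Fiction", "year": 1997},
    {"text": "Inception", "theme": "Fiction", "year": 2010},
]

inferred = [("theme", "==", "Fiction")]  # what the LLM inferred from the query
extra = [("year", "==", 1997)]  # the fixed extra_filters

def matches(row, filters):
    ops = {"==": lambda a, b: a == b, ">": lambda a, b: a > b}
    return all(ops[op](row.get(key), value) for key, op, value in filters)

# A row must satisfy both the inferred and the extra filter sets
hits = [r["text"] for r in rows if matches(r, inferred + extra)]
print(hits)  # only the 1997 Fiction entry survives
```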
Example of a failing query¶
Note that no results are retrieved! We'll fix this later on.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
Using query str: books
Using filters: [('theme', '==', 'mafia')]
for node in nodes:
    print(node.text)
    print(node.metadata)
Part 2: Extending Auto-Retrieval (with Dynamic Metadata Retrieval)¶
We now extend auto-retrieval by customizing the prompt. In the first part, we explicitly add some rules.
In the second part, we implement dynamic metadata retrieval, which performs a first-stage retrieval pass to fetch relevant metadata from the vector db and insert it as few-shot examples into the auto-retrieval prompt. (The second-stage retrieval pass then retrieves the actual items from the vector db.)
2.a Improve the Auto-retrieval Prompt¶
Our auto-retrieval prompt works, but it can be improved in various ways. Some examples include the fact that it contains 2 hardcoded few-shot examples (how can you include your own?), and that the auto-retrieval doesn't "always" infer the right metadata filters.
For instance, all the `theme` fields are capitalized. How do we tell the LLM that, so it doesn't erroneously infer a "theme" that's in lower-case?
Let's take a stab at modifying the prompt!
from llama_index.core.prompts import display_prompt_dict
from llama_index.core import PromptTemplate
prompts_dict = retriever.get_prompts()
display_prompt_dict(prompts_dict)
# look at required template variables.
prompts_dict["prompt"].template_vars
['schema_str', 'info_str', 'query_str']
Customize the Prompt¶
Let's customize the prompt a little bit. We do the following:
- Take out the first few-shot example to save tokens
- Add a message to always capitalize the first letter if inferring "theme".
Note that the prompt template expects `schema_str`, `info_str`, and `query_str` to be defined.
# write prompt template, and modify it.
prompt_tmpl_str = """\
Your goal is to structure the user's query to match the request schema provided below.
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:
{schema_str}
The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be applied return [] for the filter value.
If the user's query explicitly mentions number of documents to retrieve, set top_k to that number, otherwise do not set top_k.
Do NOT EVER infer a null value for a filter. This will break the downstream program. Instead, don't include the filter.
<< Example 1. >>
Data Source:
```json
{{
"metadata_info": [
{{
"name": "author",
"type": "str",
"description": "Author name"
}},
{{
"name": "book_title",
"type": "str",
"description": "Book title"
}},
{{
"name": "year",
"type": "int",
"description": "Year Published"
}},
{{
"name": "pages",
"type": "int",
"description": "Number of pages"
}},
{{
"name": "summary",
"type": "str",
"description": "A short summary of the book"
}}
],
"content_info": "Classic literature"
}}
```
User Query:
What are some books by Jane Austen published after 1813 that explore the theme of marriage for social standing?
Additional Instructions:
None
Structured Request:
```json
{{"query": "Books related to theme of marriage for social standing", "filters": [{{"key": "year", "value": "1813", "operator": ">"}}, {{"key": "author", "value": "Jane Austen", "operator": "=="}}], "top_k": null}}
```
<< Example 2. >>
Data Source:
```json
{info_str}
```
User Query:
{query_str}
Additional Instructions:
{additional_instructions}
Structured Request:
"""
prompt_tmpl = PromptTemplate(prompt_tmpl_str)
You'll notice that we added an `additional_instructions` template variable. This allows us to insert instructions specific to the vector collection.
We'll use `partial_format` to add the instructions.
add_instrs = """\
If one of the filters is 'theme', please make sure that the first letter of the inferred value is capitalized. Only words that are capitalized are valid values for "theme". \
"""
prompt_tmpl = prompt_tmpl.partial_format(additional_instructions=add_instrs)
retriever.update_prompts({"prompt": prompt_tmpl})
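`partial_format` fills in some template variables now while leaving the rest to be supplied at query time. The idea can be sketched in plain Python with a hypothetical helper (not LlamaIndex's implementation):

```python
def partial_format(template: str, **known) -> str:
    """Fill in known variables, leaving unknown {placeholders} intact."""

    class KeepMissing(dict):
        def __missing__(self, key):
            return "{" + key + "}"

    return template.format_map(KeepMissing(**known))


tmpl = "Schema: {schema_str}\nInstructions: {additional_instructions}"
partially = partial_format(
    tmpl, additional_instructions="Capitalize inferred 'theme' values."
)
print(partially)  # {schema_str} remains an open placeholder
```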
Re-run some queries¶
Now let's try rerunning some queries, and you'll see that the value is auto-inferred.
nodes = retriever.retrieve(
    "Tell me about some books that are friendship-themed"
)
for node in nodes:
    print(node.text)
    print(node.metadata)
2.b Implement Dynamic Metadata Retrieval¶
An option besides hardcoding rules in the prompt is to retrieve relevant few-shot examples of metadata, to help the LLM better infer the correct metadata filters.
This will better prevent the LLM from making mistakes when inferring "where" clauses, especially around aspects like spelling / correct formatting of the value.
We can do this via vector retrieval. The existing vector db collection stores the raw text + metadata; we could query this collection directly, or separately index only the metadata and retrieve from that. In this section we choose the former, but in practice you may want the latter.
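The two-stage idea can be sketched in plain Python: a first pass scores stored metadata against the query (here by naive word overlap rather than embeddings), and the top entries become few-shot context for the prompt. This is illustrative only; the cells below use the actual vector index for the first stage.

```python
import re

# Toy metadata mirroring a few of the stored nodes
metadata_rows = [
    {"director": "Francis Ford Coppola", "theme": "Mafia", "year": 1972},
    {"author": "Harper Lee", "theme": "Fiction", "year": 1960},
    {"author": "George Orwell", "theme": "Totalitarianism", "year": 1949},
]

def score(query: str, row: dict) -> int:
    # Naive relevance: count query words appearing in the metadata values
    words = set(re.findall(r"\w+", query.lower()))
    values = " ".join(str(v) for v in row.values()).lower()
    return sum(1 for w in words if w in values)

query = "Tell me about some books that are mafia-themed"
top = sorted(metadata_rows, key=lambda r: score(query, r), reverse=True)[:2]
context_str = "Relevant metadata from the collection:\n" + "\n".join(
    str(r) for r in top
)
print(context_str)
```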
# define retriever that fetches the top 2 examples.
metadata_retriever = index.as_retriever(similarity_top_k=2)
We use the same `prompt_tmpl_str` defined in the previous section.
from typing import List, Any


def format_additional_instrs(**kwargs: Any) -> str:
    """Format examples into a string."""
    nodes = metadata_retriever.retrieve(kwargs["query_str"])
    context_str = (
        "Here is the metadata of relevant entries from the database collection. "
        "This should help you infer the right filters: \n"
    )
    for node in nodes:
        context_str += str(node.node.metadata) + "\n"
    return context_str


ext_prompt_tmpl = PromptTemplate(
    prompt_tmpl_str,
    function_mappings={"additional_instructions": format_additional_instrs},
)
retriever.update_prompts({"prompt": ext_prompt_tmpl})
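`function_mappings` makes a template variable lazy: instead of a fixed string, it is computed by calling a function at format time with the other template kwargs. A minimal pure-Python sketch of the mechanism (hypothetical, not LlamaIndex internals):

```python
def render(template: str, function_mappings: dict, **kwargs) -> str:
    # Resolve mapped variables by calling their functions with all kwargs
    resolved = {name: fn(**kwargs) for name, fn in function_mappings.items()}
    return template.format(**kwargs, **resolved)


tmpl = "Query: {query_str}\nInstructions: {additional_instructions}"
out = render(
    tmpl,
    {"additional_instructions": lambda **kw: "Context for: " + kw["query_str"]},
    query_str="mafia books",
)
print(out)
```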
Re-run some queries¶
Now let's try rerunning some queries, and you'll see that the value is auto-inferred.
nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
for node in nodes:
    print(node.text)
    print(node.metadata)
Using query str: books
Using filters: [('theme', '==', 'Mafia')]
The Godfather
{'director': 'Francis Ford Coppola', 'theme': 'Mafia', 'year': 1972}
nodes = retriever.retrieve("Tell me some books authored by HARPER LEE")
for node in nodes:
    print(node.text)
    print(node.metadata)
Using query str: Books authored by Harper Lee
Using filters: [('author', '==', 'Harper Lee')]
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}