Alibaba Cloud OpenSearch Vector Store¶
Alibaba Cloud OpenSearch Vector Search Edition is a large-scale distributed search engine developed by Alibaba Group. It provides search services for the entire Alibaba Group — including Taobao, Tmall, Cainiao, and Youku — as well as for other e-commerce platforms serving customers outside mainland China. It is also the underlying engine of Alibaba Cloud OpenSearch. After years of development, it has come to meet business requirements for high availability, high timeliness, and cost-effectiveness, and it provides an automated operations and maintenance system on which you can build custom search services tailored to your business.
To run this notebook, you need an Alibaba Cloud OpenSearch Vector Search Edition instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-alibabacloud-opensearch
%pip install llama-index
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Please provide your OpenAI access key¶
To use OpenAI embeddings, you need to supply an OpenAI API key:
import getpass

import openai
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
Download data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents¶
from llama_index.core import SimpleDirectoryReader
from IPython.display import Markdown, display
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
Total documents: 1
Create the Alibaba Cloud OpenSearch Vector Store object:¶
To run the next step, you should have an Alibaba Cloud OpenSearch Vector Service instance with a table configured.
# if running the following cells raises an asyncio exception, run this
import nest_asyncio
nest_asyncio.apply()
# initialize without metadata filter
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)

config = AlibabaCloudOpenSearchConfig(
    endpoint="*****",
    instance_id="*****",
    username="your_username",
    password="your_password",
    table_name="llama",
)

vector_store = AlibabaCloudOpenSearchStore(config)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Query the index¶
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
The author worked on writing and programming before college. They wrote short stories in 9th grade and tried writing programs on an IBM 1401 using an early version of Fortran.
Connecting to an existing store¶
Since this store is backed by Alibaba Cloud OpenSearch, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)

config = AlibabaCloudOpenSearchConfig(
    endpoint="***",
    instance_id="***",
    username="your_username",
    password="your_password",
    table_name="llama",
)
vector_store = AlibabaCloudOpenSearchStore(config)
# Create index from existing stored vectors
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
display(Markdown(f"<b>{response}</b>"))
Metadata filtering¶
The Alibaba Cloud OpenSearch vector store supports metadata filtering at query time. The following cells, which work on a brand-new table, demonstrate this feature.
In this demo, for the sake of brevity, only a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the documents, to illustrate how queries can be restricted with conditions on the metadata attached to the documents.
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)

config = AlibabaCloudOpenSearchConfig(
    endpoint="****",
    instance_id="****",
    username="your_username",
    password="your_password",
    table_name="llama",
)

md_storage_context = StorageContext.from_defaults(
    vector_store=AlibabaCloudOpenSearchStore(config)
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()

md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)
Add filters to the query engine:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[MetadataFilter(key="source_type", value="essay")]
    )
)

md_response = md_query_engine.query(
    "How long did it take the author to write his thesis?"
)
display(Markdown(f"<b>{md_response}</b>"))
To verify that the filtering really is at play, try changing the filter to use only "dinos" documents... this time there will be no answer :)
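The effect of such a filter can be illustrated with a plain-Python sketch that needs no OpenSearch instance; the node list and `filter_nodes` helper below are illustrative mocks, not LlamaIndex or OpenSearch internals:

```python
# Mock "stored nodes": each carries text plus the custom metadata that
# my_file_metadata would have attached. Only essay documents were ingested.
nodes = [
    {
        "text": "Before college I worked on writing and programming.",
        "metadata": {"source_type": "essay"},
    },
    {
        "text": "I wrote what beginning writers were supposed to write.",
        "metadata": {"source_type": "essay"},
    },
]


def filter_nodes(nodes, key, value):
    """Keep only the nodes whose metadata exactly matches key == value."""
    return [n for n in nodes if n["metadata"].get(key) == value]


print(len(filter_nodes(nodes, "source_type", "essay")))  # 2
print(len(filter_nodes(nodes, "source_type", "dinos")))  # 0
```

With `source_type="dinos"` no nodes survive the filter, so nothing reaches the LLM and the query engine has no material to answer from — which is exactly the empty result the real query produces above.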