Couchbase Vector Store¶
Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for all of your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI by providing coding assistance for developers and vector search for their applications.
Vector search is part of the Full Text Search service (Search service) in Couchbase.
This tutorial explains how to use vector search in Couchbase. You can work with either Couchbase Capella or your self-managed Couchbase Server.
Installation¶
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-couchbase
!pip install llama-index
Create Couchbase Connection¶
We create the connection to the Couchbase cluster first and then pass the cluster object to the vector store.
Here, we are connecting using a username and password. You can also connect to your cluster using any other supported mechanism.
For more information on connecting to the Couchbase cluster, please check the Python SDK documentation.
COUCHBASE_CONNECTION_STRING = (
"couchbase://localhost" # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
from datetime import timedelta
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)
# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))
Create the Search Index¶
Currently, the search index needs to be created from the Couchbase Capella or Server UI, or via the REST interface.
Let us define a search index named vector-index on the testing bucket.
For this example, we use the Import Index feature of the Search service in the UI.
We are defining an index on the _default collection in the _default scope of the testing bucket, with the vector field set to embedding (1536 dimensions) and the text field set to text. We also index and store all fields under metadata in the document as a dynamic mapping, to account for varying document structures. The similarity metric is set to dot_product.
How do I import the index into the Full Text Search service?¶
- Couchbase Server
  - Click on Search -> Add Index -> Import
  - Copy the index definition below into the Import screen
  - Click on Create Index to create the index.
- Couchbase Capella
  - Copy the index definition below into a new file, index.json
  - Import the file in Capella using the instructions in the documentation.
  - Click on Create Index to create the index.
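If you prefer the REST interface mentioned above over the UI, the index can be created with a PUT request against the Search service. The sketch below uses only the standard library; the host, port (8094 is the default Search REST port on Couchbase Server), and credentials are assumptions to replace with your own, and the abbreviated definition stands in for the full one below.

```python
import base64
import json
import urllib.request

# Assumed values -- substitute your own host, credentials, and the
# full index definition shown below.
FTS_HOST = "http://localhost:8094"  # default Search service REST port
INDEX_NAME = "vector-index"
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
index_definition = {"name": INDEX_NAME, "type": "fulltext-index"}


def build_import_request(host, index_name, definition, username, password):
    """Build a PUT request for the Search service REST API."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{host}/api/index/{index_name}",
        data=json.dumps(definition).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_import_request(
    FTS_HOST, INDEX_NAME, index_definition, DB_USERNAME, DB_PASSWORD
)
# Sending it requires a running cluster:
# with urllib.request.urlopen(request) as resp:
#     print(resp.status)
```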
Index Definition¶
{
"name": "vector-index",
"type": "fulltext-index",
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "type_field",
"type_field": "type"
},
"mapping": {
"default_analyzer": "standard",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"dynamic": true,
"enabled": true,
"properties": {
"metadata": {
"dynamic": true,
"enabled": true
},
"embedding": {
"enabled": true,
"dynamic": false,
"fields": [
{
"dims": 1536,
"index": true,
"name": "embedding",
"similarity": "dot_product",
"type": "vector",
"vector_index_optimized_for": "recall"
}
]
},
"text": {
"enabled": true,
"dynamic": false,
"fields": [
{
"index": true,
"name": "text",
"store": true,
"type": "text"
}
]
}
}
},
"default_type": "_default",
"docvalues_dynamic": false,
"index_dynamic": true,
"store_dynamic": true,
"type_field": "_type"
},
"store": {
"indexType": "scorch",
"segmentVersion": 16
}
},
"sourceType": "gocbcore",
"sourceName": "testing",
"sourceParams": {},
"planParams": {
"maxPartitionsPerPIndex": 103,
"indexPartitions": 10,
"numReplicas": 0
}
}
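Before importing, it can help to sanity-check that the definition's vector settings match the embedding model you plan to use: the LlamaIndex default, OpenAI's text-embedding-ada-002, produces 1536-dimensional vectors. A small, self-contained sketch over an abbreviated copy of the definition:

```python
# Abbreviated copy of the index definition above, keeping only the
# fields that matter for vector search.
index_definition = {
    "name": "vector-index",
    "type": "fulltext-index",
    "params": {
        "mapping": {
            "default_mapping": {
                "properties": {
                    "embedding": {
                        "fields": [
                            {
                                "dims": 1536,
                                "name": "embedding",
                                "similarity": "dot_product",
                                "type": "vector",
                            }
                        ]
                    }
                }
            }
        }
    },
    "sourceName": "testing",
}


def vector_field_config(definition: dict) -> dict:
    """Return the vector field settings from a search index definition."""
    props = definition["params"]["mapping"]["default_mapping"]["properties"]
    return props["embedding"]["fields"][0]


cfg = vector_field_config(index_definition)
# The dimension must match the embedding model's output size (1536 for
# text-embedding-ada-002), and the metric must match the index above.
assert cfg["dims"] == 1536 and cfg["similarity"] == "dot_product"
```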
We will now set the bucket, scope, and collection names in the Couchbase cluster that we want to use for vector search.
For this example, we are using the default scope and collection.
BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"
# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseVectorStore
For this tutorial, we will use OpenAI embeddings.
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-04-09 23:31:46-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘data/paul_graham/paul_graham_essay.txt’ data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.008s 2024-04-09 23:31:46 (8.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Load the Documents¶
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
vector_store = CouchbaseVectorStore(
cluster=cluster,
bucket_name=BUCKET_NAME,
scope_name=SCOPE_NAME,
collection_name=COLLECTION_NAME,
index_name=SEARCH_INDEX_NAME,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Basic Example¶
We will ask the query engine a question about the essay we just indexed.
query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were $6k per founder, totaling $12k in the typical two-founder case, in return for 6% equity.
Metadata Filters¶
We will create some example documents with metadata so that we can see how to filter documents based on metadata.
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text="The Shawshank Redemption",
metadata={
"author": "Stephen King",
"theme": "Friendship",
},
),
TextNode(
text="The Godfather",
metadata={
"director": "Francis Ford Coppola",
"theme": "Mafia",
},
),
TextNode(
text="Inception",
metadata={
"director": "Christopher Nolan",
},
),
]
vector_store.add(nodes)
['5abb42cf-7312-46eb-859e-60df4f92842a', 'b90525f4-38bf-453c-a51a-5f0718bccc98', '22f732d0-da17-4bad-b3cd-b54e2102367a']
# Metadata filter
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(
filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[NodeWithScore(node=TextNode(id_='b90525f4-38bf-453c-a51a-5f0718bccc98', embedding=None, metadata={'director': 'Francis Ford Coppola', 'theme': 'Mafia'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='The Godfather', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3068528194400547)]
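Notice that only "The Godfather" is returned, even though the question asks about Inception. Conceptually, each exact-match filter becomes a Search-service predicate over the indexed metadata fields. The sketch below is a rough, hypothetical illustration of that translation, not the library's actual implementation:

```python
def filters_to_search_query(filters):
    """Hypothetical sketch: translate (key, value) exact-match filters
    into a Search-service conjunction query over metadata.* fields."""
    return {
        "conjuncts": [
            {"match": value, "field": "metadata." + key}
            for key, value in filters
        ]
    }


# One filter, matching the example above: theme == "Mafia"
print(filters_to_search_query([("theme", "Mafia")]))
# -> {'conjuncts': [{'match': 'Mafia', 'field': 'metadata.theme'}]}
```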
Custom Filters and Overriding the Search Query¶
The vector store also accepts raw Search options via cb_search_options, and a custom_query callable that lets you inspect or modify the Search request before it is sent to Couchbase.
def custom_query(query, query_str):
    # Inspect the Search query before it is executed, then return it unchanged.
    print("custom query", query)
    return query
query_engine = index.as_query_engine(
vector_store_kwargs={
"cb_search_options": {
"query": {"match": "growing up", "field": "text"}
},
"custom_query": custom_query,
}
)
response = query_engine.query("what were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were based on a combination of the deal he did with Julian ($10k for 10%) and what Robert said MIT grad students got for the summer ($6k). He invested $6k per founder, which in the typical two-founder case was $12k, in return for 6%.