ClickHouse Vector Store¶
In this notebook we show how to use the ClickHouseVectorStore.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]
!pip install llama-index
!pip install clickhouse_connect
Creating a ClickHouse Client¶
In [ ]
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]
from os import environ
import clickhouse_connect
environ["OPENAI_API_KEY"] = "sk-*"
# initialize client
client = clickhouse_connect.get_client(
host="localhost",
port=8123,
username="default",
password="",
)
Loading documents, building and storing the VectorStoreIndex with ClickHouseVectorStore¶
Here we use a set of Paul Graham essays to provide the text to turn into embeddings, store in the
ClickHouseVectorStore, and query to find context for our LLM question-answering loop.
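Under the hood, the vector store ranks stored embeddings by their similarity to the query embedding and returns the closest matches as context. A minimal pure-Python sketch of that retrieval step (cosine similarity over an in-memory list, standing in for the distance query ClickHouse runs server-side):

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, stored, k=2):
    # stored: list of (doc_id, embedding) pairs; rank by similarity, descending
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


stored = [("doc-a", [1.0, 0.0]), ("doc-b", [0.0, 1.0]), ("doc-c", [0.7, 0.7])]
print(top_k([1.0, 0.1], stored, k=2))  # doc-a ranks first, then doc-c
```

The real store computes the same ranking inside ClickHouse over the embedding column, so only the top-k rows travel back to the client.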
In [ ]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore
In [ ]
# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents: ", len(documents))
Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents: 1
Download Data¶
In [ ]
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-02-13 10:08:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.003s

2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
You can process your files individually using SimpleDirectoryReader:
In [ ]
loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
print(file)
# Here is where you would do any preprocessing
data/paul_graham/paul_graham_essay.txt
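To illustrate the preprocessing hook in the cell above, here is a stdlib-only sketch that walks a directory the way the loader's `input_files` loop does. The whitespace-collapsing step is a hypothetical placeholder for whatever cleaning your documents need:

```python
import tempfile
from pathlib import Path


def preprocess(text: str) -> str:
    # hypothetical cleaning step: collapse runs of whitespace and strip edges
    return " ".join(text.split())


# stand-in for loader.input_files: every .txt file under a directory
with tempfile.TemporaryDirectory() as d:
    Path(d, "essay.txt").write_text("  What I  worked on\n\n  ")
    for file in sorted(Path(d).glob("*.txt")):
        cleaned = preprocess(file.read_text())
        print(file.name, "->", repr(cleaned))  # essay.txt -> 'What I worked on'
```

In the real pipeline you would apply the cleaning to each `Document`'s text before building the index.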
In [ ]
# initialize with metadata filter and store indexes
from llama_index.core import StorageContext
for document in documents:
document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
In [ ]
import textwrap
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
filters=MetadataFilters(
filters=[
ExactMatchFilter(key="user_id", value="123"),
]
),
similarity_top_k=2,
vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned several things during their time at Interleaf, including the importance of having technology companies run by product people rather than sales people, the drawbacks of having too many people edit code, the value of corridor conversations over planned meetings, the challenges of dealing with big bureaucratic customers, and the importance of being the "entry level" option in a market.
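The ExactMatchFilter above restricts retrieval to nodes whose metadata matches the given key/value exactly. Conceptually it acts as a pre-filter applied before similarity ranking; a plain-Python approximation (not the actual ClickHouse implementation, which pushes the predicate into the SQL query):

```python
def exact_match_filter(nodes, key, value):
    # keep only nodes whose metadata[key] equals value exactly
    return [n for n in nodes if n.get("metadata", {}).get(key) == value]


nodes = [
    {"id": "n1", "metadata": {"user_id": "123", "favorite_color": "blue"}},
    {"id": "n2", "metadata": {"user_id": "456", "favorite_color": "red"}},
]
filtered = exact_match_filter(nodes, "user_id", "123")
print([n["id"] for n in filtered])  # ['n1']
```

Because only the filtered candidates are ranked, `similarity_top_k=2` returns the two most similar nodes *among* those whose `user_id` is `"123"`.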
Clear All Indexes¶
In [ ]
for document in documents:
index.delete_ref_doc(document.doc_id)