Chroma¶

Chroma is an AI-native, open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

Chroma is fully typed, fully tested, and fully documented.

Install Chroma with:

pip install chromadb

Chroma runs in several modes. The examples below show how to integrate each of them with LlamaIndex.

in-memory - in a Python script or Jupyter notebook
in-memory with persistence - in a script or notebook, saving data to and loading it from disk
in a Docker container - as a server running on your local machine or in the cloud

Like any other database, you can:

.add
.get
.update
.upsert
.delete
.peek
and .query, which runs a similarity search.
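
To make the semantics of these operations concrete, here is a minimal sketch of a toy in-memory collection written with only the Python standard library. It is not Chroma's implementation, and the class name `ToyCollection` and its method signatures are assumptions made for illustration. In this toy, `add` rejects duplicate ids, `upsert` overwrites them, and `query` ranks rows by cosine similarity; the real library's behavior may differ in the details.

```python
import math


class ToyCollection:
    """A toy, in-memory stand-in for a vector collection (not Chroma's code)."""

    def __init__(self):
        self._rows = {}  # id -> (embedding, document)

    def add(self, ids, embeddings, documents):
        # In this toy, adding an id that already exists is an error.
        for id_, emb, doc in zip(ids, embeddings, documents):
            if id_ in self._rows:
                raise ValueError(f"id already exists: {id_}")
            self._rows[id_] = (list(emb), doc)

    def upsert(self, ids, embeddings, documents):
        # Insert new ids and overwrite existing ones.
        for id_, emb, doc in zip(ids, embeddings, documents):
            self._rows[id_] = (list(emb), doc)

    def get(self, ids):
        return [self._rows[id_][1] for id_ in ids if id_ in self._rows]

    def delete(self, ids):
        for id_ in ids:
            self._rows.pop(id_, None)

    def count(self):
        return len(self._rows)

    def query(self, query_embedding, n_results=1):
        # Rank stored rows by cosine similarity to the query, highest first.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._rows.items(),
            key=lambda item: cosine(query_embedding, item[1][0]),
            reverse=True,
        )
        return [id_ for id_, _ in ranked[:n_results]]


collection = ToyCollection()
collection.add(
    ids=["a", "b"],
    embeddings=[[1.0, 0.0], [0.0, 1.0]],
    documents=["about cats", "about dogs"],
)
collection.upsert(ids=["b"], embeddings=[[0.0, 1.0]], documents=["about puppies"])
nearest = collection.query([0.9, 0.1], n_results=1)
```

The query vector points almost entirely along the first axis, so the row stored under id "a" is the closest match.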

View the full documentation at docs.

Basic Example¶

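In this basic example, we take a Paul Graham essay, split it into chunks, embed the chunks with an open-source embedding model, load them into Chroma, and then query the result.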
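
The chunking step can be pictured with a short sketch. The function below splits text into fixed-size character windows that overlap slightly, so that a sentence cut at a boundary still appears whole in one of the chunks. The function name and the `chunk_size` and `overlap` values are assumptions chosen for illustration; LlamaIndex's own splitting is more sophisticated and is applied automatically when the index is built.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into overlapping character windows (a simplified illustration)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


essay = "word " * 100  # 500 characters of placeholder text
chunks = split_into_chunks(essay, chunk_size=200, overlap=20)
```

Each chunk starts 180 characters after the previous one, so the last 20 characters of one chunk are repeated at the start of the next.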

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-vector-stores-chroma
%pip install llama-index-embeddings-huggingface

!pip install llama-index

Creating a Chroma Index¶

# !pip install llama-index chromadb --quiet
# !pip install chromadb
# !pip install sentence-transformers
# !pip install pydantic==1.10.11

# import
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from IPython.display import Markdown, display
import chromadb

# set up OpenAI
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")

# define embedding function
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "

'NoneType' object has no attribute 'cadam32bit_grad_fp32'

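The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming more extensively.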

Basic Example (including saving to disk)¶

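Extending the previous example, if you want to save to disk, simply initialize the Chroma client and pass in the directory where you want the data to be saved.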

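Caution: Chroma makes a best effort to save data to disk automatically, but multiple in-memory clients can interfere with each other's work. As a best practice, run only one client per path at any given time.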
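
One way to enforce the one-client-per-path practice in your own scripts is an exclusive lock file. The sketch below is an assumption about how such a guard might look and is not a Chroma feature: the `PathLock` class and the `.client.lock` file name are hypothetical, and a crashed process would leave a stale lock file that has to be removed by hand.

```python
import os
import tempfile


class PathLock:
    """Hypothetical guard: allow one holder per directory via an exclusive lock file."""

    def __init__(self, directory):
        self.lock_path = os.path.join(directory, ".client.lock")
        self._fd = None

    def __enter__(self):
        # O_CREAT | O_EXCL fails if the file already exists,
        # so a second holder for the same directory is rejected.
        self._fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        return self

    def __exit__(self, exc_type, exc, tb):
        os.close(self._fd)
        os.remove(self.lock_path)
        return False


workdir = tempfile.mkdtemp()
second_client_rejected = False
with PathLock(workdir):
    # A real script would create its persistent Chroma client for this path here.
    try:
        with PathLock(workdir):
            pass
    except FileExistsError:
        second_client_rejected = True
```

The inner `with` block fails because the outer one still holds the lock; once the outer block exits, the lock file is removed and the path can be claimed again.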

# save to disk

db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# load from disk
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

# Query Data from the persisted index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

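The author worked on writing and programming growing up. They wrote short stories and tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming games and a word processor.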

Basic Example (using the Docker Container)¶

You can also run the Chroma server in a separate Docker container, create a client that connects to it, and then pass that client to LlamaIndex.

Here is how to clone, build, and run the Docker image:

git clone git@github.com:chroma-core/chroma.git
cd chroma
docker-compose up -d --build
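
Once the server is running, the client needs to know where to reach it. `chromadb.HttpClient()` with no arguments connects to `localhost` on port `8000`; if the container runs elsewhere, the host and port can be passed explicitly. The sketch below reads them from environment variables. The variable names `CHROMA_HOST` and `CHROMA_PORT` and the helper `chroma_endpoint` are assumptions made for this illustration, not settings defined by Chroma.

```python
import os


def chroma_endpoint(env=None):
    """Return (host, port) for a Chroma server, read from hypothetical env var names."""
    env = os.environ if env is None else env
    host = env.get("CHROMA_HOST", "localhost")
    port = int(env.get("CHROMA_PORT", "8000"))
    return host, port


host, port = chroma_endpoint({"CHROMA_HOST": "chroma.internal", "CHROMA_PORT": "9000"})
default_host, default_port = chroma_endpoint({})

# With chromadb installed and a server reachable, the values would be passed as:
# remote_db = chromadb.HttpClient(host=host, port=port)
```

Keeping the address out of the code makes it easier to point the same notebook at a local container during development and at a cloud deployment later.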

# create the chroma client and add our data
import chromadb

remote_db = chromadb.HttpClient()
chroma_collection = remote_db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Query Data from the Chroma Docker index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

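Growing up, the author wrote short stories, programmed on an IBM 1401, and wrote programs on a TRS-80 microcomputer. He also took painting classes at Harvard and worked as a de facto studio assistant for a painter. He also tried to start a company to put art galleries online, and wrote software to build online stores.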

Update and Delete¶

When building a real application, you need to go beyond adding data and also update and delete it.

Chroma has users provide ids to simplify the bookkeeping here. An id can be the name of a file, or a combined value such as filename_paragraphNumber, a hash, and so on.
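
As an illustration of the id schemes mentioned above, the following sketch builds a stable id from a file name and a paragraph number, and optionally appends a short content hash so that the id changes when the text changes. The helper name `make_chunk_id` and the 12-character hash length are assumptions for this example. LlamaIndex generates its own node ids when it builds an index, so a helper like this is only relevant when you manage ids yourself.

```python
import hashlib


def make_chunk_id(filename, paragraph_number, text=None):
    """Build a stable id such as 'essay.txt_3', optionally suffixed with a content hash."""
    base = f"{filename}_{paragraph_number}"
    if text is None:
        return base
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{base}_{digest}"


plain_id = make_chunk_id("paul_graham_essay.txt", 3)
hashed_id = make_chunk_id("paul_graham_essay.txt", 3, text="What I Worked On")
```

Because the same inputs always produce the same id, re-ingesting an unchanged file can target existing records with an upsert instead of creating duplicates.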

Here is a basic example showing how to perform various operations:

# fetch one stored record and merge a new "author" field into its metadata
doc_to_update = chroma_collection.get(limit=1)
doc_id = doc_to_update["ids"][0]
updated_metadata = {**doc_to_update["metadatas"][0], "author": "Paul Graham"}
chroma_collection.update(ids=[doc_id], metadatas=[updated_metadata])

# read the same record back by id to confirm the update
updated_doc = chroma_collection.get(ids=[doc_id])
print(updated_doc["metadatas"][0])

# delete the record we just updated
print("count before", chroma_collection.count())
chroma_collection.delete(ids=[doc_id])
print("count after", chroma_collection.count())

{'_node_content': '{"id_": "be08c8bc-f43e-4a71-ba64-e525921a8319", "embedding": null, "metadata": {}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "2cbecdbb-0840-48b2-8151-00119da0995b", "node_type": null, "metadata": {}, "hash": "4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35"}, "3": {"node_id": "6a75604a-fa76-4193-8f52-c72a7b18b154", "node_type": null, "metadata": {}, "hash": "d6c408ee1fbca650fb669214e6f32ffe363b658201d31c204e85a72edb71772f"}}, "hash": "b4d0b960aa09e693f9dc0d50ef46a3d0bf5a8fb3ac9f3e4bcf438e326d17e0d8", "text": "", "start_char_idx": 0, "end_char_idx": 4050, "text_template": "{metadata_str}\\n\\n{content}", "metadata_template": "{key}: {value}", "metadata_seperator": "\\n"}', 'author': 'Paul Graham', 'doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'document_id': '2cbecdbb-0840-48b2-8151-00119da0995b', 'ref_doc_id': '2cbecdbb-0840-48b2-8151-00119da0995b'}
count before 20
count after 19