As a prerequisite, you need a running Epsilla vector database (for example, via our Docker image) and the pyepsilla package installed. See the full documentation in the docs.
%pip install llama-index-vector-stores-epsilla
!pip install pyepsilla
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
import logging
import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.epsilla import EpsillaVectorStore
import textwrap
Set up OpenAI
First, add your OpenAI API key. It will be used to create embeddings for the documents loaded into the index.
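If you prefer not to type the key interactively, the openai client also reads the `OPENAI_API_KEY` environment variable automatically. An illustrative sketch (the placeholder value below is a dummy, not a real key):

```python
import os

# setdefault keeps any key that is already set in the environment;
# the fallback value here is only a placeholder for illustration.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
api_key = os.environ["OPENAI_API_KEY"]
```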
import openai
import getpass
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
Download data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents
Load the documents stored in the ./data/paul_graham/ folder using SimpleDirectoryReader.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
Total documents: 1 First document, id: ac7f23f0-ce15-4d94-a0a2-5020fa87df61 First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
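The id printed above is a generated UUID, and the hash is a digest of the document's content. Purely as an illustration (llama-index's exact hashing scheme may differ), a SHA-256 content digest of the same shape can be computed like this:

```python
import hashlib

# Stand-in text; in the notebook this would be the essay's contents
text = "Epsilla example document"
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(digest)  # 64 hex characters, like the hash printed above
```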
Create the index
Here we create an index backed by Epsilla using the documents loaded previously. EpsillaVectorStore takes a few arguments.
client (Any): Epsilla client to connect to.
collection_name (str, optional): Which collection to use. Defaults to "llama_collection".
db_path (str, optional): The path where the database will be persisted. Defaults to "/tmp/langchain-epsilla".
db_name (str, optional): A name for the loaded database. Defaults to "langchain_store".
dimension (int, optional): The dimension of the embeddings. If not provided, the collection will be created on the first insert. Defaults to None.
overwrite (bool, optional): Whether to overwrite an existing collection with the same name. Defaults to False.
The Epsilla vector database is running with the default host "localhost" and port "8888".
# Create an index over the documents
from pyepsilla import vectordb
client = vectordb.Client()
vector_store = EpsillaVectorStore(client=client, db_path="/tmp/llamastore")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
[INFO] Connected to localhost:8888 successfully.
Query the data
Now that our documents are stored in the index, we can ask questions against it.
query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
response = query_engine.query("How did the author learn about AI?")
print(textwrap.fill(str(response), 100))
The author learned about AI through various sources. One source was a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Another source was a PBS documentary that showed Terry Winograd using SHRDLU, a program that could understand natural language. These experiences sparked the author's interest in AI and motivated them to start learning about it, including teaching themselves Lisp, which was regarded as the language of AI at the time.
Next, let's try overwriting the previous data.
vector_store = EpsillaVectorStore(client=client, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
single_doc = Document(text="Epsilla is the vector database we are using.")
index = VectorStoreIndex.from_documents(
[single_doc],
storage_context=storage_context,
)
query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
There is no information provided about the author in the given context.
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.
Next, let's add more data to the existing collection.
vector_store = EpsillaVectorStore(client=client, overwrite=False)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
for doc in documents:
index.insert(document=doc)
query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
response = query_engine.query("What vector database is being used?") print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.