LanceDB Vector Store

In this notebook we show how to use LanceDB to perform vector searches in LlamaIndex.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index llama-index-vector-stores-lancedb

%pip install lancedb==0.6.13 #Only required if the above cell installs an older version of lancedb (pypi package may not be released yet)

# Refresh vector store URI if restarting or re-using the same notebook
! rm -rf ./lancedb

import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
import textwrap

Setup OpenAI

The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.

import openai

openai.api_key = "sk-"
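
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been exported in an environment variable named OPENAI_API_KEY; the helper name `read_api_key` is hypothetical and not part of the openai package:

```python
import os


def read_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage (assumes the variable is set in your shell or notebook environment):
# import openai
# openai.api_key = read_api_key()
```

Raising early gives a clearer error than a failed embedding request later in the notebook.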

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-11 16:42:37--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.02s   

2024-06-11 16:42:37 (3.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]

Loading documents

Load the documents stored in data/paul_graham/ using the SimpleDirectoryReader.

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print("Document ID:", documents[0].doc_id, "Document Hash:", documents[0].hash)

Document ID: cac1ba78-5007-4cf8-89ba-280264790115 Document Hash: fe2d4d3ef3a860780f6c2599808caa587c8be6516fe0ba4ca53cf117044ba953

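Create the index

Here we create an index backed by LanceDB using the documents loaded previously. LanceDBVectorStore takes a few arguments.

uri (str, required): Location where LanceDB will store its files.
table_name (str, optional): The name of the table where the embeddings will be stored. Defaults to "vectors".
nprobes (int, optional): The number of probes used. A higher number makes search more accurate but also slower. Defaults to 20.
refine_factor (int, optional): Refine the results by reading extra elements and re-ranking them in memory. Defaults to None.

More details can be found in the LanceDB documentation.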
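
To make the idea behind refine_factor concrete, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions, not LanceDB's implementation: rounding to the nearest integer stands in for a lossy index, the data is one-dimensional, and the function name `search` is hypothetical.

```python
def search(values, query, k, refine_factor=None):
    """Toy 1-D nearest-neighbour search over coarsely quantized values.

    Illustration only: rounding stands in for a lossy index; this is
    not how LanceDB stores or scores vectors.
    """
    coarse = [round(v) for v in values]  # lossy approximation of each value
    by_coarse = sorted(range(len(values)), key=lambda i: abs(coarse[i] - query))
    if refine_factor is None:
        return by_coarse[:k]  # trust the approximate ranking as-is
    # Read k * refine_factor candidates, then re-rank them by exact distance.
    candidates = by_coarse[: k * refine_factor]
    return sorted(candidates, key=lambda i: abs(values[i] - query))[:k]


values = [0.0, 0.45, 1.4, 2.0]
approx = search(values, query=0.6, k=1)
refined = search(values, query=0.6, k=1, refine_factor=3)
print(approx, refined)
```

Without refinement the approximate ranking picks index 2 (value 1.4); reading three candidates and re-ranking them exactly recovers index 1 (value 0.45), the true nearest neighbour. The cost is the extra reads, which is the trade-off the parameter controls.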

For LanceDB cloud:

vector_store = LanceDBVectorStore(
    uri="db://db_name",  # your remote DB URI
    api_key="sk_..",  # lancedb cloud api key
    region="your-region",  # the region you configured
    # ...
)

vector_store = LanceDBVectorStore(
    uri="./lancedb", mode="overwrite", query_type="hybrid"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Query the index

We can now ask questions using our index. We can filter with MetadataFilters or with a native lance where clause.

from llama_index.core.vector_stores import (
    MetadataFilters,
    FilterOperator,
    FilterCondition,
    MetadataFilter,
)

from datetime import datetime


query_filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="creation_date",
            operator=FilterOperator.EQ,
            value=datetime.now().strftime("%Y-%m-%d"),
        ),
        MetadataFilter(
            key="file_size", value=75040, operator=FilterOperator.GT
        ),
    ],
    condition=FilterCondition.AND,
)
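
As a rough mental model, the two filters above combined with AND behave like the predicate below when applied to one node's metadata dictionary. This is a plain-Python sketch: `matches` and the `OPS` table are hypothetical helpers for illustration, not LlamaIndex or LanceDB code, and the sample metadata is copied from the output shown later in this notebook.

```python
import operator

# Hypothetical operator table: maps a short name to a comparison function.
OPS = {"==": operator.eq, ">": operator.gt}


def matches(metadata, filters, condition="and"):
    """Check (key, op, value) filters against one metadata dictionary."""
    results = [
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    ]
    return all(results) if condition == "and" else any(results)


sample = {"file_size": 75042, "creation_date": "2024-06-11"}
filters = [("creation_date", "==", "2024-06-11"), ("file_size", ">", 75040)]
print(matches(sample, filters))  # True: both conditions hold
```

Note that the creation_date filter in the notebook compares against today's date, so it only matches files whose creation_date metadata equals the day the cell is run.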

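Hybrid Search

LanceDB offers hybrid search with reranking capabilities. For complete documentation, refer to the LanceDB documentation.

This example uses the colbert reranker. The following cell installs the dependencies colbert needs. If you choose a different reranker, adjust the dependencies accordingly.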
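
Hybrid search has to merge two ranked lists, one from vector search and one from full-text search, into a single ordering. The sketch below shows one common way to do that, reciprocal rank fusion. It is not the ColBERT reranker used in this example (which re-scores results with a model), and the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical ids from vector search
keyword_hits = ["b", "d", "a"]  # hypothetical ids from full-text search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Ids that appear near the top of both lists ("b" and "a" here) rise above ids found by only one of the two searches.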

! pip install -U torch transformers tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985

If you want to add a reranker when the vector store is initialised, you can pass it as an argument like below:

from lancedb.rerankers import ColbertReranker
reranker = ColbertReranker()
vector_store = LanceDBVectorStore(uri="./lancedb", reranker=reranker, mode="overwrite")

import lancedb

from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
vector_store._add_reranker(reranker)

query_engine = index.as_query_engine(
    filters=query_filters,
    # vector_store_kwargs={
    #     "query_type": "fts",
    # },
)

response = query_engine.query("How much did Viaweb charge per month?")

print(response)
print("metadata -", response.metadata)

Viaweb charged $100 a month for a small store and $300 a month for a big one.
metadata - {'65ed5f07-5b8a-4143-a939-e8764884828e': {'file_path': '/Users/raghavdixit/Desktop/open_source/llama_index_lance/docs/docs/examples/vector_stores/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-06-11', 'last_modified_date': '2024-06-11'}, 'be231827-20b8-4988-ac75-94fa79b3c22e': {'file_path': '/Users/raghavdixit/Desktop/open_source/llama_index_lance/docs/docs/examples/vector_stores/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-06-11', 'last_modified_date': '2024-06-11'}}

Lance filters (SQL-like) directly via the `where` clause:

lance_filter = "metadata.file_name = 'paul_graham_essay.txt' "
retriever = index.as_retriever(vector_store_kwargs={"where": lance_filter})
response = retriever.retrieve("What did the author do growing up?")
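
When the value in the where clause comes from user input or a file listing, building the string by hand is fragile. A small sketch of a helper that quotes the value, assuming the filter accepts the standard SQL convention of doubling single quotes inside string literals (worth verifying against the LanceDB filtering documentation); `eq_filter` is a hypothetical name, not a LanceDB or LlamaIndex function:

```python
def eq_filter(column, value):
    """Build a `column = 'value'` clause, doubling single quotes in the value."""
    escaped = str(value).replace("'", "''")
    return f"{column} = '{escaped}'"


lance_filter = eq_filter("metadata.file_name", "paul_graham_essay.txt")
print(lance_filter)
```

The resulting string can be passed as the "where" entry of vector_store_kwargs in the same way as the hand-written filter above.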

print(response[0].get_content())
print("metadata -", response[0].metadata)

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.

With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]

The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.

Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.

Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.

I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.

AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world.
metadata - {'file_path': '/Users/raghavdixit/Desktop/open_source/llama_index_lance/docs/docs/examples/vector_stores/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-06-11', 'last_modified_date': '2024-06-11'}

Appending data

You can also add data to an existing index.

nodes = [node.node for node in response]

del index

index = VectorStoreIndex.from_documents(
    [Document(text="The sky is purple in Portland, Maine")],
    uri="/tmp/new_dataset",
)

index.insert_nodes(nodes)

query_engine = index.as_query_engine()
response = query_engine.query("Where is the sky purple?")
print(textwrap.fill(str(response), 100))

Portland, Maine

You can also create an index from an existing table.

del index

vec_store = LanceDBVectorStore.from_table(vector_store._table)
index = VectorStoreIndex.from_vector_store(vec_store)

query_engine = index.as_query_engine()
response = query_engine.query("What companies did the author start?")
print(textwrap.fill(str(response), 100))

The author started Viaweb and Aspra.
