Pinecone Vector Store - Hybrid Search
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]
%pip install llama-index-vector-stores-pinecone "transformers[torch]"
In [ ]
from pinecone import Pinecone, ServerlessSpec

import os

os.environ["PINECONE_API_KEY"] = "..."
os.environ["OPENAI_API_KEY"] = "sk-..."

api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)
In [ ]
# delete if needed
pc.delete_index("quickstart")
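Note that delete_index raises an error if the index does not exist. If you want the cleanup to be idempotent, you can guard it; a minimal sketch, assuming the v3+ Pinecone client, where list_indexes() returns an object whose names() method lists the existing indexes:

# Delete the index only if it actually exists.
if "quickstart" in pc.list_indexes().names():
    pc.delete_index("quickstart")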
In [ ]
# dimensions are for text-embedding-ada-002
# NOTE: needs dotproduct for hybrid search
pc.create_index(
    name="quickstart",
    dimension=1536,
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# If you need to create a PodBased Pinecone index, you could alternatively do this:
#
# from pinecone import Pinecone, PodSpec
#
# pc = Pinecone(api_key='xxx')
#
# pc.create_index(
#     name='my-index',
#     dimension=1536,
#     metric='cosine',
#     spec=PodSpec(
#         environment='us-east1-gcp',
#         pod_type='p1.x1',
#         pods=1
#     )
# )
#
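Index creation is not always instantaneous; a freshly created serverless index can take a few seconds before it accepts upserts. A minimal polling sketch, assuming the v3+ Pinecone client, where describe_index exposes a status dict with a ready flag:

import time

# Block until Pinecone reports the index as ready to serve requests.
while not pc.describe_index("quickstart").status["ready"]:
    time.sleep(1)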
In [ ]
pinecone_index = pc.Index("quickstart")
Download Data
In [ ]
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents, build the PineconeVectorStore
When add_sparse_vector=True, the PineconeVectorStore will compute a sparse vector for each document.
By default, it uses simple token frequency to compute the sparse vectors. However, you can also specify a custom sparse embedding model.
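To make the default concrete, here is an illustrative sketch of token-frequency sparse encoding in the {"indices": [...], "values": [...]} shape that Pinecone expects. The hash-based index mapping below is a stand-in for a real tokenizer vocabulary, not the exact scheme PineconeVectorStore uses:

from collections import Counter

def token_frequency_sparse(text: str) -> dict:
    # Count occurrences of each lowercased token.
    counts = Counter(text.lower().split())
    # Map each token to an integer id (stable within one process);
    # a real implementation would use a tokenizer vocabulary.
    indices = [hash(token) % (2**31) for token in counts]
    values = [float(count) for count in counts.values()]
    return {"indices": indices, "values": values}

print(token_frequency_sparse("hybrid search mixes dense and sparse search"))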
In [ ]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.pinecone import PineconeVectorStore
from IPython.display import Markdown, display
In [ ]
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
In [ ]
# set add_sparse_vector=True to compute sparse vectors during upsert
from llama_index.core import StorageContext

if "OPENAI_API_KEY" not in os.environ:
    raise EnvironmentError("Environment variable OPENAI_API_KEY is not set")

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    add_sparse_vector=True,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Upserted vectors: 0%| | 0/22 [00:00<?, ?it/s]
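The huggingface/tokenizers warning above is harmless here. If you want to silence it, set the environment variable it mentions before building the index:

# Silence the fork-related tokenizers warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"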
Query Index
You may need to wait a minute or two for the index to be ready.
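Instead of sleeping for a fixed interval, you can poll the index stats until the upserted vectors become visible; a minimal sketch using the client's describe_index_stats (22 matches the upsert count shown above):

import time

# Wait until all 22 upserted vectors show up in the index stats.
while pinecone_index.describe_index_stats().total_vector_count < 22:
    time.sleep(2)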
In [ ]
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(vector_store_query_mode="hybrid")
response = query_engine.query("What happened at Viaweb?")
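Hybrid mode fuses the dense and sparse scores. The balance is commonly controlled with an alpha parameter (1.0 = pure dense, 0.0 = pure sparse); a hedged sketch, assuming as_query_engine forwards alpha to the underlying retriever:

# alpha=0.5 weights dense and sparse scores equally.
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", alpha=0.5
)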
In [ ]
display(Markdown(f"<b>{response}</b>"))
display(Markdown(f"{response}"))
Paul Graham started Viaweb because he needed money. As the company grew, he realized he didn't want to run a big company, so he decided to build part of the vision as an open source project. Eventually, Viaweb was acquired by Yahoo in the summer of 1998, which was a huge relief for Paul Graham.
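You can also drop down to the retriever level to control how many candidates each leg of the hybrid search returns. A sketch, assuming the retriever accepts sparse_top_k alongside similarity_top_k (as LlamaIndex hybrid retrievers generally do):

# Fetch up to 5 dense and 5 sparse candidates before fusing scores.
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    sparse_top_k=5,
)
nodes = retriever.retrieve("What happened at Viaweb?")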
Changing the sparse embedding model
In [ ]
%pip install llama-index-sparse-embeddings-fastembed
In [ ]
# Clear the vector store
vector_store.clear()
In [ ]
from llama_index.sparse_embeddings.fastembed import FastEmbedSparseEmbedding

sparse_embedding_model = FastEmbedSparseEmbedding(
    model_name="prithivida/Splade_PP_en_v1"
)

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    add_sparse_vector=True,
    sparse_embedding_model=sparse_embedding_model,
)
Fetching 5 files: 0%| | 0/5 [00:00<?, ?it/s]
In [ ]
# Rebuild the storage context so indexing uses the new vector store.
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Upserted vectors: 0%| | 0/22 [00:00<?, ?it/s]
Wait a minute for the upload to finish...
In [ ]
# Recreate the query engine so it queries the rebuilt index.
query_engine = index.as_query_engine(vector_store_query_mode="hybrid")

response = query_engine.query("What happened at Viaweb?")
display(Markdown(f"<b>{response}</b>"))
Paul Graham started Viaweb because he needed money. He recruited a team to build the software and services, focusing on creating an application builder and network infrastructure. However, halfway through the summer, Paul realized he didn't want to run a big company and decided to turn part of the project into an open source project. This led to the development of Arc, a new dialect of Lisp. Eventually, Viaweb was sold to Yahoo in the summer of 1998, which came as a relief to Paul Graham and let him move on to a new chapter of his life.