Docstore Demo

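This guide shows how to use the DocumentStore abstraction directly. By putting nodes into a docstore, you can define multiple indexes over the same underlying docstore instead of duplicating the data across indexes.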
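The idea can be sketched without LlamaIndex at all. The following is a simplified illustration, not the library's implementation: a plain dict stands in for the docstore, and the classes `ToyDocstore`, `ListIndex`, and `FirstWordIndex` are hypothetical names made up for this example. Both toy indexes keep only node IDs, so the node text is stored once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    text: str


@dataclass
class ToyDocstore:
    """Hypothetical stand-in for a document store: node_id -> Node."""

    docs: dict = field(default_factory=dict)

    def add(self, nodes):
        for node in nodes:
            # Re-adding an existing ID overwrites the entry, it never duplicates it.
            self.docs[node.node_id] = node


class ListIndex:
    """Keeps node IDs in insertion order; the text stays in the docstore."""

    def __init__(self, nodes, docstore):
        docstore.add(nodes)
        self.docstore = docstore
        self.node_ids = [n.node_id for n in nodes]

    def texts(self):
        return [self.docstore.docs[i].text for i in self.node_ids]


class FirstWordIndex:
    """Maps each node's first word to its ID; again, IDs only."""

    def __init__(self, nodes, docstore):
        docstore.add(nodes)
        self.docstore = docstore
        self.table = {n.text.split()[0].lower(): n.node_id for n in nodes}

    def lookup(self, word):
        node_id = self.table.get(word.lower())
        return self.docstore.docs[node_id].text if node_id else None


toy_nodes = [Node("n1", "Lisp came first."), Node("n2", "Painting came later.")]
toy_store = ToyDocstore()
list_index = ListIndex(toy_nodes, toy_store)
word_index = FirstWordIndex(toy_nodes, toy_store)

print(len(toy_store.docs))  # two indexes were built, but the store holds each node once
print(word_index.lookup("painting"))
```

Because each index holds references rather than copies, adding a third or fourth index over the same nodes would not grow the store.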

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-llms-openai

!pip install llama-index

import nest_asyncio

nest_asyncio.apply()

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.core import SummaryIndex
from llama_index.core import ComposableGraph
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Load Documents

reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

Parse into Nodes

from llama_index.core.node_parser import SentenceSplitter

nodes = SentenceSplitter().get_nodes_from_documents(documents)
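`SentenceSplitter` tries to keep sentences intact when it cuts documents into nodes. The sketch below is a deliberately simplified stand-in for that idea, not the library's algorithm: it splits on sentence-ending punctuation with a regular expression and greedily packs sentences into chunks measured in characters, whereas the real splitter measures tokens and supports overlap between chunks. The function name `split_into_chunks` and its `max_chars` parameter are made up for this illustration.

```python
import re


def split_into_chunks(text, max_chars=80):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split on whitespace that follows ., ! or ? so punctuation stays with its sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk that is already full enough
            current = sentence      # a sentence longer than max_chars becomes its own chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Nodes are chunks of a document. Each node has an ID. "
    "Indexes refer to nodes by ID. The text lives in the docstore."
)
for chunk in split_into_chunks(sample, max_chars=70):
    print(repr(chunk))
```

No sentence is ever cut in half here, which is the property that matters for retrieval quality; the price is that chunk sizes vary.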

Add to Docstore

from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

Define Multiple Indexes

Each index uses the same underlying nodes.

from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(docstore=docstore)
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_table_index = SimpleKeywordTableIndex(
    nodes, storage_context=storage_context
)
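To make the sharing concrete for one index type, here is a toy keyword table. It is only a rough illustration under stated assumptions and not how `SimpleKeywordTableIndex` is implemented: the `toy_docstore` dict, the `STOPWORDS` set, and the helper names are all hypothetical. The table maps keywords to node IDs, and the text is fetched from the shared store only at query time.

```python
import re
from collections import defaultdict

# Toy stopword list, chosen only for this example.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "what", "did", "do"}


def keywords(text):
    """Lowercased words minus stopwords; a crude stand-in for keyword extraction."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def build_keyword_table(docstore):
    """Map each keyword to the set of node IDs whose text contains it."""
    table = defaultdict(set)
    for node_id, text in docstore.items():
        for word in keywords(text):
            table[word].add(node_id)
    return table


def retrieve(query, table, docstore):
    """Rank node IDs by number of matching query keywords, then fetch the text."""
    hits = defaultdict(int)
    for word in keywords(query):
        for node_id in table.get(word, ()):
            hits[node_id] += 1
    # Most matches first; ties broken by node ID so the order is deterministic.
    ranked = sorted(hits, key=lambda node_id: (-hits[node_id], node_id))
    return [docstore[node_id] for node_id in ranked]


toy_docstore = {  # node_id -> text, shared by any number of indexes
    "n1": "The author studied painting in Florence.",
    "n2": "The author wrote a new Lisp called Bel.",
}
table = build_keyword_table(toy_docstore)
print(retrieve("What did the author do with Lisp?", table, toy_docstore))
```

The table itself never stores node text, which is why several such structures can sit on top of one docstore without duplicating data.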

# NOTE: the docstore still has the same nodes
len(storage_context.docstore.docs)

Test Out Some Queries

llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

Settings.llm = llm
Settings.chunk_size = 1024

WARNING:llama_index.llm_predictor.base:Unknown max input size for gpt-3.5-turbo, using defaults.
Unknown max input size for gpt-3.5-turbo, using defaults.

query_engine = summary_index.as_query_engine()
response = query_engine.query("What is a summary of this document?")

query_engine = vector_index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

query_engine = keyword_table_index.as_query_engine()
response = query_engine.query("What did the author do after his time at YC?")

print(response)

After his time at YC, the author decided to take a break and focus on painting. He spent most of 2014 painting and then, in November, he ran out of steam and stopped. He then moved to Florence, Italy to attend the Accademia di Belle Arti di Firenze, where he studied painting and drawing. He also started painting still lives in his bedroom at night. In March 2015, he started working on Lisp again and wrote a new Lisp, called Bel, in itself in Arc. He wrote essays through 2020, but also started to think about other things he could work on. He wrote an essay for himself to answer the question of how he should choose what to do next and then wrote a more detailed version for others to read. He also created the Y Combinator logo, which was an inside joke referencing the Viaweb logo, a white V on a red circle, so he made the YC logo a white Y on an orange square. He also created a fund for YC for a couple of years, but after Heroku got bought, he had enough money to go back to being self-funded. He also disliked the term "deal flow" because it implies that the number of new startups at any given time