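Advanced RAG with temporal filters using LlamaIndex and KDB.AI vector store
Note: This example requires a KDB.AI endpoint and API key. Sign up for a free KDB.AI account.
KDB.AI is a powerful knowledge-based vector database and search engine that lets you build scalable, reliable AI applications on real-time data by providing advanced search, recommendation and personalization.
This example demonstrates how to use KDB.AI to run semantic search, summarization and analysis of financial regulations around specific moments in time.
To access your endpoint and API key, sign up for KDB.AI.
To set up your development environment, follow the instructions on the KDB.AI prerequisites page.
The following examples demonstrate some of the ways you can interact with KDB.AI through LlamaIndex.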
!pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-readers-file llama-index-vector-stores-kdbai
!pip install kdbai_client pandas
Import dependencies
from getpass import getpass
import re
import os
import shutil
import time
import logging
import urllib.error
import urllib.request
import datetime
import pandas as pd
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.kdbai import KDBAIVectorStore
import kdbai_client as kdbai
OUTDIR = "pdf"
RESET = True
Set the OpenAI API key and choose the LLM and embedding model to use:
# os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)
import os
from getpass import getpass

# Set the OpenAI API key
if "OPENAI_API_KEY" in os.environ:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
else:
    # Prompt the user to enter the API key
    OPENAI_API_KEY = getpass("OPENAI API KEY: ")
    # Save the API key as an environment variable for the current session
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
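The same "read from the environment, otherwise prompt" pattern is used for the OpenAI key and again for the KDB.AI endpoint and key. As a minimal sketch, it could be factored into one helper; `get_secret` is a hypothetical name introduced here for illustration and is not part of LlamaIndex or the KDB.AI client.

```python
import os
from getpass import getpass


def get_secret(name, prompt=None, hidden=True):
    """Return os.environ[name]; if it is unset or empty, prompt once and cache it."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value; input echoes it (fine for an endpoint URL)
        ask = getpass if hidden else input
        value = ask(prompt or f"{name}: ")
        os.environ[name] = value  # keep it for the rest of the session
    return value
```

With this helper, the cells could read `get_secret("OPENAI_API_KEY", "OpenAI API key: ")` and `get_secret("KDBAI_ENDPOINT", "KDB.AI endpoint: ", hidden=False)`.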
EMBEDDING_MODEL = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o-mini"
llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)
Settings.llm = llm
Settings.embed_model = embed_model
Create a KDB.AI session and table
# vector DB imports
import os
from getpass import getpass
import kdbai_client as kdbai
import time
# Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)
# session = kdbai.Session()
Create the schema for your KDB.AI table
!!! Note: The 'dims' parameter of the index defined on the embeddings column must match the output dimensions of the embedding model you choose.
- OpenAI 'text-embedding-3-small' outputs 1536 dimensions.
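To make the note above concrete, here is a small sketch of a sanity check that compares an index definition against the embedding model before the table is created. `EMBEDDING_DIMS` and `check_index_dims` are hypothetical helpers introduced for illustration; the listed sizes are the default output dimensions of these OpenAI models, and the index dictionary mirrors the one used in this notebook.

```python
# Default output dimensions of some OpenAI embedding models
# (assumes the optional `dimensions` override is not used).
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_index_dims(index, model_name):
    """Raise ValueError if the index 'dims' differs from the model's output size."""
    expected = EMBEDDING_DIMS[model_name]
    actual = index["params"]["dims"]
    if actual != expected:
        raise ValueError(
            f"index {index['name']!r} has dims={actual}, "
            f"but {model_name} produces {expected}-dimensional vectors"
        )
    return actual


# Same shape as the index definition used in this notebook
example_index = {
    "name": "flat_index",
    "type": "flat",
    "column": "embeddings",
    "params": {"dims": 1536, "metric": "L2"},
}
check_index_dims(example_index, "text-embedding-3-small")
```

Failing early here is cheaper than discovering the mismatch when the first batch of embeddings is inserted.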
schema = [
    {"name": "document_id", "type": "bytes"},
    {"name": "text", "type": "bytes"},
    {"name": "embeddings", "type": "float32s"},
    {"name": "title", "type": "str"},
    {"name": "publication_date", "type": "datetime64[ns]"},
]

indexFlat = {
    "name": "flat_index",
    "type": "flat",
    "column": "embeddings",
    "params": {"dims": 1536, "metric": "L2"},
}
KDBAI_TABLE_NAME = "reports"
database = session.database("default")
# First ensure the table does not already exist
for table in database.tables:
    if table.name == KDBAI_TABLE_NAME:
        table.drop()
        break

# Create the table
table = database.create_table(
    KDBAI_TABLE_NAME, schema=schema, indexes=[indexFlat]
)
Financial report URLs and metadata
INPUT_URLS = [
    "https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf",
    "https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf",
]

METADATA = {
    "pdf/PLAW-106publ102.pdf": {
        "title": "GRAMM–LEACH–BLILEY ACT, 1999",
        "publication_date": pd.to_datetime("1999-11-12"),
    },
    "pdf/PLAW-111publ203.pdf": {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010",
        "publication_date": pd.to_datetime("2010-07-21"),
    },
}
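The METADATA keys have to match the local paths the download step produces (the output directory joined with the last segment of each URL). The sketch below checks that every URL has a metadata entry before anything is downloaded; `local_path` and `missing_metadata` are hypothetical helpers, and the keys are built with `os.path.join` rather than hard-coded so the check behaves the same on any platform.

```python
import os

OUTDIR = "pdf"  # same value as in this notebook


def local_path(url, outdir=OUTDIR):
    """Path the download step writes to: <outdir>/<last URL segment>."""
    return os.path.join(outdir, os.path.basename(url))


def missing_metadata(urls, metadata, outdir=OUTDIR):
    """Return the local paths that have no entry in the metadata dict."""
    paths = [local_path(url, outdir) for url in urls]
    return [path for path in paths if path not in metadata]


urls = [
    "https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf",
    "https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf",
]
metadata = {
    local_path(urls[0]): {"title": "GRAMM–LEACH–BLILEY ACT, 1999"},
    local_path(urls[1]): {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010"
    },
}
print(missing_metadata(urls, metadata))
```

An empty list means every downloaded file will find its title and publication date when the documents are loaded.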
Download the PDF files locally
%%time
CHUNK_SIZE = 512 * 1024


def download_file(url):
    print("Downloading %s..." % url)
    out = os.path.join(OUTDIR, os.path.basename(url))
    try:
        response = urllib.request.urlopen(url)
    except urllib.error.URLError as e:
        logging.exception("Failed to download %s !" % url)
    else:
        with open(out, "wb") as f:
            while True:
                chunk = response.read(CHUNK_SIZE)
                if chunk:
                    f.write(chunk)
                else:
                    break
    return out


if RESET:
    if os.path.exists(OUTDIR):
        shutil.rmtree(OUTDIR)
    os.mkdir(OUTDIR)

    local_files = [download_file(x) for x in INPUT_URLS]
    local_files[:10]
Downloading https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf...
Downloading https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf...
CPU times: user 52.6 ms, sys: 1.2 ms, total: 53.8 ms
Wall time: 7.86 s
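The read/write loop in download_file copies the response to disk in CHUNK_SIZE pieces so that a large PDF is never held in memory at once. As a sketch, the same copy can be delegated to the standard library's `shutil.copyfileobj`. The `io.BytesIO` objects below stand in for the HTTP response and the output file, an assumption made only so the snippet runs offline; `copy_stream` is a hypothetical helper name.

```python
import io
import shutil

CHUNK_SIZE = 512 * 1024  # same chunk size as in this notebook


def copy_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Copy a readable binary stream to a writable one, chunk_size bytes at a time."""
    shutil.copyfileobj(src, dst, chunk_size)


# Stand-ins for the urlopen() response and the open(out, "wb") file
source = io.BytesIO(b"x" * (CHUNK_SIZE + 10))
target = io.BytesIO()
copy_stream(source, target)
print(len(target.getvalue()))
```

In download_file, `copy_stream(response, f)` would replace the explicit `while True` loop.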
Load the local PDF files with LlamaIndex
%%time
def get_metadata(filepath):
    return METADATA[filepath]


documents = SimpleDirectoryReader(
    input_files=local_files,
    file_metadata=get_metadata,
)
docs = documents.load_data()
len(docs)
CPU times: user 8.22 s, sys: 9.04 ms, total: 8.23 s Wall time: 8.23 s
994
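get_metadata looks up METADATA with the exact path string it receives, so it raises a KeyError if the reader passes the path in a different form (for example as an absolute path). Whether that happens may depend on the installed LlamaIndex version, which is an assumption here. A more tolerant sketch matches on the file name only; `make_metadata_lookup` and the sample entry are hypothetical.

```python
import os


def make_metadata_lookup(metadata):
    """Build a file_metadata-style callable that matches on the file name only."""
    by_name = {os.path.basename(path): meta for path, meta in metadata.items()}

    def get_metadata(filepath):
        name = os.path.basename(str(filepath))
        if name not in by_name:
            raise KeyError(f"no metadata registered for {name!r}")
        return by_name[name]

    return get_metadata


# Hypothetical sample entry, shaped like METADATA in this notebook
lookup = make_metadata_lookup({"pdf/a.pdf": {"title": "A"}})
print(lookup("/some/absolute/dir/pdf/a.pdf"))
```

The trade-off is that two files with the same name in different directories would collide, which is not a concern for the two PDFs used here.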
Set up the LlamaIndex RAG pipeline with the KDB.AI vector store
%%time
# llm = OpenAI(temperature=0, model=LLM)
vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=2048, chunk_overlap=0)],
)
CPU times: user 3.67 s, sys: 31.9 ms, total: 3.7 s Wall time: 22.3 s
table.query()
| | document_id | text | embeddings | title | publication_date |
|---|---|---|---|---|---|
0 | b'272d7d24-c232-41b6-823e-27aa6203c100' | b'PUBLIC LAW 106\xc2\xb1102\xc3\x90NOV. 12, 19... | [0.034452137, 0.03166917, -0.011892043, 0.0184... | GRAMM–LEACH–BLILEY ACT, 1999 | 1999-11-12 |
1 | b'89e3f2ee-f5a6-4e40-bb81-0632f08341f0' | b"113 STAT. 1338 PUBLIC LAW 106\xc2\xb1102\xc3... | [0.02164333, 1.0030156e-05, 0.0028665832, 0.02... | GRAMM–LEACH–BLILEY ACT, 1999 | 1999-11-12 |
2 | b'56fbe82a-5458-4a4a-a5ed-026d9399151d' | b'113 STAT. 1339 PUBLIC LAW 106\xc2\xb1102\xc3... | [0.01380091, 0.026945233, 0.02838467, 0.043132... | GRAMM–LEACH–BLILEY ACT, 1999 | 1999-11-12 |
3 | b'b6bf9e48-51b6-45d9-9259-b6346f93831f' | b'113 STAT. 1340 PUBLIC LAW 106\xc2\xb1102\xc3... | [0.0070182937, 0.014063503, 0.026525516, 0.040... | GRAMM–LEACH–BLILEY ACT, 1999 | 1999-11-12 |
4 | b'f398b133-b4f5-4a34-94d1-9a97fdb658e5' | b"113 STAT. 1341 PUBLIC LAW 106\xc2\xb1102\xc3... | [0.025041763, 0.01968024, 0.030940715, 0.02899... | GRAMM–LEACH–BLILEY ACT, 1999 | 1999-11-12 |
... | ... | ... | ... | ... | ... |
989 | b'8e84d1d5-d87d-4351-b7eb-5d569fdb8d9c' | b'124 STAT. 2219 PUBLIC LAW 111\xe2\x80\x93203... | [0.024505286, 0.015549232, 0.0536601, 0.028532... | DODD-FRANK WALL STREET REFORM AND CONSUMER PRO... | 2010-07-21 |
990 | b'0c47f590-050c-4374-bf8c-2a4502dc980f' | b'124 STAT. 2220 PUBLIC LAW 111\xe2\x80\x93203... | [0.014071382, -0.0044553108, 0.03662071, 0.035... | DODD-FRANK WALL STREET REFORM AND CONSUMER PRO... | 2010-07-21 |
991 | b'63a2235f-d368-43b8-a1a9-a5a11d497245' | b'124 STAT. 2221 PUBLIC LAW 111\xe2\x80\x93203... | [0.0005448305, 0.013075933, 0.044821188, 0.031... | DODD-FRANK WALL STREET REFORM AND CONSUMER PRO... | 2010-07-21 |
992 | b'bac4d75e-4867-4d89-a71e-09a6762bf3c4' | b'124 STAT. 2222 PUBLIC LAW 111\xe2\x80\x93203... | [0.032077603, 0.016817383, 0.04507993, 0.03376... | DODD-FRANK WALL STREET REFORM AND CONSUMER PRO... | 2010-07-21 |
993 | b'e262e4da-f6e1-4b9d-9232-77fc3f0c81a7' | b'124 STAT. 2223 PUBLIC LAW 111\xe2\x80\x93203... | [0.0387719, -0.025150038, 0.030345473, 0.04303... | DODD-FRANK WALL STREET REFORM AND CONSUMER PRO... | 2010-07-21 |
994 rows × 5 columns
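The document_id and text columns are stored with the "bytes" type, which is why the values above print as b'...' literals with escape sequences such as \xe2\x80\x93. As a small sketch, decoding them as UTF-8 recovers readable strings; the two sample values are copied from the table above.

```python
# Sample values copied from the table output above
raw_id = b"272d7d24-c232-41b6-823e-27aa6203c100"
raw_text = b"124 STAT. 2219 PUBLIC LAW 111\xe2\x80\x93203"

doc_id = raw_id.decode("utf-8")
text = raw_text.decode("utf-8")  # \xe2\x80\x93 is the UTF-8 encoding of an en dash

print(doc_id)
print(text)
```

Assuming the query result is a pandas DataFrame (the "rows × columns" summary suggests so), the whole column could be decoded with `df["text"].str.decode("utf-8")`.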
Set up the LlamaIndex query engine
%%time
# Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
K = 15
query_engine = index.as_query_engine(
    similarity_top_k=K,
    vector_store_kwargs={
        "index": "flat_index",
        "filter": [["<", "publication_date", datetime.date(2008, 9, 15)]],
        "sort_columns": "publication_date",
    },
)
CPU times: user 512 μs, sys: 23 μs, total: 535 μs Wall time: 550 μs
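Conceptually, the "filter" entry restricts which rows are eligible, and the similarity search then ranks only those rows. The toy sketch below illustrates that idea in plain Python with an exhaustive L2 search; it is not how KDB.AI implements filtering, and the rows, 2-dimensional vectors and helper names are all hypothetical.

```python
import datetime
import math
import operator

# Comparison operators used in the [operator, column, value] filter triples
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def matches(row, filters):
    """True if the row satisfies every [op, column, value] condition."""
    return all(OPS[op](row[column], value) for op, column, value in filters)


def filtered_search(rows, query_vector, filters, k):
    """Keep rows passing the filters, then return the k nearest by L2 distance."""
    candidates = [row for row in rows if matches(row, filters)]
    candidates.sort(key=lambda row: math.dist(row["embeddings"], query_vector))
    return candidates[:k]


rows = [
    {"title": "old-a", "publication_date": datetime.date(1999, 11, 12),
     "embeddings": [0.0, 1.0]},
    {"title": "old-b", "publication_date": datetime.date(1999, 11, 12),
     "embeddings": [1.0, 1.0]},
    {"title": "new-a", "publication_date": datetime.date(2010, 7, 21),
     "embeddings": [0.0, 0.9]},
]
cutoff = datetime.date(2008, 9, 15)
query = [0.0, 0.8]

before = filtered_search(rows, query, [["<", "publication_date", cutoff]], k=1)
after = filtered_search(rows, query, [[">=", "publication_date", cutoff]], k=1)
print(before[0]["title"], after[0]["title"])
```

Note that "new-a" is the closest vector overall, yet the "<" filter excludes it, so the pre-crisis search returns "old-a". This is why the pre-2008 queries in this notebook can only draw on the 1999 act.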
Before the 2008 crisis
%%time
result = query_engine.query(
    """
    What was the main financial regulation in the US before the 2008 financial crisis ?
    """
)
print(result.response)
The main financial regulation in the US before the 2008 financial crisis was the Gramm-Leach-Bliley Act, enacted in 1999. This act facilitated the affiliation among banks, securities firms, and insurance companies, effectively repealing parts of the Glass-Steagall Act, which had previously separated these financial services. The Gramm-Leach-Bliley Act aimed to enhance competition in the financial services industry by providing a framework for the integration of various financial institutions.
CPU times: user 61.8 ms, sys: 0 ns, total: 61.8 ms
Wall time: 4.24 s
%%time
result = query_engine.query(
    """
    Is the Gramm-Leach-Bliley Act of 1999 enough to prevent the 2008 crisis? Search the document and explain its strengths and weaknesses to regulate the US stock market.
    """
)
print(result.response)
The Gramm-Leach-Bliley Act of 1999 aimed to enhance competition in the financial services industry by allowing affiliations among banks, securities firms, and insurance companies. Its strengths include the repeal of the Glass-Steagall Act, which had previously separated commercial banking from investment banking, thereby enabling financial institutions to diversify their services and potentially increase competition. This diversification could lead to more innovative financial products and services.
However, the Act also has notable weaknesses. By allowing greater affiliations and reducing regulatory barriers, it may have contributed to the creation of "too big to fail" institutions, which posed systemic risks to the financial system. The lack of stringent oversight and the ability for financial holding companies to engage in a wide range of activities without adequate regulation may have led to excessive risk-taking. Additionally, the Act did not sufficiently address the complexities of modern financial products, such as derivatives, which played a significant role in the 2008 financial crisis.
In summary, while the Gramm-Leach-Bliley Act aimed to foster competition and innovation in the financial sector, its regulatory framework may have inadvertently facilitated the conditions that led to the financial crisis, highlighting the need for a more robust regulatory approach to oversee the interconnectedness and risks within the financial system.
CPU times: user 45.7 ms, sys: 255 μs, total: 46 ms
Wall time: 21.9 s
After the 2008 crisis
%%time
# Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
K = 15
query_engine = index.as_query_engine(
    similarity_top_k=K,
    vector_store_kwargs={
        "index": "flat_index",
        "filter": [[">=", "publication_date", datetime.date(2008, 9, 15)]],
        "sort_columns": "publication_date",
    },
)
CPU times: user 171 μs, sys: 0 ns, total: 171 μs Wall time: 175 μs
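The "before" engine uses "<" and this "after" engine uses ">=" with the same cutoff date, so every document falls into exactly one of the two searches. As a sketch, deriving both filters from a single cutoff keeps them complementary; `split_filters` is a hypothetical helper that only builds the list structure already used in this notebook.

```python
import datetime


def split_filters(column, cutoff):
    """Return (before, since) filters that together cover every date exactly once."""
    before = [["<", column, cutoff]]
    since = [[">=", column, cutoff]]
    return before, since


CRISIS_CUTOFF = datetime.date(2008, 9, 15)  # cutoff date used in this notebook
before_filter, since_filter = split_filters("publication_date", CRISIS_CUTOFF)
print(before_filter)
print(since_filter)
```

Either list can then be passed as the "filter" value in vector_store_kwargs.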
%%time
result = query_engine.query(
    """
    What happened on the 15th of September 2008 ?
    """
)
print(result.response)
On the 15th of September 2008, Lehman Brothers, a major global financial services firm, filed for bankruptcy. This event marked one of the largest bankruptcies in U.S. history and was a significant moment in the financial crisis of 2007-2008, leading to widespread panic in financial markets and contributing to the global economic downturn.
CPU times: user 51.4 ms, sys: 0 ns, total: 51.4 ms
Wall time: 3.6 s
%%time
result = query_engine.query(
    """
    What was the new US financial regulation enacted after the 2008 crisis to increase the market regulation and to improve consumer sentiment ?
    """
)
print(result.response)
The new US financial regulation enacted after the 2008 crisis to increase market regulation and improve consumer sentiment is the Dodd-Frank Wall Street Reform and Consumer Protection Act, which was signed into law on July 21, 2010. This legislation aimed to promote financial stability, enhance accountability and transparency in the financial system, and protect consumers from abusive financial practices.
CPU times: user 43.7 ms, sys: 0 ns, total: 43.7 ms
Wall time: 4.55 s
In-depth analysis
%%time
# Using gpt-4o-mini, the 128k tokens context size can take 100 pages.
K = 20
query_engine = index.as_query_engine(
    similarity_top_k=K,
    vector_store_kwargs={
        "index": "flat_index",
        "sort_columns": "publication_date",
    },
)
CPU times: user 227 μs, sys: 10 μs, total: 237 μs Wall time: 243 μs
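The comment about the 128k-token context explains why K can be raised to 20 here. A rough back-of-the-envelope sketch: with chunks of at most 2048 tokens (the SentenceSplitter setting in this notebook), K retrieved chunks occupy at most K × 2048 tokens. The estimate ignores the prompt template, metadata and the generated answer, so it is an approximation of the retrieval share only; `retrieval_budget` is a hypothetical helper.

```python
CHUNK_SIZE_TOKENS = 2048         # SentenceSplitter(chunk_size=2048) in this notebook
CONTEXT_WINDOW_TOKENS = 128_000  # "128k tokens" as stated in the code comment


def retrieval_budget(top_k, chunk_size=CHUNK_SIZE_TOKENS,
                     context_window=CONTEXT_WINDOW_TOKENS):
    """Return (max tokens used by retrieved chunks, fraction of the context window)."""
    used = top_k * chunk_size
    return used, used / context_window


for k in (15, 20):
    used, share = retrieval_budget(k)
    print(f"K={k}: at most {used} tokens, {share:.0%} of the context window")
```

Under these assumptions both K = 15 and K = 20 stay well below the context limit, which leaves room for the instructions and the generated report.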
%%time
result = query_engine.query(
    """
    Analyse the US financial regulations before and after the 2008 crisis and produce a report of all related arguments to explain what happened, and to ensure that does not happen again.
    Use both the provided context and your own knowledge but do mention explicitly which one you use.
    """
)
print(result.response)
The analysis of U.S. financial regulations before and after the 2008 financial crisis reveals significant changes aimed at preventing a recurrence of such a crisis.
Before the crisis, the regulatory framework was characterized by a lack of comprehensive oversight, particularly for nonbank financial institutions. The regulatory environment allowed for excessive risk-taking, inadequate capital requirements, and insufficient transparency in financial transactions. This environment contributed to the housing bubble and the subsequent collapse of major financial institutions, leading to widespread economic turmoil.
In response to the crisis, the Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010 was enacted. This legislation introduced several key reforms:
1. **Creation of the Financial Stability Oversight Council (FSOC)**: This body was established to monitor systemic risks and coordinate regulatory efforts across different financial sectors. It has the authority to recommend heightened standards and safeguards for financial activities that could pose risks to financial stability.
2. **Enhanced Regulatory Oversight**: Dodd-Frank imposed stricter regulations on bank holding companies and nonbank financial companies, particularly those with significant assets. This includes requirements for stress testing, capital planning, and the submission of resolution plans to ensure orderly wind-downs in case of failure.
3. **Consumer Protection Measures**: The establishment of the Consumer Financial Protection Bureau (CFPB) aimed to protect consumers from predatory lending practices and ensure transparency in financial products.
4. **Volcker Rule**: This provision restricts proprietary trading by banks and limits their investments in hedge funds and private equity funds, thereby reducing conflicts of interest and excessive risk-taking.
5. **Increased Transparency and Reporting Requirements**: Financial institutions are now required to disclose more information regarding their risk exposures and financial health, which enhances market discipline and investor confidence.
The arguments for these reforms center around the need for a more resilient financial system that can withstand economic shocks. The reforms aim to address the systemic risks that were prevalent before the crisis, ensuring that financial institutions maintain adequate capital buffers and engage in prudent risk management practices.
In conclusion, the regulatory landscape has shifted significantly since the 2008 crisis, with a focus on preventing excessive risk-taking, enhancing transparency, and protecting consumers. These measures are designed to create a more stable financial environment and mitigate the likelihood of future crises.
CPU times: user 180 ms, sys: 437 μs, total: 180 ms
Wall time: 10.5 s
Delete the KDB.AI table
Once you are finished with the table, it is best practice to drop it.
table.drop()