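Multimodal RAG for Processing Videos Using OpenAI GPT4V and LanceDB Vector Store¶
In this notebook, we showcase a multimodal RAG architecture designed for video processing. We use the OpenAI GPT4V multimodal LLM class to generate responses, CLIP to generate multimodal embeddings, and LanceDBVectorStore for efficient vector storage.
Steps
Download a video from YouTube, process it, and store it.
Build a multimodal index and vector store for both text and images.
Retrieve relevant images and context, and use them to augment the prompt.
Use GPT4V to reason about the correlations between the input query and the augmented data, and generate the final response.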
In [ ]
%pip install llama-index-vector-stores-lancedb
%pip install llama-index-multi-modal-llms-openai
In [ ]
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-vector-stores-lancedb
%pip install llama-index-embeddings-clip
In [ ]
%pip install llama_index ftfy regex tqdm
%pip install -U openai-whisper
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install lancedb
%pip install moviepy
%pip install pytube
%pip install pydub
%pip install SpeechRecognition
%pip install ffmpeg-python
%pip install soundfile
In [ ]
from moviepy.editor import VideoFileClip
from pathlib import Path
import speech_recognition as sr
from pytube import YouTube
from pprint import pprint
In [ ]
import os

OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
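Hard-coding the key in a notebook cell makes it easy to leak. The following is a minimal sketch, assuming the key may already be present in the environment; `resolve_api_key` is a hypothetical helper written for illustration, not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from `env`, prompting only if it is absent.

    Hypothetical helper: `env` and `prompt` are injectable so the
    function can be exercised without touching the real environment.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Ask interactively instead of storing the key in the notebook.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is empty")
    return key


# Usage in the notebook (prompts only when the variable is not set):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```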
Set the input configuration below¶
In [ ]
video_url = "https://www.youtube.com/watch?v=d_qvLDhkg00"
output_video_path = "./video_data/"
output_folder = "./mixed_data/"
output_audio_path = "./mixed_data/output_audio.wav"

filepath = output_video_path + "input_vid.mp4"
Path(output_folder).mkdir(parents=True, exist_ok=True)
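The configuration above builds paths by string concatenation, which depends on every folder string ending in a slash. As an alternative sketch, the same layout can be expressed with `pathlib`; `build_paths` and the dictionary keys are hypothetical names chosen for this example, while the folder and file names mirror the configuration cell.

```python
import tempfile
from pathlib import Path


def build_paths(base="."):
    """Create the working folders and return the paths used by the pipeline."""
    base = Path(base)
    video_dir = base / "video_data"
    mixed_dir = base / "mixed_data"
    for folder in (video_dir, mixed_dir):
        folder.mkdir(parents=True, exist_ok=True)
    return {
        "video_file": video_dir / "input_vid.mp4",
        "frames_dir": mixed_dir,
        "audio_file": mixed_dir / "output_audio.wav",
        "transcript_file": mixed_dir / "output_text.txt",
    }


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = build_paths(tmp)
    frames_dir_created = demo_paths["frames_dir"].is_dir()

print(demo_paths["video_file"].name, frames_dir_created)
```

Functions that expect plain strings can still receive these values through `str(path)`.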
Download the video and process it into a format suitable for generating and storing embeddings¶
In [ ]
from PIL import Image
import matplotlib.pyplot as plt
import os


def plot_images(image_paths):
    """Show up to six images in a 2x3 grid."""
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(2, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            # A 2x3 grid holds at most six subplots.
            if images_shown >= 6:
                break
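The plotting helper uses a fixed 2x3 grid, so it can show at most six images. If `image_similarity_top_k` is raised later, the grid has to grow with it. A small sketch of how the grid size could be derived from the number of images; `grid_shape` is a hypothetical helper, not part of matplotlib.

```python
import math


def grid_shape(n_images, n_cols=3):
    """Return (rows, cols) large enough to hold `n_images` subplots."""
    if n_images <= 0:
        return (0, 0)
    cols = min(n_cols, n_images)
    rows = math.ceil(n_images / cols)
    return (rows, cols)


print(grid_shape(5))  # (2, 3)
print(grid_shape(7))  # (3, 3)
```

Inside a plotting loop, the result would be used as `plt.subplot(rows, cols, index + 1)`.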
In [ ]
def download_video(url, output_path):
    """
    Download a video from a given url and save it to the output path.

    Parameters:
    url (str): The url of the video to download.
    output_path (str): The path to save the video to.

    Returns:
    dict: A dictionary containing the metadata of the video.
    """
    yt = YouTube(url)
    metadata = {"Author": yt.author, "Title": yt.title, "Views": yt.views}
    yt.streams.get_highest_resolution().download(
        output_path=output_path, filename="input_vid.mp4"
    )
    return metadata


def video_to_images(video_path, output_folder):
    """
    Convert a video to a sequence of images and save them to the output folder.

    Parameters:
    video_path (str): The path to the video file.
    output_folder (str): The path to the folder to save the images to.
    """
    clip = VideoFileClip(video_path)
    # fps=0.2 keeps one frame for every five seconds of video
    clip.write_images_sequence(
        os.path.join(output_folder, "frame%04d.png"), fps=0.2
    )


def video_to_audio(video_path, output_audio_path):
    """
    Convert a video to audio and save it to the output path.

    Parameters:
    video_path (str): The path to the video file.
    output_audio_path (str): The path to save the audio to.
    """
    clip = VideoFileClip(video_path)
    audio = clip.audio
    audio.write_audiofile(output_audio_path)


def audio_to_text(audio_path):
    """
    Convert audio to text using the SpeechRecognition library.

    Parameters:
    audio_path (str): The path to the audio file.

    Returns:
    text (str): The text recognized from the audio.
    """
    recognizer = sr.Recognizer()
    audio = sr.AudioFile(audio_path)
    text = ""  # Returned unchanged if recognition fails

    with audio as source:
        # Record the audio data
        audio_data = recognizer.record(source)

        try:
            # Recognize the speech
            text = recognizer.recognize_whisper(audio_data)
        except sr.UnknownValueError:
            print("Speech recognition could not understand the audio.")
        except sr.RequestError as e:
            print(f"Could not request results from service; {e}")

    return text
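With `fps=0.2`, `video_to_images` keeps one frame for every five seconds of video, and the `frame%04d.png` pattern encodes the frame index in the file name. The sketch below maps a frame file back to an approximate timestamp. It assumes frames are sampled at evenly spaced times starting at 0 and numbered from 0; the exact frame count MoviePy writes may differ by one, and both function names are hypothetical.

```python
import re
from pathlib import Path


def frame_timestamps(duration_s, fps=0.2):
    """Approximate timestamps (seconds) of frames sampled at `fps`."""
    if duration_s <= 0 or fps <= 0:
        return []
    step = 1.0 / fps  # seconds between consecutive frames
    timestamps = []
    index = 0
    while index * step < duration_s:
        timestamps.append(index * step)
        index += 1
    return timestamps


def timestamp_for_frame(frame_path, fps=0.2):
    """Recover the approximate timestamp from a name like 'frame0003.png'."""
    match = re.search(r"frame(\d+)\.png$", Path(frame_path).name)
    if match is None:
        raise ValueError(f"Unexpected frame file name: {frame_path}")
    return int(match.group(1)) * (1.0 / fps)


print(len(frame_timestamps(60)))                          # frames in one minute
print(timestamp_for_frame("./mixed_data/frame0003.png"))  # seconds into the video
```

A timestamp like this could be attached to the retrieved images so that an answer can point back to a position in the video.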
In [ ]
try:
    metadata_vid = download_video(video_url, output_video_path)
    video_to_images(filepath, output_folder)
    video_to_audio(filepath, output_audio_path)
    text_data = audio_to_text(output_audio_path)

    # The with block closes the file, so no explicit close() is needed.
    with open(output_folder + "output_text.txt", "w") as file:
        file.write(text_data)
    print("Text data saved to file")
    os.remove(output_audio_path)
    print("Audio file removed")

except Exception as e:
    raise e
Create the multimodal index¶
In [ ]
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.lancedb import LanceDBVectorStore

text_store = LanceDBVectorStore(uri="lancedb", table_name="text_collection")
image_store = LanceDBVectorStore(uri="lancedb", table_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Create the MultiModal index
documents = SimpleDirectoryReader(output_folder).load_data()

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
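The index is built from everything in the output folder, which at this point holds the extracted frames and the transcript file. Before indexing, it can help to check what the folder actually contains. This is a rough sketch using only the standard library; the suffix sets are assumptions made for this example and do not describe how `SimpleDirectoryReader` classifies files.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Assumed suffix groups for this example only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
TEXT_SUFFIXES = {".txt"}


def summarize_folder(folder):
    """Count the files in `folder` by rough modality."""
    counts = Counter()
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            counts["image"] += 1
        elif suffix in TEXT_SUFFIXES:
            counts["text"] += 1
        else:
            counts["other"] += 1
    return dict(counts)


# Demo on a throwaway folder that mimics ./mixed_data/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("frame0000.png", "frame0001.png", "output_text.txt"):
        (Path(tmp) / name).touch()
    demo_summary = summarize_folder(tmp)

print(demo_summary)
```

An unexpected `other` count, or zero images, is a hint that an earlier step did not produce the files the index expects.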
Use the index as a retriever to fetch the top k (5 in this example) results from the multimodal vector index¶
In [ ]
retriever_engine = index.as_retriever(
    similarity_top_k=5, image_similarity_top_k=5
)
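Conceptually, a top-k retriever scores every stored embedding against the query embedding and keeps the k best matches. The sketch below illustrates that idea with cosine similarity on toy vectors in pure Python. It is not how LanceDB or LlamaIndex implement search, which may use a different distance metric and an approximate index; `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=5):
    """Return the k (item_id, score) pairs most similar to `query_vec`.

    `items` is a list of (item_id, vector) pairs.
    """
    scored = [
        (item_id, cosine_similarity(query_vec, vec)) for item_id, vec in items
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_items = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.0], toy_items, k=2))
```

In the notebook, `similarity_top_k` and `image_similarity_top_k` apply this kind of cutoff separately to the text and image stores.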
Set up the RAG prompt template¶
In [ ]
import json

metadata_str = json.dumps(metadata_vid)

qa_tmpl_str = (
    "Given the provided information, including relevant images and retrieved context from the video, "
    "accurately and precisely answer the query without any additional prior knowledge.\n"
    "Please ensure honesty and responsibility, refraining from any racist or sexist remarks.\n"
    "---------------------\n"
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str} \n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
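The template is filled with `str.format`, so a missing or misspelled placeholder only surfaces as a `KeyError` at call time. A short sketch of checking the placeholders up front with the standard library's `string.Formatter`; `template_fields`, `missing_fields`, and the shortened `demo_tmpl` are hypothetical names used only in this example.

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders in a str.format template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template, **values):
    """List the placeholders that `values` does not cover."""
    return sorted(template_fields(template) - set(values))


# Shortened stand-in for the notebook's template.
demo_tmpl = (
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str}\n"
    "Query: {query_str}\n"
    "Answer: "
)

print(sorted(template_fields(demo_tmpl)))
print(missing_fields(demo_tmpl, context_str="..."))
```

Braces inside the JSON metadata are not a problem here, because the metadata is passed in as a value and is never parsed as part of the template.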
Retrieve the most similar text/image embeddings from the database based on the user query¶
In [ ]
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode


def retrieve(retriever_engine, query_str):
    """Return the file paths of retrieved images and the retrieved text chunks."""
    retrieval_results = retriever_engine.retrieve(query_str)

    retrieved_image = []
    retrieved_text = []
    for res_node in retrieval_results:
        if isinstance(res_node.node, ImageNode):
            retrieved_image.append(res_node.node.metadata["file_path"])
        else:
            display_source_node(res_node, source_length=200)
            retrieved_text.append(res_node.text)

    return retrieved_image, retrieved_text
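The `retrieve` function splits the results by node type: image nodes contribute a file path, everything else contributes text. The sketch below reproduces that split with small stand-in classes so it runs without an index, and adds two defensive touches that are assumptions of this example rather than part of the notebook: it skips image nodes without a `file_path` and drops duplicate paths. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FakeImageNode:
    """Stand-in for an image node; only carries metadata."""
    metadata: dict = field(default_factory=dict)


@dataclass
class FakeTextNode:
    """Stand-in for a text node."""
    text: str = ""


@dataclass
class FakeResult:
    """Stand-in for a scored retrieval result wrapping a node."""
    node: object
    score: float = 0.0


def split_results(results):
    """Separate results into unique image paths and text chunks."""
    image_paths, texts = [], []
    for res in results:
        if isinstance(res.node, FakeImageNode):
            path = res.node.metadata.get("file_path")
            if path and path not in image_paths:
                image_paths.append(path)
        else:
            texts.append(res.node.text)
    return image_paths, texts


demo_results = [
    FakeResult(FakeImageNode({"file_path": "frame0001.png"}), 0.9),
    FakeResult(FakeTextNode("first chunk"), 0.8),
    FakeResult(FakeImageNode({"file_path": "frame0001.png"}), 0.7),
    FakeResult(FakeImageNode({}), 0.6),
]
print(split_results(demo_results))
```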
Add a query, fetch relevant details including images, and augment the prompt template¶
In [ ]
query_str = "Using examples from video, explain all things covered in the video regarding the gaussian function"

img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
image_documents = SimpleDirectoryReader(
    input_dir=output_folder, input_files=img
).load_data()
context_str = "".join(txt)
plot_images(img)
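Node ID: bda08ef1-137c-4d69-9bcc-b7005a41a13c
Similarity: 0.7431071996688843
Text: The normal distribution, also known as the Gaussian distribution, is built on the basic function e to the negative x squared. But you might wonder, why this function? Of all the ones we could think of that give an s...
Node ID: 7d6d0f32-ce16-461b-be54-883241252e50
Similarity: 0.7335695028305054
Text: This step is actually quite technical and goes beyond what I want to talk about here. Typically one uses objects called moment generating functions, which gives a very abstract argument...
Node ID: 519fb788-3927-4842-ad5c-88be61deaf65
Similarity: 0.7069740295410156
Text: In essence, what we want to compute is what the convolution between two copies of this function looks like. If you remember, in the last video we had two different ways to visualize convolution...
Node ID: f265c3fb-3c9f-4f36-aa2a-fb15efff9783
Similarity: 0.706935465335846
Text: This is the key point. All the parts involving s are now completely separate from the integration variable. The remaining integral is a bit tricky. I made a whole video on it. It...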
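The query cell concatenates the retrieved chunks with `"".join(txt)`, so neighbouring chunks run together with no separator and the context has no size limit. One possible refinement, sketched here under the assumption that a simple character budget is good enough; `build_context` and its defaults are hypothetical.

```python
def build_context(chunks, max_chars=4000, separator="\n\n"):
    """Join non-empty chunks with a separator, stopping at `max_chars`."""
    parts = []
    used = 0
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue
        # The separator only counts once there is a previous chunk.
        cost = len(chunk) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(chunk)
        used += cost
    return separator.join(parts)


print(repr(build_context(["first chunk ", "", "second chunk"])))
```

Because the retriever returns chunks in ranked order, cutting at the budget drops the lowest-ranked chunks first.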
Generate the final response using GPT4V¶
In [ ]
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o", api_key=OPENAI_API_KEY, max_new_tokens=1500
)

response_1 = openai_mm_llm.complete(
    prompt=qa_tmpl_str.format(
        context_str=context_str, query_str=query_str, metadata_str=metadata_str
    ),
    image_documents=image_documents,
)

pprint(response_1.text)
('The video by 3Blue1Brown, titled "A pretty reason why Gaussian + Gaussian = '
 'Gaussian," covers several aspects of the Gaussian function, also known as '
 "the normal distribution. Here's a summary of the key points discussed in the "
 'video:\n'
 '\n'
 '1. **Central Limit Theorem**: The video begins by discussing the central '
 'limit theorem, which states that the sum of multiple copies of a random '
 'variable tends to look like a normal distribution. As the number of '
 'variables increases, the approximation to a normal distribution becomes '
 'better.\n'
 '\n'
 '2. **Convolution of Random Variables**: The process of adding two random '
 'variables is mathematically represented by a convolution of their respective '
 'distributions. The video explains the concept of convolution and how it is '
 'used to find the distribution of the sum of two random variables.\n'
 '\n'
 '3. **Gaussian Function**: The Gaussian function is more complex than just '
 '\\( e^{-x^2} \\). The full formula includes a scaling factor to ensure the '
 'area under the curve is 1 (making it a valid probability distribution), a '
 'standard deviation parameter \\( \\sigma \\) to describe the spread, and a '
 'mean parameter \\( \\mu \\) to shift the center. However, the video focuses '
 'on centered distributions with \\( \\mu = 0 \\).\n'
 '\n'
 '4. **Visualizing Convolution**: The video presents a visual method to '
 'understand the convolution of two Gaussian functions using diagonal slices '
 'on the xy-plane. This method involves looking at the probability density of '
 'landing on a point (x, y) as \\( f(x) \\times g(y) \\), where f and g are '
 'the two distributions being convolved.\n'
 '\n'
 '5. **Rotational Symmetry**: A key property of the Gaussian function is its '
 'rotational symmetry, which is unique to bell curves. This symmetry is '
 'exploited in the video to simplify the calculation of the convolution. By '
 'rotating the graph 45 degrees, the computation becomes easier because the '
 'integral only involves one variable.\n'
 '\n'
 '6. **Result of Convolution**: The video demonstrates that the convolution of '
 'two Gaussian functions is another Gaussian function. This is a special '
 'property because convolutions typically result in a different kind of '
 'function. The standard deviation of the resulting Gaussian is \\( \\sqrt{2} '
 '\\times \\sigma \\) if the original Gaussians had the same standard '
 'deviation.\n'
 '\n'
 '7. **Proof of Central Limit Theorem**: The video explains that the '
 'convolution of two Gaussians being another Gaussian is a crucial step in '
 'proving the central limit theorem. It shows that the Gaussian function is a '
 'fixed point in the space of distributions, and since all distributions with '
 'finite variance tend towards a single universal shape, that shape must be '
 'the Gaussian.\n'
 '\n'
 '8. **Connection to Pi**: The video also touches on the connection between '
 'the Gaussian function and the number Pi, which appears in the formula for '
 'the normal distribution.\n'
 '\n'
 'The video aims to provide an intuitive geometric argument for why the sum of '
 'two normally distributed random variables is also normally distributed, and '
 'how this relates to the central limit theorem and the special properties of '
 'the Gaussian function.')
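Generated answers can contain factual claims that are cheap to verify independently. As one example, the response states that summing two Gaussians with the same standard deviation σ gives a standard deviation of √2 × σ. The sketch below checks this numerically with a seeded simulation; it assumes the two variables are independent, and `std_of_sum` is a hypothetical helper.

```python
import math
import random
import statistics


def std_of_sum(sigma=1.0, n_samples=100_000, seed=0):
    """Estimate the standard deviation of the sum of two independent
    zero-mean Gaussian samples that share the same `sigma`."""
    rng = random.Random(seed)
    sums = [
        rng.gauss(0.0, sigma) + rng.gauss(0.0, sigma) for _ in range(n_samples)
    ]
    return statistics.pstdev(sums)


estimated = std_of_sum(sigma=1.0)
expected = math.sqrt(2) * 1.0
print(f"estimated={estimated:.3f}, expected={expected:.3f}")
```

The estimate should land close to √2 because the variances of independent variables add: σ² + σ² = 2σ².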