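Multimodal RAG for processing videos using OpenAI GPT4V and LanceDB vectorstore

In this notebook, we showcase a multimodal RAG architecture designed for video processing. We use the OpenAI GPT4V multimodal LLM class together with CLIP, which generates the multimodal embeddings, and LanceDBVectorStore for efficient vector storage.

Steps:

Download the video from YouTube, process it, and store it.
Build a multimodal index and vector store for both texts and images.
Retrieve relevant images and context, and use both to augment the prompt.
Use GPT4V to reason about the correlations between the input query and the augmented data, and generate the final response.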

%pip install llama-index-vector-stores-lancedb
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-embeddings-clip

%pip install llama_index ftfy regex tqdm
%pip install -U openai-whisper
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install lancedb
%pip install moviepy
%pip install pytube
%pip install pydub
%pip install SpeechRecognition
%pip install ffmpeg-python
%pip install soundfile

from moviepy.editor import VideoFileClip
from pathlib import Path
import speech_recognition as sr
from pytube import YouTube
from pprint import pprint

import os

OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
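Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead; `load_api_key` is a hypothetical helper introduced here for illustration and is not part of the original notebook or of any library.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key


# Usage with an explicit mapping, so the sketch runs without a real key.
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API client.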

Set configuration for input below

video_url = "https://www.youtube.com/watch?v=d_qvLDhkg00"
output_video_path = "./video_data/"
output_folder = "./mixed_data/"
output_audio_path = "./mixed_data/output_audio.wav"

filepath = output_video_path + "input_vid.mp4"
Path(output_folder).mkdir(parents=True, exist_ok=True)
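The configuration builds paths by string concatenation, which only works while every folder string ends in a slash. As an alternative sketch, `pathlib` can join the same components; `output_text_path` is a name added here for illustration, and no directories are created by this snippet.

```python
from pathlib import Path

output_video_path = Path("./video_data")
output_folder = Path("./mixed_data")

# The / operator joins components with exactly one separator,
# whether or not the folder string carries a trailing slash.
filepath = output_video_path / "input_vid.mp4"
output_audio_path = output_folder / "output_audio.wav"
output_text_path = output_folder / "output_text.txt"  # illustrative name

print(filepath.as_posix())
```

If a library expects a plain string rather than a `Path`, `str(filepath)` converts it.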

Download and process videos into appropriate format for generating/storing embeddings

from PIL import Image
import matplotlib.pyplot as plt
import os


def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(2, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 6:  # the 2x3 grid holds at most six images
                break
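`plot_images` draws into a fixed 2x3 grid. If the number of retrieved images varies, the grid shape could be derived from the count instead. A minimal sketch; `grid_shape` is a hypothetical helper, not part of the notebook.

```python
import math


def grid_shape(n_images, n_cols=3):
    """Return (rows, cols) for a grid wide enough to hold n_images."""
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    return math.ceil(n_images / n_cols), n_cols


print(grid_shape(5))
print(grid_shape(7))
```

The returned pair could be passed to `plt.subplot(rows, cols, index)` in place of the literal `2, 3`.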

def download_video(url, output_path):
    """
    Download a video from a given url and save it to the output path.

    Parameters:
    url (str): The url of the video to download.
    output_path (str): The path to save the video to.

    Returns:
    dict: A dictionary containing the metadata of the video.
    """
    yt = YouTube(url)
    metadata = {"Author": yt.author, "Title": yt.title, "Views": yt.views}
    yt.streams.get_highest_resolution().download(
        output_path=output_path, filename="input_vid.mp4"
    )
    return metadata


def video_to_images(video_path, output_folder):
    """
    Convert a video to a sequence of images and save them to the output folder.

    Parameters:
    video_path (str): The path to the video file.
    output_folder (str): The path to the folder to save the images to.

    """
    clip = VideoFileClip(video_path)
    clip.write_images_sequence(
        os.path.join(output_folder, "frame%04d.png"), fps=0.2
    )


def video_to_audio(video_path, output_audio_path):
    """
    Convert a video to audio and save it to the output path.

    Parameters:
    video_path (str): The path to the video file.
    output_audio_path (str): The path to save the audio to.

    """
    clip = VideoFileClip(video_path)
    audio = clip.audio
    audio.write_audiofile(output_audio_path)


def audio_to_text(audio_path):
    """
    Convert audio to text using the SpeechRecognition library.

    Parameters:
    audio_path (str): The path to the audio file.

    Returns:
    text (str): The text recognized from the audio, or an empty string
        if recognition fails.

    """
    recognizer = sr.Recognizer()
    audio = sr.AudioFile(audio_path)
    text = ""  # fallback so the return below never hits an unbound name

    with audio as source:
        # Record the audio data
        audio_data = recognizer.record(source)

        try:
            # Recognize the speech
            text = recognizer.recognize_whisper(audio_data)
        except sr.UnknownValueError:
            print("Speech recognition could not understand the audio.")
        except sr.RequestError as e:
            print(f"Could not request results from service; {e}")

    return text
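`video_to_images` passes `fps=0.2`, which works out to one saved frame every five seconds of video. The sketch below estimates which timestamps and file names that produces, assuming frames are sampled at evenly spaced times starting at t = 0 and numbered from zero. Both helper names are hypothetical, and the exact sampling may differ between moviepy versions.

```python
import math


def frame_timestamps(duration_s, fps):
    """Timestamps (seconds) of evenly spaced samples in [0, duration_s)."""
    step = 1.0 / fps
    n_frames = math.ceil(duration_s / step)
    return [i * step for i in range(n_frames)]


def frame_names(n_frames, pattern="frame%04d.png"):
    """File names produced by a zero-based, printf-style pattern."""
    return [pattern % i for i in range(n_frames)]


stamps = frame_timestamps(12, fps=0.2)
print(stamps)
print(frame_names(len(stamps)))
```

A lower `fps` keeps the image table small; a higher value captures more of the visuals at the cost of more embeddings to compute and store.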

metadata_vid = download_video(video_url, output_video_path)
video_to_images(filepath, output_folder)
video_to_audio(filepath, output_audio_path)
text_data = audio_to_text(output_audio_path)

# The with block closes the file on exit, so no explicit close() is needed.
with open(output_folder + "output_text.txt", "w") as file:
    file.write(text_data)
print("Text data saved to file")
os.remove(output_audio_path)
print("Audio file removed")

Create the multi-modal index

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.lancedb import LanceDBVectorStore

text_store = LanceDBVectorStore(uri="lancedb", table_name="text_collection")
image_store = LanceDBVectorStore(uri="lancedb", table_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Create the MultiModal index
documents = SimpleDirectoryReader(output_folder).load_data()

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
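After the processing step, `output_folder` holds the PNG frames alongside `output_text.txt`, and the storage context keeps text and image embeddings in separate tables. As a rough illustration of that split (not the library's actual loading logic), the sketch below partitions file names by suffix. `IMAGE_SUFFIXES` and `split_by_modality` are hypothetical names.

```python
from pathlib import Path

# Assumed set of image suffixes, chosen for this sketch only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def split_by_modality(filenames):
    """Partition file names into (image files, text files) by suffix."""
    images, texts = [], []
    for name in filenames:
        if Path(name).suffix.lower() in IMAGE_SUFFIXES:
            images.append(name)
        else:
            texts.append(name)
    return images, texts


files = ["frame0000.png", "output_text.txt", "frame0001.PNG"]
print(split_by_modality(files))
```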

Use index as retriever to fetch top k (5 in this example) results from the multimodal vector index

retriever_engine = index.as_retriever(
    similarity_top_k=5, image_similarity_top_k=5
)
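Conceptually, top-k retrieval ranks stored embeddings by their similarity to the query embedding and keeps the k best. The toy sketch below uses cosine similarity over hand-written two-dimensional vectors. The vectors and names are invented for illustration, and the metric the vector store actually applies depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, items, k):
    """Return the names of the k items most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings, invented for this sketch.
embeddings = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], embeddings, k=2))
```

In the notebook, `similarity_top_k` and `image_similarity_top_k` play the role of `k` for the text and image tables respectively.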

Set the RAG prompt template

import json

metadata_str = json.dumps(metadata_vid)

qa_tmpl_str = (
    "Given the provided information, including relevant images and retrieved context from the video, "
    "accurately and precisely answer the query without any additional prior knowledge.\n"
    "Please ensure honesty and responsibility, refraining from any racist or sexist remarks.\n"
    "---------------------\n"
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str} \n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
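The template is an ordinary Python format string, so it can be filled and inspected before any model call. The sketch below uses a shortened template and placeholder values invented for illustration. It also shows that the braces in the JSON metadata are safe: `str.format` does not re-parse substituted values.

```python
import json

# Shortened stand-in for the full template; placeholder names match the notebook.
template = (
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str}\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Placeholder values, invented for this sketch.
metadata_str = json.dumps({"Author": "example author", "Views": 123})

prompt = template.format(
    context_str="example context",
    metadata_str=metadata_str,
    query_str="example question",
)
print(prompt)
```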

Retrieve most similar text/image embeddings based on user query from the DB

from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode


def retrieve(retriever_engine, query_str):
    retrieval_results = retriever_engine.retrieve(query_str)

    retrieved_image = []
    retrieved_text = []
    for res_node in retrieval_results:
        if isinstance(res_node.node, ImageNode):
            retrieved_image.append(res_node.node.metadata["file_path"])
        else:
            display_source_node(res_node, source_length=200)
            retrieved_text.append(res_node.text)

    return retrieved_image, retrieved_text
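The branching in `retrieve` can be exercised without an index or API key by substituting small stand-in objects. In the sketch below, `StubTextNode`, `StubImageNode`, and `split_results` are hypothetical names that only mimic the attributes the function reads. They are not the library's node classes.

```python
from dataclasses import dataclass, field


@dataclass
class StubTextNode:
    """Stand-in for a retrieved text node (hypothetical)."""
    text: str


@dataclass
class StubImageNode:
    """Stand-in for a retrieved image node (hypothetical)."""
    metadata: dict = field(default_factory=dict)


def split_results(nodes):
    """Separate image file paths from text chunks, preserving order."""
    image_paths, texts = [], []
    for node in nodes:
        if isinstance(node, StubImageNode):
            image_paths.append(node.metadata["file_path"])
        else:
            texts.append(node.text)
    return image_paths, texts


results = [
    StubTextNode("first chunk"),
    StubImageNode({"file_path": "mixed_data/frame0001.png"}),
    StubTextNode("second chunk"),
]
print(split_results(results))
```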

Add query now, fetch relevant details including images and augment the prompt template

query_str = "Using examples from video, explain all things covered in the video regarding the gaussian function"

img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
image_documents = SimpleDirectoryReader(
    input_dir=output_folder, input_files=img
).load_data()
context_str = "".join(txt)
plot_images(img)
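`"".join(txt)` concatenates the retrieved chunks with nothing between them, so the last word of one chunk runs into the first word of the next. A small sketch of the difference a separator makes; the chunk strings are invented, and whether a separator improves the model's answer is an assumption rather than something the notebook measures.

```python
# Example chunks, invented for this sketch.
chunks = ["First retrieved chunk.", "Second retrieved chunk."]

joined_plain = "".join(chunks)          # what the notebook does
joined_separated = "\n\n".join(chunks)  # keeps chunk boundaries visible

print(repr(joined_plain))
print(repr(joined_separated))
```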

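Node ID: bda08ef1-137c-4d69-9bcc-b7005a41a13c
Similarity: 0.7431071996688843
Text: The basic function underlying a normal distribution, also known as a Gaussian, is e to the negative x squared. But you might wonder why this function? Of all the expressions we could dream up that give you s...

Node ID: 7d6d0f32-ce16-461b-be54-883241252e50
Similarity: 0.7335695028305054
Text: This step is actually quite technical and goes beyond what I want to talk about here. It usually uses objects called moment generating functions, which give a very abstract argument...

Node ID: 519fb788-3927-4842-ad5c-88be61deaf65
Similarity: 0.7069740295410156
Text: The essence of what we want to compute is what the convolution between two copies of this function looks like. If you remember, in the last video, we had two different ways to visualize convolutions...

Node ID: f265c3fb-3c9f-4f36-aa2a-fb15efff9783
Similarity: 0.706935465335846
Text: This is the important point. All the parts involving s are now entirely separate from the integration variable. The remaining integral is a little tricky. I did a whole video on it. It...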

Generate final response using GPT4V

from llama_index.multi_modal_llms.openai import OpenAIMultiModal

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o", api_key=OPENAI_API_KEY, max_new_tokens=1500
)


response_1 = openai_mm_llm.complete(
    prompt=qa_tmpl_str.format(
        context_str=context_str, query_str=query_str, metadata_str=metadata_str
    ),
    image_documents=image_documents,
)

pprint(response_1.text)

('The video by 3Blue1Brown, titled "A pretty reason why Gaussian + Gaussian = '
 'Gaussian," covers several aspects of the Gaussian function, also known as '
 "the normal distribution. Here's a summary of the key points discussed in the "
 'video:\n'
 '\n'
 '1. **Central Limit Theorem**: The video begins by discussing the central '
 'limit theorem, which states that the sum of multiple copies of a random '
 'variable tends to look like a normal distribution. As the number of '
 'variables increases, the approximation to a normal distribution becomes '
 'better.\n'
 '\n'
 '2. **Convolution of Random Variables**: The process of adding two random '
 'variables is mathematically represented by a convolution of their respective '
 'distributions. The video explains the concept of convolution and how it is '
 'used to find the distribution of the sum of two random variables.\n'
 '\n'
 '3. **Gaussian Function**: The Gaussian function is more complex than just '
 '\\( e^{-x^2} \\). The full formula includes a scaling factor to ensure the '
 'area under the curve is 1 (making it a valid probability distribution), a '
 'standard deviation parameter \\( \\sigma \\) to describe the spread, and a '
 'mean parameter \\( \\mu \\) to shift the center. However, the video focuses '
 'on centered distributions with \\( \\mu = 0 \\).\n'
 '\n'
 '4. **Visualizing Convolution**: The video presents a visual method to '
 'understand the convolution of two Gaussian functions using diagonal slices '
 'on the xy-plane. This method involves looking at the probability density of '
 'landing on a point (x, y) as \\( f(x) \\times g(y) \\), where f and g are '
 'the two distributions being convolved.\n'
 '\n'
 '5. **Rotational Symmetry**: A key property of the Gaussian function is its '
 'rotational symmetry, which is unique to bell curves. This symmetry is '
 'exploited in the video to simplify the calculation of the convolution. By '
 'rotating the graph 45 degrees, the computation becomes easier because the '
 'integral only involves one variable.\n'
 '\n'
 '6. **Result of Convolution**: The video demonstrates that the convolution of '
 'two Gaussian functions is another Gaussian function. This is a special '
 'property because convolutions typically result in a different kind of '
 'function. The standard deviation of the resulting Gaussian is \\( \\sqrt{2} '
 '\\times \\sigma \\) if the original Gaussians had the same standard '
 'deviation.\n'
 '\n'
 '7. **Proof of Central Limit Theorem**: The video explains that the '
 'convolution of two Gaussians being another Gaussian is a crucial step in '
 'proving the central limit theorem. It shows that the Gaussian function is a '
 'fixed point in the space of distributions, and since all distributions with '
 'finite variance tend towards a single universal shape, that shape must be '
 'the Gaussian.\n'
 '\n'
 '8. **Connection to Pi**: The video also touches on the connection between '
 'the Gaussian function and the number Pi, which appears in the formula for '
 'the normal distribution.\n'
 '\n'
 'The video aims to provide an intuitive geometric argument for why the sum of '
 'two normally distributed random variables is also normally distributed, and '
 'how this relates to the central limit theorem and the special properties of '
 'the Gaussian function.')
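The response above appears as a parenthesized run of quoted fragments because `pprint` wraps a long string into adjacent string literals. Plain `print` renders the newlines instead. A small sketch with a made-up string:

```python
from pprint import pformat

# Made-up text, long enough to exceed pprint's default width of 80.
text = "word " * 30 + "\n" + "more " * 30

wrapped = pformat(text)  # the string that pprint(text) would print
print(wrapped)           # parenthesized, quoted fragments
print(text)              # the text as a reader would see it
```

For Markdown-formatted answers like the one above, `print(response_1.text)` is usually easier to read.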
