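Multi-Modal LLM using Google's Gemini model for image understanding and building Retrieval Augmented Generation with LlamaIndex¶
In this notebook, we show how to use Google's Gemini Vision models for image understanding.
First, we show several functions that are now supported for Gemini:
- complete (both sync and async): for a single prompt and a list of images
- chat (both sync and async): for multiple chat messages
- stream complete (both sync and async): for streaming the output of complete
- stream chat (both sync and async): for streaming the output of chat
In the second part of this notebook, we use Gemini + Pydantic to parse structured information from images of Google Maps.
- Define the desired Pydantic class with attribute fields
- Let the gemini-pro-vision model understand each image and output a structured result
In the third part of this notebook, we use Gemini and LlamaIndex to build a simple Retrieval Augmented Generation flow for a small Google Maps restaurant dataset.
- Build a vector index from the structured outputs of step 2
- Use the gemini-pro model to synthesize the results and recommend restaurants based on the user query
Note: google-generativeai is only available in certain countries and regions.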
%pip install llama-index-multi-modal-llms-gemini
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-gemini
%pip install llama-index-llms-gemini
!pip install llama-index 'google-generativeai>=0.3.0' matplotlib qdrant_client
Use Gemini to understand images from URLs¶
%env GOOGLE_API_KEY=...
import os
GOOGLE_API_KEY = "" # add your GOOGLE API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
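Hardcoding the key in a notebook cell makes it easy to leak when the notebook is shared. The following is a minimal sketch of reading the key from the environment and failing early when it is missing. The helper name `resolve_api_key` is hypothetical and not part of LlamaIndex or the Google SDK; the demonstration uses an in-memory mapping rather than the real environment.

```python
import os


def resolve_api_key(env=None, var_name="GOOGLE_API_KEY"):
    """Return the key stored under var_name, or raise if it is missing or blank.

    `resolve_api_key` is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the notebook"
        )
    return key


# Demonstration with an in-memory mapping instead of the real environment.
demo_env = {"GOOGLE_API_KEY": "  example-key  "}
print(resolve_api_key(demo_env))
```

In the notebook itself, calling `resolve_api_key()` with no arguments would read from `os.environ`, so the key can be exported in the shell instead of being pasted into a cell.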
Initialize GeminiMultiModal and load images from URLs¶
from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage, ImageBlock
image_urls = [
    "https://storage.googleapis.com/generativeai-downloads/data/scene.jpg",
    # Add yours here!
]
gemini_pro = Gemini(model_name="models/gemini-1.5-flash")
msg = ChatMessage("Identify the city where this photo was taken.")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)
https://storage.googleapis.com/generativeai-downloads/data/scene.jpg
Out[ ]
<matplotlib.image.AxesImage at 0x128032e40>
Chat using images in the prompt¶
response = gemini_pro.chat(messages=[msg])
print(response.message.content)
That's New York City. More specifically, the photo shows a street in the **SoHo** neighborhood. The distinctive cast-iron architecture and the pedestrian bridge are characteristic of that area.
Stream chat with images¶
stream_response = gemini_pro.stream_chat(messages=[msg])
import time
for r in stream_response:
    print(r.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    time.sleep(0.5)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
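The loop above prints each `delta` as it arrives but does not keep the full answer. As a rough sketch of how the pieces can be accumulated while still printing them, the example below uses a hypothetical `FakeChunk` class and `fake_stream` generator as stand-ins for the real streamed response, so it runs without an API key. Only the `delta` attribute is taken from the notebook.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed response chunk."""

    delta: str


def fake_stream(text: str, size: int = 8) -> Iterator[FakeChunk]:
    """Yield the text in fixed-size pieces, mimicking a streaming response."""
    for start in range(0, len(text), size):
        yield FakeChunk(delta=text[start:start + size])


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        piece = chunk.delta or ""  # tolerate an empty delta
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream("That's New York City."))
```

Assuming the real chunks expose `delta` the same way, `collect_stream(stream_response)` would work unchanged with the generator returned by `stream_chat`.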
Async support¶
response_achat = await gemini_pro.achat(messages=[msg])
print(response_achat.message.content)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
Let's see how to stream asynchronously.
import asyncio
streaming_handler = await gemini_pro.astream_chat(messages=[msg])
async for chunk in streaming_handler:
    print(chunk.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    await asyncio.sleep(0.5)
That's New York City. More specifically, the photo was taken in the **West Village** neighborhood of Manhattan. The distinctive architecture and the pedestrian bridge are strong clues.
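One practical reason to use the async variants is that several requests can be in flight at once. The sketch below illustrates the pattern with `asyncio.gather` over two async streams. `FakeChunk`, `fake_astream`, and `collect` are hypothetical stand-ins so that the example runs offline; they are not LlamaIndex APIs.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed response chunk."""

    delta: str


async def fake_astream(text: str, size: int = 8) -> AsyncIterator[FakeChunk]:
    """Asynchronously yield the text in fixed-size pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield FakeChunk(delta=text[start:start + size])


async def collect(text: str) -> str:
    """Consume one async stream and return the concatenated deltas."""
    parts: List[str] = []
    async for chunk in fake_astream(text):
        parts.append(chunk.delta)
    return "".join(parts)


async def main() -> List[str]:
    # gather runs both coroutines concurrently and keeps argument order
    return await asyncio.gather(
        collect("first prompt answer"),
        collect("second prompt answer"),
    )


answers = asyncio.run(main())
print(answers)
```

Inside a notebook an event loop is already running, so `answers = await main()` would replace the `asyncio.run(main())` call.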
Complete with two images¶
image_urls = [
    "https://picsum.photos/id/1/200/300",
    "https://picsum.photos/id/26/200/300",
]
msg = ChatMessage("Is there any relationship between these images?")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))
response_multi = gemini_pro.chat(messages=[msg])
print(response_multi.message.content)
Yes, there is a relationship between the two images. Both images depict aspects of a **professional or business-casual lifestyle**. * **Image 1:** Shows someone working on a laptop, suggesting remote work, freelancing, or a business-related task. * **Image 2:** Shows a flat lay of accessories commonly associated with a professional or stylish individual: sunglasses, a bow tie, a pen, a watch, glasses, and a phone. These items suggest a certain level of personal style and preparedness often associated with business or professional settings. The connection is indirect but thematic. They both visually represent elements of a similar lifestyle or persona.
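2nd Part: Gemini + Pydantic for structured output parsing from images¶
- Leverage Gemini for image reasoning
- Use a Pydantic program to generate structured output from Gemini's image reasoning results
Download example images for Gemini to understand¶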
from pathlib import Path
input_image_path = Path("google_restaurants")
if not input_image_path.exists():
    Path.mkdir(input_image_path)
!curl -sL "https://docs.google.com/uc?export=download&id=1Pg04p6ss0FlBgz00noHAOAJ1EYXiosKg" -o ./google_restaurants/miami.png
!curl -sL "https://docs.google.com/uc?export=download&id=1dYZy17bD6pSsEyACXx9fRMNx93ok-kTJ" -o ./google_restaurants/orlando.png
!curl -sL "https://docs.google.com/uc?export=download&id=1ShPnYVc1iL_TA1t7ErCFEAHT74-qvMrn" -o ./google_restaurants/sf.png
!curl -sL "https://docs.google.com/uc?export=download&id=1WjISWnatHjwL4z5VD_9o09ORWhRJuYqm" -o ./google_restaurants/toronto.png
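Because `curl -s` is silent, a failed download (for example an HTML error page saved under a `.png` name) may go unnoticed until the image is opened later. The following sketch checks each file for the standard 8-byte PNG signature. The helper names `looks_like_png` and `invalid_pngs` are hypothetical, and the demonstration uses a temporary folder with synthetic files rather than the real downloads.

```python
import tempfile
from pathlib import Path

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path) -> bool:
    """Return True if the file begins with the PNG signature."""
    with open(path, "rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def invalid_pngs(folder) -> list:
    """Return the sorted names of *.png files that lack the PNG signature."""
    return sorted(
        p.name for p in Path(folder).glob("*.png") if not looks_like_png(p)
    )


# Demonstration on synthetic files: one valid header, one HTML error page.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.png").write_bytes(PNG_SIGNATURE + b"rest of file")
    (Path(tmp) / "bad.png").write_bytes(b"<html>error</html>")
    bad = invalid_pngs(tmp)
print(bad)
```

In the notebook, `invalid_pngs("./google_restaurants")` returning an empty list would suggest that all four files at least start like PNG images; it does not prove they are complete.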
Define the Pydantic class for the structured parser¶
from pydantic import BaseModel
from PIL import Image
import matplotlib.pyplot as plt
class GoogleRestaurant(BaseModel):
    """Data model for a Google Restaurant."""
    restaurant: str
    food: str
    location: str
    category: str
    hours: str
    price: str
    rating: float
    review: str
    description: str
    nearby_tourist_places: str
google_image_url = "./google_restaurants/miami.png"
image = Image.open(google_image_url).convert("RGB")
plt.figure(figsize=(16, 5))
plt.imshow(image)
Out[ ]
<matplotlib.image.AxesImage at 0x10953cce0>
Call the Pydantic program and generate structured output¶
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
prompt_template_str = """\
    can you summarize what is in the image\
    and return the answer with json format \
"""
def pydantic_gemini(
    model_name, output_class, image_documents, prompt_template_str
):
    gemini_llm = GeminiMultiModal(model_name=model_name)
    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=gemini_llm,
        verbose=True,
    )
    response = llm_program()
    return response
Generate the Pydantic structured output via the Gemini Vision model¶
from llama_index.core import SimpleDirectoryReader
google_image_documents = SimpleDirectoryReader(
    "./google_restaurants"
).load_data()
results = []
for img_doc in google_image_documents:
    pydantic_response = pydantic_gemini(
        "models/gemini-1.5-flash",
        GoogleRestaurant,
        [img_doc],
        prompt_template_str,
    )
    # only output the results for miami for example along with image
    if "miami" in img_doc.image_path:
        for r in pydantic_response:
            print(r)
    results.append(pydantic_response)
> Raw output: ```json
{
  "restaurant": "La Mar by Gaston Acurio",
  "food": "Peruvian & fusion",
  "location": "500 Brickell Key Dr, Miami, FL 33131",
  "category": "South American restaurant",
  "hours": "Opens 6PM, Closes 11 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "description": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "nearby_tourist_places": "Brickell Key area with scenic views"
}
```
('restaurant', 'La Mar by Gaston Acurio')
('food', 'Peruvian & fusion')
('location', '500 Brickell Key Dr, Miami, FL 33131')
('category', 'South American restaurant')
('hours', 'Opens 6PM, Closes 11 PM')
('price', '$$$')
('rating', 4.4)
('review', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('description', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('nearby_tourist_places', 'Brickell Key area with scenic views')
> Raw output: ```json
{
  "restaurant": "Mythos Restaurant",
  "food": "American fare in a mythic underwater themed spot",
  "location": "6000 Universal Blvd, Orlando, FL 32819, United States",
  "category": "Restaurant",
  "hours": "Open: Closes in 7 hrs, Islands of Adventure",
  "price": "$$",
  "rating": 4.3,
  "review": "Overlooking Universal Studios/Island sea, this mythic underwater themed spot serves American fare.",
  "description": "Dine-in, Delivery",
  "nearby_tourist_places": "Universal Islands, Jurassic Park River Adventure"
}
```
> Raw output: ```json
{
  "restaurant": "Sam's Grill & Seafood Restaurant",
  "food": "Seafood",
  "location": "374 Bush St, San Francisco, CA 94104, United States",
  "category": "Seafood Restaurant",
  "hours": "Open ⋅ Closes 8:30 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "description": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "nearby_tourist_places": "Chinatown, San Francisco"
}
```
> Raw output: ```json
{
  "restaurant": "Lobster Port",
  "food": "Seafood restaurant offering lobster, dim sum & Asian fusion dishes",
  "location": "8432 Leslie St, Thornhill, ON L3T 7M6",
  "category": "Seafood",
  "hours": "Open 10pm",
  "price": "$$",
  "rating": 4.0,
  "review": "Elegant, lively venue with a banquet-hall setup",
  "description": "Elegant, lively venue with a banquet-hall setup offering lobster, dim sum & Asian fusion dishes.",
  "nearby_tourist_places": "Nearby tourist places are not explicitly listed in the image but the map shows various points of interest in the surrounding area."
}
```
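The raw outputs above wrap the JSON in a Markdown fence, which has to be stripped before the payload can be validated against a schema. The sketch below shows that idea with the standard library only; it is not how `PydanticOutputParser` is implemented. `RestaurantRecord` is a hypothetical, simplified stand-in for `GoogleRestaurant` with three of its fields, and `parse_raw_output` is an illustrative helper name.

```python
import json
import re
from dataclasses import dataclass

FENCE = "`" * 3  # three backticks, built indirectly to keep this block readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


@dataclass
class RestaurantRecord:
    """Hypothetical, simplified stand-in for GoogleRestaurant."""

    restaurant: str
    price: str
    rating: float


# Converter applied to each required field.
CONVERTERS = {"restaurant": str, "price": str, "rating": float}


def parse_raw_output(raw: str) -> RestaurantRecord:
    """Strip an optional Markdown fence, parse the JSON, and coerce fields."""
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw
    data = json.loads(payload)
    kwargs = {}
    for name, convert in CONVERTERS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        kwargs[name] = convert(data[name])
    return RestaurantRecord(**kwargs)


raw = (
    "> Raw output: " + FENCE + "json\n"
    '{"restaurant": "Lobster Port", "price": "$$", "rating": 4.0}\n' + FENCE
)
record = parse_raw_output(raw)
print(record)
```

Extra keys in the JSON are ignored in this sketch, while a missing required key raises `ValueError`, which makes an incomplete model answer visible instead of silently producing a partial record.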
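Observation:
- Gemini generated all the meta information required by the Pydantic class
- It could also recognize the nearby park from Google Maps
3rd Part: Build multi-modal RAG for restaurant recommendation¶
Our stack consists of Gemini + LlamaIndex + Pydantic structured output capabilities.
Construct text nodes for building the vector store. Store the metadata and description for each restaurant.¶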
from llama_index.core.schema import TextNode
nodes = []
for res in results:
    text_node = TextNode()
    metadata = {}
    for r in res:
        # set description as text of TextNode
        if r[0] == "description":
            text_node.text = r[1]
        else:
            metadata[r[0]] = r[1]
    text_node.metadata = metadata
    nodes.append(text_node)
Use Gemini Embedding to build the vector store for dense retrieval. Index restaurants as nodes into the vector store¶
from llama_index.core import VectorStoreIndex, StorageContext, Settings
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_gemini_3")
vector_store = QdrantVectorStore(client=client, collection_name="collection")
# Use Gemini as the embedding model and as the LLM
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)
Use Gemini to synthesize the results and recommend restaurants to the user¶
query_engine = index.as_query_engine(
    similarity_top_k=1,
)
response = query_engine.query(
    "recommend a Orlando restaurant for me and its nearby tourist places"
)
print(response)
For a delightful dining experience, I recommend Mythos Restaurant, known for its American cuisine and unique underwater theme. Overlooking Universal Studios' Inland Sea, this restaurant offers a captivating ambiance. After your meal, explore the nearby tourist attractions such as Universal's Islands of Adventure, Skull Island: Reign of Kong, The Wizarding World of Harry Potter, Jurassic Park River Adventure, and Hollywood Rip Ride Rockit, all located near Mythos Restaurant.
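With `similarity_top_k=1`, only the single closest node is passed to the LLM for synthesis. The sketch below illustrates what dense top-k retrieval does conceptually, assuming cosine similarity as the scoring function. The three-dimensional vectors are made up for illustration and are not real Gemini embeddings, and `cosine` and `top_k` are hypothetical helpers rather than Qdrant or LlamaIndex calls.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=1):
    """Return the names of the k entries most similar to query_vec."""
    scored = [(cosine(query_vec, vec), name) for name, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up toy vectors; real embeddings have far more dimensions.
toy_index = {
    "Mythos Restaurant": [0.9, 0.1, 0.0],
    "Lobster Port": [0.1, 0.9, 0.1],
    "La Mar by Gaston Acurio": [0.2, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=1))
```

Raising `k` would hand more candidates to the synthesis step, which can help when a query could match several restaurants, at the cost of a longer prompt.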