Multi-Modal LLM: Using Google's Gemini Models for Image Understanding and Building Retrieval-Augmented Generation with LlamaIndex¶

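In this notebook, we show how to use Google's Gemini Vision models for image understanding.

First, we show several functions that are now supported for Gemini:

- complete (both sync and async): for a single prompt and a list of images
- chat (both sync and async): for multiple chat messages
- stream complete (both sync and async): for streaming the output of complete
- stream chat (both sync and async): for streaming the output of chat

In the second part of this notebook, we use Gemini + Pydantic to parse structured information from images of Google Maps listings.

- Define the desired Pydantic class with attribute fields
- Have a Gemini vision model (models/gemini-1.5-flash in the code below) understand each image and output a structured result

In the third part of this notebook, we use Gemini & LlamaIndex to build a simple Retrieval-Augmented Generation flow over a small Google Maps restaurant dataset.

- Build a vector index from the structured outputs of part 2
- Use a Gemini LLM to synthesize the results and recommend restaurants based on the user query

Note: google-generativeai is only available in certain countries and regions.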

%pip install llama-index-multi-modal-llms-gemini
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-gemini
%pip install llama-index-llms-gemini

!pip install llama-index 'google-generativeai>=0.3.0' matplotlib qdrant_client

Use Gemini to understand images from URLs¶

%env GOOGLE_API_KEY=...

import os

# Reuse the key set with %env above; otherwise paste your Google API key
# in place of the empty-string fallback.
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
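
Before calling the API, it can help to fail fast when the key is missing. The following is a minimal sketch that is not part of the original notebook: `require_api_key` is a hypothetical helper name, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def require_api_key(name="GOOGLE_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or use %env first")
    return key
```

Calling `require_api_key()` once before constructing the model turns a missing key into an immediate, readable error instead of a failure in the first request.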

Initialize `Gemini` and load images from URLs¶

from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage, ImageBlock


image_urls = [
    "https://storage.googleapis.com/generativeai-downloads/data/scene.jpg",
    # Add yours here!
]
gemini_pro = Gemini(model_name="models/gemini-1.5-flash")
msg = ChatMessage("Identify the city where this photo was taken.")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)

https://storage.googleapis.com/generativeai-downloads/data/scene.jpg

<matplotlib.image.AxesImage at 0x128032e40>

Chat with an image in the prompt¶

response = gemini_pro.chat(messages=[msg])

print(response.message.content)

That's New York City.  More specifically, the photo shows a street in the **SoHo** neighborhood.  The distinctive cast-iron architecture and the pedestrian bridge are characteristic of that area.

Stream chat with images¶

stream_response = gemini_pro.stream_chat(messages=[msg])

import time

for r in stream_response:
    print(r.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    time.sleep(0.5)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.
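
The streaming loop above prints each `delta` as it arrives. As a rough illustration of that pattern without calling the API, the sketch below uses a stand-in generator; `FakeChunk`, `fake_stream`, and `collect` are hypothetical names, not LlamaIndex classes, and only the `delta` attribute is modeled.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class FakeChunk:
    """Stand-in for one streamed response; only `delta` is modeled."""

    delta: str


def fake_stream(text: str, size: int = 8) -> Iterator[FakeChunk]:
    """Yield `text` in pieces of at most `size` characters."""
    for start in range(0, len(text), size):
        yield FakeChunk(delta=text[start:start + size])


def collect(stream: Iterator[FakeChunk]) -> str:
    """Consume a stream and join the deltas into the full reply."""
    parts: List[str] = []
    for chunk in stream:
        parts.append(chunk.delta)  # each delta holds only the new text
    return "".join(parts)


full = collect(fake_stream("That's New York City."))
print(full)
```

Joining the deltas reproduces the complete reply, which is why the loop above can print with `end=""` and still produce readable text.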

Async support¶

response_achat = await gemini_pro.achat(messages=[msg])

print(response_achat.message.content)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.

Let's see how to stream asynchronously.

import asyncio

streaming_handler = await gemini_pro.astream_chat(messages=[msg])
async for chunk in streaming_handler:
    print(chunk.delta, end="")
    # Add an artificial wait to make streaming visible in the notebook
    await asyncio.sleep(0.5)

That's New York City.  More specifically, the photo was taken in the **West Village** neighborhood of Manhattan.  The distinctive architecture and the pedestrian bridge are strong clues.
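
The asynchronous loop follows the same shape with `async for`. The sketch below is a stand-in that runs without the API; `fake_astream`, `collect_async`, and `run_demo` are hypothetical names, and the stream yields plain strings rather than LlamaIndex response objects.

```python
import asyncio
from typing import AsyncIterator


async def fake_astream(text: str, size: int = 8) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream: yields text deltas."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + size]


async def collect_async(stream: AsyncIterator[str]) -> str:
    """Join deltas from an async stream into the full reply."""
    parts = []
    async for delta in stream:
        parts.append(delta)
    return "".join(parts)


def run_demo(text: str) -> str:
    """Run the async collection from synchronous code."""
    return asyncio.run(collect_async(fake_astream(text)))
```

Inside a notebook an event loop is already running, so `await collect_async(fake_astream(text))` would be used directly there; `asyncio.run` is for plain scripts.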

Complete with two images¶

image_urls = [
    "https://picsum.photos/id/1/200/300",
    "https://picsum.photos/id/26/200/300",
]

msg = ChatMessage("Is there any relationship between these images?")
for img_url in image_urls:
    msg.blocks.append(ImageBlock(url=img_url))

response_multi = gemini_pro.chat(messages=[msg])

print(response_multi.message.content)

Yes, there is a relationship between the two images.  Both images depict aspects of a **professional or business-casual lifestyle**.

* **Image 1:** Shows someone working on a laptop, suggesting remote work, freelancing, or a business-related task.

* **Image 2:** Shows a flat lay of accessories commonly associated with a professional or stylish individual: sunglasses, a bow tie, a pen, a watch, glasses, and a phone.  These items suggest a certain level of personal style and preparedness often associated with business or professional settings.

The connection is indirect but thematic.  They both visually represent elements of a similar lifestyle or persona.

Part 2: `Gemini` + `Pydantic` for structured output parsing from images¶

- Leverage Gemini for image reasoning
- Use a Pydantic program to generate structured output from Gemini's image reasoning results

Download example images for Gemini to understand¶

from pathlib import Path

# Create the download directory; exist_ok avoids an error on re-runs.
input_image_path = Path("google_restaurants")
input_image_path.mkdir(exist_ok=True)

!curl -sL "https://docs.google.com/uc?export=download&id=1Pg04p6ss0FlBgz00noHAOAJ1EYXiosKg" -o ./google_restaurants/miami.png
!curl -sL "https://docs.google.com/uc?export=download&id=1dYZy17bD6pSsEyACXx9fRMNx93ok-kTJ" -o ./google_restaurants/orlando.png
!curl -sL "https://docs.google.com/uc?export=download&id=1ShPnYVc1iL_TA1t7ErCFEAHT74-qvMrn" -o ./google_restaurants/sf.png
!curl -sL "https://docs.google.com/uc?export=download&id=1WjISWnatHjwL4z5VD_9o09ORWhRJuYqm" -o ./google_restaurants/toronto.png
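
Because `curl -s` suppresses error output, a failed download can leave behind an HTML error page saved under a `.png` name. Checking the PNG signature catches that case. This sketch is not in the original notebook; `is_png` and `check_downloads` are hypothetical helpers.

```python
from pathlib import Path

# The first eight bytes of every PNG file (fixed by the PNG specification).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path) -> bool:
    """Return True if the file at `path` starts with the PNG signature."""
    with open(path, "rb") as fh:
        return fh.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def check_downloads(folder="google_restaurants"):
    """Return the names of .png files in `folder` that are not real PNGs."""
    return sorted(p.name for p in Path(folder).glob("*.png") if not is_png(p))
```

An empty list from `check_downloads()` suggests all four images arrived intact; any name it returns is worth downloading again before running the parsing step.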

Define the Pydantic class for the structured parser¶

from pydantic import BaseModel
from PIL import Image
import matplotlib.pyplot as plt


class GoogleRestaurant(BaseModel):
    """Data model for a Google Restaurant."""

    restaurant: str
    food: str
    location: str
    category: str
    hours: str
    price: str
    rating: float
    review: str
    description: str
    nearby_tourist_places: str


google_image_url = "./google_restaurants/miami.png"
image = Image.open(google_image_url).convert("RGB")

plt.figure(figsize=(16, 5))
plt.imshow(image)

<matplotlib.image.AxesImage at 0x10953cce0>

Call the Pydantic program and generate structured output¶

from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

prompt_template_str = """\
    can you summarize what is in the image\
    and return the answer with json format \
"""


def pydantic_gemini(
    model_name, output_class, image_documents, prompt_template_str
):
    gemini_llm = GeminiMultiModal(model_name=model_name)

    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=gemini_llm,
        verbose=True,
    )

    response = llm_program()
    return response
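
The raw model replies printed later in this notebook are JSON objects wrapped in a Markdown code fence, so the text has to be unwrapped and checked before it can populate the data model. The sketch below shows that idea with the standard library only. It is not the implementation of `PydanticOutputParser`; `extract_json`, `REQUIRED_FIELDS`, and `FENCE` are hypothetical names.

```python
import json
import re

# Field names copied from the GoogleRestaurant model defined earlier.
REQUIRED_FIELDS = (
    "restaurant", "food", "location", "category", "hours",
    "price", "rating", "review", "description", "nearby_tourist_places",
)

FENCE = "`" * 3  # built indirectly so this example can itself sit in a fence


def extract_json(raw: str) -> dict:
    """Strip an optional Markdown fence, parse the JSON, and check the keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw
    data = json.loads(payload)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    data["rating"] = float(data["rating"])  # the model declares rating as float
    return data
```

A Pydantic model additionally validates the type of every field, which is why the notebook delegates this step to the output parser instead of hand-written checks.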

Generate the Pydantic structured output via the Gemini Vision model¶

from llama_index.core import SimpleDirectoryReader

google_image_documents = SimpleDirectoryReader(
    "./google_restaurants"
).load_data()

results = []
for img_doc in google_image_documents:
    pydantic_response = pydantic_gemini(
        "models/gemini-1.5-flash",
        GoogleRestaurant,
        [img_doc],
        prompt_template_str,
    )
    # As an example, print only the parsed fields for the Miami image
    if "miami" in img_doc.image_path:
        for r in pydantic_response:
            print(r)
    results.append(pydantic_response)

> Raw output: ```json
{
  "restaurant": "La Mar by Gaston Acurio",
  "food": "Peruvian & fusion",
  "location": "500 Brickell Key Dr, Miami, FL 33131",
  "category": "South American restaurant",
  "hours": "Opens 6PM, Closes 11 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "description": "Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.",
  "nearby_tourist_places": "Brickell Key area with scenic views"
}
```

('restaurant', 'La Mar by Gaston Acurio')
('food', 'Peruvian & fusion')
('location', '500 Brickell Key Dr, Miami, FL 33131')
('category', 'South American restaurant')
('hours', 'Opens 6PM, Closes 11 PM')
('price', '$$$')
('rating', 4.4)
('review', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('description', 'Chic waterfront offering Peruvian & fusion fare, plus bars for cocktails, ceviche & anticuchos.')
('nearby_tourist_places', 'Brickell Key area with scenic views')
> Raw output: ```json
{
  "restaurant": "Mythos Restaurant",
  "food": "American fare in a mythic underwater themed spot",
  "location": "6000 Universal Blvd, Orlando, FL 32819, United States",
  "category": "Restaurant",
  "hours": "Open: Closes in 7 hrs, Islands of Adventure",
  "price": "$$",
  "rating": 4.3,
  "review": "Overlooking Universal Studios/Island sea, this mythic underwater themed spot serves American fare.",
  "description": "Dine-in, Delivery",
  "nearby_tourist_places": "Universal Islands, Jurassic Park River Adventure"
}
```

> Raw output: ```json
{
  "restaurant": "Sam's Grill & Seafood Restaurant",
  "food": "Seafood",
  "location": "374 Bush St, San Francisco, CA 94104, United States",
  "category": "Seafood Restaurant",
  "hours": "Open ⋅ Closes 8:30 PM",
  "price": "$$$",
  "rating": 4.4,
  "review": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "description": "Modern spin-off adjacent Sam's Grill, for seafood, drinks & happy hour loungey digs with a patio.",
  "nearby_tourist_places": "Chinatown, San Francisco"
}
```
> Raw output: ```json
{
  "restaurant": "Lobster Port",
  "food": "Seafood restaurant offering lobster, dim sum & Asian fusion dishes",
  "location": "8432 Leslie St, Thornhill, ON L3T 7M6",
  "category": "Seafood",
  "hours": "Open 10pm",
  "price": "$$",
  "rating": 4.0,
  "review": "Elegant, lively venue with a banquet-hall setup",
  "description": "Elegant, lively venue with a banquet-hall setup offering lobster, dim sum & Asian fusion dishes.",
  "nearby_tourist_places": "Nearby tourist places are not explicitly listed in the image but the map shows various points of interest in the surrounding area."
}
```

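Observations:

- Gemini filled in every field of the Pydantic class for all four images
- It also picked out nearby attractions from the Google Maps screenshots, such as the theme park in the Orlando image, although for the Toronto image it reported that none were explicitly listed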

Part 3: Build multi-modal RAG for restaurant recommendation¶

Our stack consists of Gemini + LlamaIndex + Pydantic structured output capabilities.

Construct text nodes for building the vector store, storing the metadata and description of each restaurant¶

from llama_index.core.schema import TextNode

nodes = []
for res in results:
    text_node = TextNode()
    metadata = {}
    for r in res:
        # set description as text of TextNode
        if r[0] == "description":
            text_node.text = r[1]
        else:
            metadata[r[0]] = r[1]
    text_node.metadata = metadata
    nodes.append(text_node)
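
The loop above puts the `description` into the node text and every other field into the metadata. The same split on a plain dictionary looks like this; `split_record` is a hypothetical helper, and the sample values are taken from the Orlando output above.

```python
def split_record(record: dict, text_field: str = "description"):
    """Return (text, metadata): one field becomes text, the rest metadata."""
    metadata = {k: v for k, v in record.items() if k != text_field}
    return record.get(text_field, ""), metadata


sample = {
    "restaurant": "Mythos Restaurant",
    "rating": 4.3,
    "description": "Dine-in, Delivery",
}
text, metadata = split_record(sample)
print(text)      # Dine-in, Delivery
print(metadata)  # {'restaurant': 'Mythos Restaurant', 'rating': 4.3}
```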

Use Gemini Embedding to build the vector store for dense retrieval, indexing the restaurants as nodes in the vector store¶

from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client


# Create a local, on-disk Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_gemini_3")

vector_store = QdrantVectorStore(client=client, collection_name="collection")

# Use Gemini for both the embedding model and the LLM
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embed the restaurant nodes and index them in Qdrant
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)
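
Dense retrieval ranks the stored vectors by their similarity to the query vector. A minimal sketch with cosine similarity follows. The three-dimensional vectors are made-up toy values, not Gemini embeddings, and `cosine` and `top_k` are hypothetical helpers rather than the Qdrant or LlamaIndex API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, store, k=1):
    """Return the k (name, score) pairs most similar to `query`."""
    scored = [(name, cosine(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors for illustration only.
store = {
    "miami": [0.9, 0.1, 0.0],
    "orlando": [0.1, 0.9, 0.1],
    "sf": [0.0, 0.2, 0.9],
}
best = top_k([0.2, 0.8, 0.0], store, k=1)
print(best[0][0])  # orlando
```

With `k=1` only the single closest entry is returned, which mirrors a retriever configured to fetch one node per query.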

Use Gemini to synthesize the results and recommend restaurants to the user¶

query_engine = index.as_query_engine(
    similarity_top_k=1,
)

response = query_engine.query(
    "recommend an Orlando restaurant for me and its nearby tourist places"
)
print(response)

For a delightful dining experience, I recommend Mythos Restaurant, known for its American cuisine and unique underwater theme. Overlooking Universal Studios' Inland Sea, this restaurant offers a captivating ambiance. After your meal, explore the nearby tourist attractions such as Universal's Islands of Adventure, Skull Island: Reign of Kong, The Wizarding World of Harry Potter, Jurassic Park River Adventure, and Hollywood Rip Ride Rockit, all located near Mythos Restaurant.
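
After retrieval, the query engine hands the retrieved node to the LLM together with the question. The sketch below shows one plausible way to assemble such a prompt. The template wording and `build_prompt` are assumptions for illustration, not LlamaIndex's actual prompt; the sample values come from the Orlando record above.

```python
def build_prompt(question: str, text: str, metadata: dict) -> str:
    """Combine retrieved metadata and text with the user question."""
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    context = "\n".join(lines + [text])
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the query using only the context above.\n"
        f"Query: {question}\n"
        "Answer: "
    )


prompt = build_prompt(
    "recommend an Orlando restaurant for me and its nearby tourist places",
    "Dine-in, Delivery",
    {
        "restaurant": "Mythos Restaurant",
        "nearby_tourist_places": "Universal Islands, Jurassic Park River Adventure",
    },
)
print(prompt)
```

The answer shown above names attractions that do not appear in the stored record, which suggests the model also drew on its own knowledge when synthesizing the recommendation.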
