Zyte Serp Reader¶

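Zyte Serp Reader gives you access to the organic results of a Google search. Given a query string, it returns the URLs of the top search results along with the text string associated with each result.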

# %pip install llama-index llama-index-readers-zyte-serp llama-index-readers-web

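In this notebook we show how the Zyte Serp Reader (together with the web reader) can be used to collect information about a particular topic. Once these documents are fetched, we can run queries on that topic.

The government of Ireland recently announced Budget 2025, and here we show how to query information about that budget. First we use the Zyte Serp Reader to find relevant pages, then we use the web reader to extract content from those URLs, and finally we answer a query with an OpenAI ChatGPT model.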

import os
from llama_index.readers.zyte_serp import ZyteSerpReader
from llama_index.readers.web.zyte_web.base import ZyteWebReader

# This is needed to run it in a Jupyter notebook
# import nest_asyncio
# nest_asyncio.apply()

zyte_api_key = os.environ.get("ZYTE_API_KEY")
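
Because `os.environ.get` returns `None` when the variable is missing, a forgotten key only surfaces later as an API error. A minimal sketch of an earlier check is shown below; `require_env` is a hypothetical helper introduced for illustration, not part of llama-index or the Zyte readers.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    # Hypothetical helper: fail fast with a readable message instead of passing None on.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the notebook."
        )
    return value
```

With this in place, the cell above could read `zyte_api_key = require_env("ZYTE_API_KEY")`.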

Fetching relevant resources (using ZyteSerp)¶

Given a topic, we use Google search results to get links to the relevant pages.

topic = "Ireland Budget 2025"

serp_reader = ZyteSerpReader(api_key=zyte_api_key)

search_results = serp_reader.load_data(topic)

len(search_results)

for r in search_results[:4]:
    print(r.text)
    print(r.metadata)

https://www.gov.ie/en/publication/e8315-budget-2025/
{'name': 'Budget 2025', 'rank': 1}
https://www.citizensinformation.ie/en/money-and-tax/budgets/budget-2025/
{'name': 'Budget 2025', 'rank': 2}
https://www.gov.ie/en/publication/cb193-your-guide-to-budget-2025/
{'name': 'Your guide to Budget 2025', 'rank': 3}
https://www.irishtimes.com/your-money/2024/10/01/budget-2025-ireland-main-points/
{'name': 'Budget 2025 main points: Energy credits, bonus welfare ...', 'rank': 4}

urls = [r.text for r in search_results]

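We now have a list of URLs relevant to our topic ("Ireland Budget 2025"). The metadata also shows the name and rank of each search result entry. Next, we use the web reader to fetch the content of these pages.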
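
Since each result carries a `rank` in its metadata, the URL list can be trimmed to the top few entries before any pages are fetched. The following is a minimal sketch under the assumption that every result exposes `.text` (the URL) and `.metadata["rank"]`, as in the output above; `top_urls` and its parameter `k` are hypothetical names introduced here.

```python
from typing import Any, Iterable, List


def top_urls(results: Iterable[Any], k: int = 5) -> List[str]:
    """Return up to k unique URLs, ordered by the 'rank' metadata field."""
    # Results without a rank sort last.
    ordered = sorted(results, key=lambda r: r.metadata.get("rank", float("inf")))
    seen = set()
    urls: List[str] = []
    for r in ordered:
        if len(urls) >= k:
            break
        url = r.text.strip()
        # Skip empty entries and URLs that were already collected.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

For example, `urls = top_urls(search_results, k=5)` would limit the later fetch to the five highest-ranked pages.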

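Fetching topic content¶

Given the URLs of web pages that contain information about the topic, we fetch their content. Because web pages carry a lot of irrelevant material, we can get filtered content by using ZyteWebReader's "article" mode, which returns only the article content of each page.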

web_reader = ZyteWebReader(api_key=zyte_api_key, mode="article")
documents = web_reader.load_data(urls)

print(documents[0].text[:200])

Budget 2025 - Tax Highlights Ireland

Budget 2025 announced on 1 October 2024 included a substantial "cost-of-living" package including many one-off payments, as well as outlining a framework to direc
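
Previewing one document is a useful spot check, but it says nothing about the rest of the batch. As a sketch, assuming that some pages may yield little or no article text, the documents could be filtered by length before indexing. `drop_short_documents` is a hypothetical helper, and the 200-character threshold is an arbitrary illustrative value.

```python
from typing import Any, Iterable, List


def drop_short_documents(documents: Iterable[Any], min_chars: int = 200) -> List[Any]:
    """Keep only documents whose stripped text has at least min_chars characters."""
    # Treat a missing text attribute value (None) as an empty string.
    return [d for d in documents if len((d.text or "").strip()) >= min_chars]
```

Usage would be `documents = drop_short_documents(documents)`, run after `web_reader.load_data(urls)`.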

len(documents)

Query engine¶

Here we run a very basic query using VectorStoreIndex. Make sure the OPENAI_API_KEY environment variable is set before running the code below.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query(
    "What kind of energy credits are provided in the budget?"
)
print(response)

Two €125 electricity credits will be provided - one this year and one in 2025.