Building an Async E-Commerce Web Scraper with Pydantic, Crawl4AI & Gemini

Mary-Kate Olsen

Jan 12, 2025, 06:25 AM


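TL;DR: This guide shows how to build an e-commerce scraper using crawl4ai's LLM-based extraction and Pydantic data models. The scraper asynchronously retrieves product listings (name, price) and detailed product information (specifications, reviews).

The full code is available in a Google Colab notebook.

Tired of the complexity of traditional web scraping for e-commerce data analysis? This tutorial simplifies the process with modern Python tools: crawl4ai for intelligent data extraction and Pydantic for robust data modeling and validation.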

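Why Crawl4AI and Pydantic?

crawl4ai: simplifies web crawling and scraping with LLM-driven extraction.
Pydantic: provides data validation and schema management, so scraped data is structured and checked against the expected types.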

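Why Target Tokopedia?

Tokopedia, a major Indonesian e-commerce platform, serves as our example. (Note: the author is Indonesian and a user of the platform, but is not affiliated with it.) The same principles apply to other e-commerce sites. The approach is useful for developers interested in e-commerce analytics, market research, or automated data collection.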

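What Makes This Approach Different?

Instead of relying on complex CSS selectors or XPath, we use crawl4ai's LLM-based extraction. This provides:

Better resilience to changes in site structure.
Cleaner, more structured data output.
Lower maintenance overhead.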

Setting Up Your Development Environment

First, install the required packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic

To run async code inside a notebook, we also use nest_asyncio:

import crawl4ai
import asyncio
import nest_asyncio
nest_asyncio.apply()

Defining Data Models with Pydantic

We use Pydantic to define the expected data structures. Here are the models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    """One product entry from a search results page."""
    # Required fields: validation fails if either is missing.
    product_name: str = Field(..., description="Product name from listing.")
    product_url: str = Field(..., description="URL to product detail page.")
    # These default to None, so they are typed Optional[str] rather than str.
    price: Optional[str] = Field(None, description="Price displayed in listing.")
    store_name: Optional[str] = Field(None, description="Store name from listing.")
    rating: Optional[str] = Field(None, description="Rating (1-5 scale) from listing.")
    image_url: Optional[str] = Field(None, description="Primary image URL from listing.")

class TokopediaProductDetail(BaseModel):
    """Extended information from a single product detail page."""
    product_name: str = Field(..., description="Product name from detail page.")
    # default_factory gives each instance its own empty list.
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Customer satisfaction percentage.")
    total_ratings: Optional[str] = Field(None, description="Total number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Total number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")

These models act as templates: they enforce validation and double as clear documentation of the expected fields.
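
To make the validation step concrete, here is a minimal sketch of what happens when records are loaded into such a model. The `ListingItem` class is a trimmed-down stand-in for `TokopediaListingItem`, and the sample records and `example.com` URLs are hypothetical; they are not output from the actual scraper.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ListingItem(BaseModel):
    # Trimmed-down stand-in for TokopediaListingItem, for illustration only.
    product_name: str = Field(..., description="Product name from listing.")
    product_url: str = Field(..., description="URL to product detail page.")
    price: Optional[str] = Field(None, description="Price displayed in listing.")


# Hypothetical records shaped like what an extraction step might return.
raw_items = [
    {"product_name": "Wireless Mouse A", "product_url": "https://example.com/a", "price": "Rp150.000"},
    {"product_name": "Wireless Mouse B"},  # missing the required product_url
]

valid, rejected = [], []
for raw in raw_items:
    try:
        valid.append(ListingItem(**raw))
    except ValidationError as exc:
        # Keep the bad record and the reason so it can be inspected later.
        rejected.append((raw, str(exc)))

print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Collecting rejected records instead of raising immediately is one possible design; it lets a long scrape continue while still surfacing malformed items.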

The Scraping Process

The scraper works in two stages:

1. Crawling Product Listings

First, we retrieve the search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    # ... (Code remains the same) ...

2. Fetching Product Details

Next, for each product URL, we fetch the detail page:

async def crawl_tokopedia_detail(product_url: str):
    # ... (Code remains the same) ...

Combining the Stages

Finally, we integrate the two stages:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    # ... (Code remains the same) ...
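
The function bodies are elided here, so the following is only a sketch of how a two-stage pipeline of this shape could be wired together with asyncio. It is not the author's implementation: `fake_listings` and `fake_detail` are hypothetical stubs standing in for the two crawl functions, and `run_pipeline` is an illustrative name.

```python
import asyncio
from typing import Dict, List


async def fake_listings(query: str, max_pages: int) -> List[Dict[str, str]]:
    """Hypothetical stub for the listing stage: three fake items per page."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [
        {"product_name": f"{query} {page}-{i}", "product_url": f"https://example.com/{page}/{i}"}
        for page in range(1, max_pages + 1)
        for i in range(3)
    ]


async def fake_detail(product_url: str) -> Dict[str, str]:
    """Hypothetical stub for the detail stage."""
    await asyncio.sleep(0)
    return {"product_url": product_url, "description": "placeholder"}


async def run_pipeline(query: str, max_pages: int = 2, limit: int = 4) -> List[Dict[str, Dict[str, str]]]:
    # Stage 1: collect listings, then cap how many detail pages are fetched.
    listings = await fake_listings(query, max_pages)
    selected = listings[:limit]
    # Stage 2: fetch all detail pages concurrently; gather preserves input order.
    details = await asyncio.gather(*(fake_detail(item["product_url"]) for item in selected))
    return [{"listing": item, "detail": detail} for item, detail in zip(selected, details)]


results = asyncio.run(run_pipeline("mouse-wireless"))
print(len(results))
```

Capping with `limit` before the second stage keeps the number of detail requests predictable regardless of how many listings the first stage returns.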

Running the Scraper

Here is how to run the scraper:

# nest_asyncio.apply() (called during setup) lets asyncio.run work inside a notebook.
asyncio.run(run_full_scrape(query="mouse-wireless", max_pages=2, limit=15))

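Pro Tips

Rate limiting: respect Tokopedia's servers; add delays between requests when scraping at scale.
Caching: enable crawl4ai's cache during development (cache_mode=CacheMode.ENABLED).
Error handling: implement thorough error handling and retry logic for production use.
API keys: store the Gemini API key in an environment variable rather than hard-coding it.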
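
The rate-limiting, retry, and API-key tips can be sketched with the standard library alone. Everything below is illustrative: `with_retries`, `polite_fetch_all`, and `flaky_fetch` are hypothetical helpers, the delays are kept tiny so the demo runs quickly, and `GEMINI_API_KEY` is an assumed environment variable name.

```python
import asyncio
import os

# Assumed variable name; read the key from the environment instead of the source code.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")


async def with_retries(make_call, attempts: int = 3, base_delay: float = 0.01):
    """Await make_call(); on failure, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:  # in real code, catch only the errors worth retrying
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def polite_fetch_all(urls, fetch, max_concurrent: int = 2, delay: float = 0.01):
    """Fetch every URL with bounded concurrency and a pause after each request."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            result = await with_retries(lambda: fetch(url))
            await asyncio.sleep(delay)  # hold the slot briefly to space out requests
            return result

    return await asyncio.gather(*(fetch_one(url) for url in urls))


# Demo: a stub that fails on the first call for each URL, then succeeds.
calls = {}


async def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if calls[url] == 1:
        raise ConnectionError("simulated transient failure")
    return f"ok:{url}"


pages = asyncio.run(polite_fetch_all(["u1", "u2", "u3"], flaky_fetch))
print(pages)
```

For real scraping, the delay and concurrency values would need to be far more conservative than these demo settings.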

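Next Steps

This scraper can be extended to:

Store data in a database.
Monitor price changes over time.
Analyze product trends and patterns.
Compare prices across multiple stores.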
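
As one possible starting point for the first two extensions, the sketch below stores price observations in SQLite and computes a price change. The table layout, the `parse_rupiah` helper, and the sample rows are all hypothetical, and the parser assumes prices are displayed like "Rp150.000" with "." as a thousands separator and no decimal part.

```python
import re
import sqlite3
from typing import Optional


def parse_rupiah(price_text: Optional[str]) -> Optional[int]:
    """Convert a display price such as 'Rp150.000' to 150000; None if no digits are present."""
    digits = re.sub(r"\D", "", price_text or "")
    return int(digits) if digits else None


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    """
    CREATE TABLE price_history (
        product_url TEXT NOT NULL,
        scraped_at  TEXT NOT NULL,  -- ISO date, so text ordering matches time ordering
        price_idr   INTEGER
    )
    """
)

# Hypothetical observations from two scraping runs.
observations = [
    ("https://example.com/a", "2025-01-10", "Rp150.000"),
    ("https://example.com/a", "2025-01-12", "Rp135.000"),
]
conn.executemany(
    "INSERT INTO price_history VALUES (?, ?, ?)",
    [(url, day, parse_rupiah(text)) for url, day, text in observations],
)

rows = conn.execute(
    "SELECT scraped_at, price_idr FROM price_history WHERE product_url = ? ORDER BY scraped_at",
    ("https://example.com/a",),
).fetchall()
change = rows[-1][1] - rows[0][1]  # latest price minus earliest price
print(rows, change)
```

Storing the parsed integer rather than the display string is what makes comparisons across runs and stores straightforward.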

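Conclusion

crawl4ai's LLM-based extraction makes web scrapers considerably easier to maintain than traditional selector-based approaches. Combining it with Pydantic keeps the data validated and structured.

Always follow a site's robots.txt and terms of service before scraping.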
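
A robots.txt check can be automated with Python's standard `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so it runs without network access; the rules, the `my-scraper` user agent, and the `example.com` URLs are illustrative and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real check would load the site's own file,
# for example with parser.set_url("https://<site>/robots.txt") followed by parser.read().
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("my-scraper", "https://example.com/find/mouse-wireless")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

A robots.txt check covers only part of the picture; the terms of service still need to be read separately.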

Important Links:

Crawl4AI

Official site: https://www.php.cn/link/1026d8c97a822ee171c6cbf939fe4aca
GitHub repository: https://www.php.cn/link/62c1b075041300455ec2b54495d93c99
Documentation: https://www.php.cn/link/1026d8c97a822ee171c6cbf939fe4aca/mkdocs/core/installation/

Pydantic

Official documentation: https://www.php.cn/link/a4d4ec4aa3c45731396ed6e65fee40b9
PyPI page: https://www.php.cn/link/4d8ab89733dd9a88f1a9d130ca675c2e
GitHub repository: https://www.php.cn/link/22935fba49f7d80d5adf1cfa6b0344f4

Note: the full code is available in the Colab notebook. Feel free to experiment and adapt it to your specific needs.
