Scrape but Validate: Data Scraping with Pydantic Validation

Susan Sarandon

Nov 22, 2024, 07:40 AM

Note: This is not the output of ChatGPT or any other LLM.

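Data scraping is the process of collecting data from public web sources, mostly with scripts in an automated way. Because of the automation, the collected data often contains errors and has to be filtered and cleaned before use. It is better, however, if the scraped data can be validated while it is being scraped.

Given the need for data validation, most scraping frameworks, such as Scrapy, have built-in patterns that can be used for data validation. Often, however, we just use general-purpose modules such as requests and beautifulsoup for scraping. In that case it is hard to validate the collected data, so this blog post explains a simple approach to data scraping with validation using Pydantic.

Pydantic (https://docs.pydantic.dev/latest/) is a data validation module for Python. It is also the backbone of the popular API module FastAPI. Like Pydantic, other Python modules can be used for validation during data scraping. This blog explores Pydantic, but here is a link to an alternative package (as a learning exercise, you can try swapping Pydantic for any other module):

Cerberus is a lightweight and extensible data validation library for Python. https://pypi.org/project/Cerberus/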

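Scraping plan:

In this blog, we will scrape quotes from the quotes site.
1) Use requests and beautifulsoup to get the data.
2) Create a Pydantic data class to validate each scraped item.
3) Save the filtered and validated data in a JSON file.

For better organization and understanding, each step is implemented as a Python method that can be used under the main section.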

Basic imports

import requests # for web request
from bs4 import BeautifulSoup # cleaning html content

# pydantic for validation

from pydantic import BaseModel, field_validator, ValidationError

import json

1. Target site and getting the quotes

We scrape the quotes from http://quotes.toscrape.com/. Each quote has three fields: quote_text, author, and tags.

The method below is a generic script that gets the html content of a given url.

def get_html_content(page_url: str) -> str:
    page_content = ""
    # Send a GET request to the website
    response = requests.get(page_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # response.text is the decoded body, matching the str return type
        page_content = response.text
    else:
        page_content = f'Failed to retrieve the webpage. Status code: {response.status_code}'
    return page_content

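2. Scrape the quote data

We will use requests and beautifulsoup to scrape the data from the given url. The process has three parts: 1) get the html content from the web, 2) extract the desired html tags for each target field, 3) get the values from each tag.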
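To make the three parts concrete, here is a minimal sketch that runs them on a small inline html snippet instead of a live page. The `sample_html` string is a hypothetical sample that only mimics the markup the selectors in this post assume (a `div` with class `quote` containing a `span.text`, a `small.author`, and a `div.tags` with links); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the markup of a single quote (an assumption)
sample_html = """
<div class="quote">
  <span class="text">“Sample quote.”</span>
  <small class="author">Jane Doe</small>
  <div class="tags">
    <a class="tag">life</a>
    <a class="tag">books</a>
  </div>
</div>
"""

# 1) The html content (inline here instead of fetched from the web)
soup = BeautifulSoup(sample_html, 'html.parser')

# 2) Extract the html tags for each target field
quote_div = soup.find('div', class_='quote')
text_tag = quote_div.find('span', class_='text')
author_tag = quote_div.find('small', class_='author')
tag_links = quote_div.find('div', class_='tags').find_all('a')

# 3) Get the values from each tag
record = {
    'quote_text': text_tag.get_text(),
    'author': author_tag.get_text(),
    'tags': [link.get_text() for link in tag_links],
}
print(record)
```

Note that the quote text still carries its curly quotation marks at this stage; cleaning them up is left to the validation step.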


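The method below finds the div of every quote in the html content.

def get_quotes_div(html_content: str) -> list:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the quotes on the page
    quotes = soup.find_all('div', class_='quote')

    return quotes

The script below gets the data points from each quote's div.

def get_tags(tags_div) -> list:
    # Collect the text of every <a> element inside the tags div
    return [tag.get_text() for tag in tags_div.find_all('a')]

3. Create the Pydantic data class and validate each quote's data

Based on the fields of a quote, create a Pydantic class and use that same class for data validation during scraping.

Pydantic model Quote

The Quote class extends BaseModel and has three fields: quote_text, author, and tags. Of these, quote_text and author are strings (str) and tags is a list.

It has two validator methods (with decorators):

1) tags_more_than_two(): checks that a quote has more than two tags. (This is only an example; you can define any rule here.)

2) check_quote_text(): strips the curly quotation marks (“ and ”) from the start and end of the quote text.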

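The Quote model with its two validators:

class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        # Reject quotes without more than two tags (message wording is illustrative)
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        return quote_text.removeprefix('“').removesuffix('”')

Fetching and validating the data

Data validation with Pydantic is very simple; for example, the code below passes the scraped data to the Pydantic class Quote.

# Loop through each quote and extract the text, author, and tags
for quote in quotes_div:
    quote_text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tags = get_tags(quote.find('div', class_='tags'))

    # Collect the scraped fields in a dictionary
    quote_temp = {'quote_text': quote_text,
                  'author': author,
                  'tags': tags}

    # Validate the fields; raises ValidationError if a rule fails
    quote_data = Quote(**quote_temp)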
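To see both validators in action without scraping anything, here is a small self-contained sketch. It restates the Quote model as described in this post and runs it on two made-up records; `validate_record`, the sample records, and the error message wording are assumptions added for illustration.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        # Example rule from the post: a quote needs more than two tags
        if len(tags_list) <= 2:
            raise ValueError('There should be more than two tags.')
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        # Strip the curly quotation marks around the quote
        return quote_text.removeprefix('“').removesuffix('”')


def validate_record(record: dict):
    """Hypothetical helper: return (data, None) on success, (None, errors) on failure."""
    try:
        return Quote(**record).model_dump(), None
    except ValidationError as e:
        return None, e.errors()


# Made-up records: the first satisfies both rules, the second has too few tags
good = {'quote_text': '“Sample quote.”', 'author': 'Jane Doe', 'tags': ['a', 'b', 'c']}
bad = {'quote_text': '“Too few tags.”', 'author': 'Jane Doe', 'tags': ['a']}

valid, _ = validate_record(good)
print(valid)

invalid, errors_bad = validate_record(bad)
print(errors_bad[0]['loc'], errors_bad[0]['msg'])
```

The first record comes back as a plain dictionary with the quotation marks removed; the second is rejected, and the error entry points at the tags field.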

4. Store the data

Once the data is validated, it is saved to a JSON file. (A general-purpose method is written to convert a Python dictionary into a JSON file.)
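The saving method itself does not appear in this post, so the following is only a minimal sketch of what such a helper could look like. The function name `save_to_json`, the file name, and the sample record are all hypothetical.

```python
import json
from pathlib import Path


def save_to_json(data: list, file_path: str) -> None:
    """Write a list of dictionaries to a JSON file (hypothetical helper)."""
    path = Path(file_path)
    with path.open('w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)


# Made-up validated record, shaped like the output of Quote.model_dump()
quotes = [{'quote_text': 'Sample quote.', 'author': 'Jane Doe', 'tags': ['a', 'b', 'c']}]
save_to_json(quotes, 'quotes_sample.json')
```

Because the validated quotes are stored as plain dictionaries (via model_dump()), they can be passed to json.dump without any custom encoder.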

Putting it all together

Now that each piece of the scraping is understood, you can put everything together and run the scraper to collect the data.

def get_quotes_data(quotes_div: list) -> list:
    quotes_data = []

    # Loop through each quote and extract the text, author, and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # Collect the scraped fields in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags}

        # Validate the data with the Pydantic model
        try:
            quote_data = Quote(**quote_temp)
            quotes_data.append(quote_data.model_dump())
        except ValidationError as e:
            # Report invalid quotes and skip them
            print(e.json())
    return quotes_data
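The main section that ties the steps together is not shown in this post. As a rough single-file sketch of how the pieces could be combined, the block below fetches a page, validates every quote, and writes the valid ones to a JSON file. The names `fetch_html`, `scrape_quotes`, and `main`, the `fetch` parameter (which makes it possible to try the pipeline on a local html string), and the error message wording are assumptions, not the author's code.

```python
import json

import requests
from bs4 import BeautifulSoup
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError('There should be more than two tags.')
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        return quote_text.removeprefix('“').removesuffix('”')


def fetch_html(page_url: str) -> str:
    # Raise on HTTP errors instead of returning an error message as content
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_quotes(html_content: str) -> list:
    soup = BeautifulSoup(html_content, 'html.parser')
    valid_quotes = []
    for quote in soup.find_all('div', class_='quote'):
        record = {
            'quote_text': quote.find('span', class_='text').get_text(),
            'author': quote.find('small', class_='author').get_text(),
            'tags': [a.get_text() for a in quote.find('div', class_='tags').find_all('a')],
        }
        try:
            valid_quotes.append(Quote(**record).model_dump())
        except ValidationError as e:
            # Report invalid quotes and skip them
            print(e.json())
    return valid_quotes


def main(page_url: str, output_file: str, fetch=fetch_html) -> list:
    # Fetch, validate, and store; returns the quotes that passed validation
    quotes = scrape_quotes(fetch(page_url))
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)
    return quotes
```

Calling main('http://quotes.toscrape.com/', 'quotes.json') would run the whole pipeline against the site used in this post. Quotes that fail a rule are reported and left out of the output file.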

Note: A revision is planned. Let me know your thoughts or suggestions so they can be included in the revised version.

Links and resources:

https://pypi.org/project/parsel/
https://docs.pydantic.dev/latest/
