首頁  >  文章  >  後端開發  >  基於設計原則的重構:資料擷取爬蟲系統範例

基於設計原則的重構:資料擷取爬蟲系統範例

WBOY
WBOY原創
2024-07-20 07:38:49731瀏覽

Refactoring based on design principles: example of a data collection crawler system

介紹

提高程式碼品質始終是軟體開發中的重要議題。在本文中,我們以資料收集爬蟲系統為例,具體講解如何透過逐步重建來應用設計原則和最佳實務。

改進前的程式碼

首先,我們從一個非常簡單的網頁抓取工具開始,將所有功能整合到一個類別中。

由 DeepL.com 翻譯(免費版)

project_root/
├── web_scraper.py
├── main.py
└── requirements.txt

web_scraper.py

import requests
import json
import sqlite3

class WebScraper:
    def __init__(self, url):
        self.url = url

    def fetch_data(self):
        response = requests.get(self.url)
        data = response.text
        parsed_data = self.parse_data(data)
        enriched_data = self.enrich_data(parsed_data)
        self.save_data(enriched_data)
        return enriched_data

    def parse_data(self, data):
        return json.loads(data)

    def enrich_data(self, data):
        # Apply business logic here
        # Example: extract only data containing specific keywords
        return {k: v for k, v in data.items() if 'important' in v.lower()}

    def save_data(self, data):
        conn = sqlite3.connect('test.db')
        cursor = conn.cursor()
        cursor.execute('INSERT INTO data (json_data) VALUES (?)', (json.dumps(data),))
        conn.commit()
        conn.close()

main.py

from web_scraper import WebScraper

def main():
    scraper = WebScraper('https://example.com/api/data')
    data = scraper.fetch_data()
    print(data)

if __name__ == "__main__":
    main()

需要改進的地方

  1. 違反了單一職責原則:一個類別負責所有資料收集、分析、豐富和儲存
  2. 業務邏輯不清晰:業務邏輯嵌入在enrich_data方法中,但與其他處理混合
  3. 缺乏可重複使用性:函數緊密耦合,使得單獨重複使用變得困難
  4. 測試困難:難以獨立測試各個功能
  5. 配置剛性:資料庫路徑和其他設定直接嵌入程式碼

重構階段

1.職責分離:資料採集、分析、儲存分離

  • 重大變化:將資料收集、分析和儲存的職責分離到不同的類別中
  • 目標:應用單一責任原則,引入環境變數

目錄結構

project_root/
├── data_fetcher.py
├── data_parser.py
├── data_saver.py
├── data_enricher.py
├── web_scraper.py
├── main.py
└── requirements.txt

data_enricher.py

class DataEnricher:
    def enrich(self, data):
        return {k: v for k, v in data.items() if 'important' in v.lower()}

web_scraper.py

from data_fetcher import DataFetcher
from data_parser import DataParser
from data_enricher import DataEnricher
from data_saver import DataSaver

class WebScraper:
    def __init__(self, url):
        self.url = url
        self.fetcher = DataFetcher()
        self.parser = DataParser()
        self.enricher = DataEnricher()
        self.saver = DataSaver()

    def fetch_data(self):
        raw_data = self.fetcher.fetch(self.url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data

此變更明確了每個類別的職責並提高了可重複使用性和可測試性。然而,業務邏輯仍然嵌入在 DataEnricher 類別中。

2.介面介紹與依賴注入

  • 主要變化:引入介面並實現依賴注入。
  • 目的:增加靈活性和可擴展性,擴展環境變量,抽象業務邏輯

目錄結構

project_root/
├── interfaces/
│   ├── __init__.py
│   ├── data_fetcher_interface.py
│   ├── data_parser_interface.py
│   ├── data_enricher_interface.py
│   └── data_saver_interface.py
├── implementations/
│   ├── __init__.py
│   ├── http_data_fetcher.py
│   ├── json_data_parser.py
│   ├── keyword_data_enricher.py
│   └── sqlite_data_saver.py
├── web_scraper.py
├── main.py
└── requirements.txt

介面/data_fetcher_interface.py

from abc import ABC, abstractmethod

class DataFetcherInterface(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        pass

介面/data_parser_interface.py

from abc import ABC, abstractmethod
from typing import Dict, Any

class DataParserInterface(ABC):
    @abstractmethod
    def parse(self, raw_data: str) -> Dict[str, Any]:
        pass

介面/data_enricher_interface.py

from abc import ABC, abstractmethod
from typing import Dict, Any

class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass

介面/data_saver_interface.py

from abc import ABC, abstractmethod
from typing import Dict, Any

class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> None:
        pass

實作/keyword_data_enricher.py

import os
from interfaces.data_enricher_interface import DataEnricherInterface

class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}

web_scraper.py

from interfaces.data_fetcher_interface import DataFetcherInterface
from interfaces.data_parser_interface import DataParserInterface
from interfaces.data_enricher_interface import DataEnricherInterface
from interfaces.data_saver_interface import DataSaverInterface

class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface, 
                 enricher: DataEnricherInterface, saver: DataSaverInterface):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data

現階段主要變化有

  1. 引入一個介面以方便切換到不同的實作
  2. 依賴注入使WebScraper類別更加靈活
  3. fetch_data 方法已變更為以 url 作為參數,使 URL 規格更加靈活。
  4. 業務邏輯已抽象化為 DataEnricherInterface 並實作為 KeywordDataEnricher。
  5. 透過允許使用環境變數設定關鍵字,業務邏輯變得更加靈活。

這些改變大大提高了系統的靈活性和可擴展性。然而,業務邏輯仍然嵌入在 DataEnricherInterface 及其實作中。下一步就是進一步分離這個業務邏輯,並明確定義為領域層。

3.領域層的引進與業務邏輯的分離

上一步中,介面的引入增加了系統的彈性。但是,業務邏輯(在本例中為資料重要性確定和過濾)仍然被視為資料層的一部分。基於領域驅動設計的理念,將此業務邏輯視為系統的中心概念,並將其實現為獨立的領域層,可以帶來以下好處

  1. 業務邏輯集中管理
  2. 透過領域模型更具表現力的程式碼
  3. 更改業務規則具有更大的靈活性
  4. 易於測試

更新的目錄結構:

project_root/
├── domain/
│   ├── __init__.py
│   ├── scraped_data.py
│   └── data_enrichment_service.py
├── data/
│   ├── __init__.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── data_fetcher_interface.py
│   │   ├── data_parser_interface.py
│   │   └── data_saver_interface.py
│   ├── implementations/
│   │   ├── __init__.py
│   │   ├── http_data_fetcher.py
│   │   ├── json_data_parser.py
│   │   └── sqlite_data_saver.py
├── application/
│   ├── __init__.py
│   └── web_scraper.py
├── main.py
└── requirements.txt

現階段,DataEnricherInterface 和 KeywordDataEnricher 的角色將轉移到領域層的 ScrapedData 模型和 DataEnrichmentService 中。下面提供了此更改的詳細資訊。

更改前(第 2 部分)

class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass
class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}

修改後(第 3 部分)

@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self) -> bool:
        important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')
        return any(important_keyword in str(v).lower() for v in self.content.values())
class DataEnrichmentService:
    def __init__(self):
        self.important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important():
            enriched_content = {k: v for k, v in data.content.items() if self.important_keyword in str(v).lower()}
            return ScrapedData(content=enriched_content, source_url=data.source_url)
        return data

此變更改進了以下內容。

  1. 業務邏輯已移至網域層,消除了對 DataEnricherInterface 的需求。

  2. the KeywordDataEnricher functionality has been merged into the DataEnrichmentService, centralizing the business logic in one place.

  3. The is_important method has been added to the ScrapedData model. This makes the domain model itself responsible for determining the importance of data and makes the domain concept clearer.

  4. DataEnrichmentService now handles ScrapedData objects directly, improving type safety.

The WebScraper class will also be updated to reflect this change.

from data.interfaces.data_fetcher_interface import DataFetcherInterface
from data.interfaces.data_parser_interface import DataParserInterface
from data.interfaces.data_saver_interface import DataSaverInterface
from domain.scraped_data import ScrapedData
from domain.data_enrichment_service import DataEnrichmentService

class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface, 
                 saver: DataSaverInterface, enrichment_service: DataEnrichmentService):
        self.fetcher = fetcher
        self.parser = parser
        self.saver = saver
        self.enrichment_service = enrichment_service

    def fetch_data(self, url: str) -> ScrapedData:
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        scraped_data = ScrapedData(content=parsed_data, source_url=url)
        enriched_data = self.enrichment_service.enrich(scraped_data)
        self.saver.save(enriched_data)
        return enriched_data

This change completely shifts the business logic from the data layer to the domain layer, giving the system a clearer structure. The removal of the DataEnricherInterface and the introduction of the DataEnrichmentService are not just interface replacements, but fundamental changes in the way business logic is handled.

Summary

This article has demonstrated how to improve code quality and apply design principles specifically through a step-by-step refactoring process for the data collection crawler system. The main areas of improvement are as follows.

  1. Separation of Responsibility: Applying the principle of single responsibility, we separated data acquisition, parsing, enrichment, and storage into separate classes.
  2. Introduction of interfaces and dependency injection: greatly increased the flexibility and scalability of the system, making it easier to switch to different implementations.
  3. Introduction of domain model and services: clearly separated the business logic and defined the core concepts of the system.
  4. Adoption of Layered Architecture: Clearly separated the domain, data, and application layers and defined the responsibilities of each layer. 5.Maintain interfaces: Maintained abstraction at the data layer to ensure flexibility in implementation.

These improvements have greatly enhanced the system's modularity, reusability, testability, maintainability, and scalability. In particular, by applying some concepts of domain-driven design, the business logic became clearer and the structure was more flexible to accommodate future changes in requirements. At the same time, by maintaining the interfaces, we ensured the flexibility to easily change and extend the data layer implementation.

It is important to note that this refactoring process is not a one-time event, but part of a continuous improvement process. Depending on the size and complexity of the project, it is important to adopt design principles and DDD concepts at the appropriate level and to make incremental improvements.

Finally, the approach presented in this article can be applied to a wide variety of software projects, not just data collection crawlers. We encourage you to use them as a reference as you work to improve code quality and design.

以上是基於設計原則的重構:資料擷取爬蟲系統範例的詳細內容。更多資訊請關注PHP中文網其他相關文章!

陳述:
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn