Refactoring based on design principles: example of a data collection crawler system
Improving code quality is always an important issue in software development. In this article, we take a data collection crawler system as an example and explain specifically how to apply design principles and best practices through step-by-step refactoring.
First, we start with a very simple web scraper with all functionality integrated into one class.
project_root/
├── web_scraper.py
├── main.py
└── requirements.txt
web_scraper.py
import json
import sqlite3

import requests


class WebScraper:
    def __init__(self, url):
        self.url = url

    def fetch_data(self):
        # Fetching, parsing, enriching, and saving all happen in this one method
        response = requests.get(self.url)
        data = response.text
        parsed_data = self.parse_data(data)
        enriched_data = self.enrich_data(parsed_data)
        self.save_data(enriched_data)
        return enriched_data

    def parse_data(self, data):
        return json.loads(data)

    def enrich_data(self, data):
        # Apply business logic here
        # Example: extract only data containing specific keywords
        # str() keeps non-string JSON values (numbers, lists) from raising
        return {k: v for k, v in data.items() if 'important' in str(v).lower()}

    def save_data(self, data):
        # Assumes a table named "data" with a json_data column already exists
        conn = sqlite3.connect('test.db')
        cursor = conn.cursor()
        cursor.execute('INSERT INTO data (json_data) VALUES (?)', (json.dumps(data),))
        conn.commit()
        conn.close()
main.py
from web_scraper import WebScraper


def main():
    scraper = WebScraper('https://example.com/api/data')
    data = scraper.fetch_data()
    print(data)


if __name__ == "__main__":
    main()
directory structure
project_root/
├── data_fetcher.py
├── data_parser.py
├── data_saver.py
├── data_enricher.py
├── web_scraper.py
├── main.py
└── requirements.txt
data_enricher.py
class DataEnricher:
    def enrich(self, data):
        # str() keeps non-string JSON values (numbers, lists) from raising
        return {k: v for k, v in data.items() if 'important' in str(v).lower()}
web_scraper.py
from data_fetcher import DataFetcher
from data_parser import DataParser
from data_enricher import DataEnricher
from data_saver import DataSaver


class WebScraper:
    def __init__(self, url):
        self.url = url
        self.fetcher = DataFetcher()
        self.parser = DataParser()
        self.enricher = DataEnricher()
        self.saver = DataSaver()

    def fetch_data(self):
        raw_data = self.fetcher.fetch(self.url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
This change clarifies the responsibilities of each class and improves reusability and testability. However, the business logic is still embedded in the DataEnricher class.
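To make the testability claim concrete, here is a minimal sketch of checks that exercise parsing and enrichment with no network call and no database. The article lists data_parser.py but does not show it, so the DataParser body below is an assumption (a thin wrapper around json.loads, mirroring the original parse_data method). The test function names are hypothetical as well.

```python
import json


class DataParser:
    # Assumed implementation: data_parser.py is listed in the tree but not shown.
    def parse(self, raw_data):
        return json.loads(raw_data)


class DataEnricher:
    # Keyword filter from data_enricher.py; str() keeps non-string values from raising.
    def enrich(self, data):
        return {k: v for k, v in data.items() if 'important' in str(v).lower()}


def test_parser_returns_dict():
    parsed = DataParser().parse('{"a": "Important notice", "b": "routine"}')
    assert parsed == {"a": "Important notice", "b": "routine"}


def test_enricher_keeps_only_matching_values():
    enriched = DataEnricher().enrich({"a": "Important notice", "b": "routine", "c": 7})
    assert enriched == {"a": "Important notice"}


if __name__ == "__main__":
    test_parser_returns_dict()
    test_enricher_keeps_only_matching_values()
    print("all checks passed")
```

In the single-class version, the same checks would have required patching both requests.get and sqlite3.connect, because fetch_data ran every step in one call.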
directory structure
project_root/
├── interfaces/
│   ├── __init__.py
│   ├── data_fetcher_interface.py
│   ├── data_parser_interface.py
│   ├── data_enricher_interface.py
│   └── data_saver_interface.py
├── implementations/
│   ├── __init__.py
│   ├── http_data_fetcher.py
│   ├── json_data_parser.py
│   ├── keyword_data_enricher.py
│   └── sqlite_data_saver.py
├── web_scraper.py
├── main.py
└── requirements.txt
interfaces/data_fetcher_interface.py
from abc import ABC, abstractmethod


class DataFetcherInterface(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        pass
interfaces/data_parser_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataParserInterface(ABC):
    @abstractmethod
    def parse(self, raw_data: str) -> Dict[str, Any]:
        pass
interfaces/data_enricher_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass
interfaces/data_saver_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> None:
        pass
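As a small illustration of what these abstract base classes enforce, the sketch below shows that Python refuses to instantiate a subclass that has not implemented every abstract method. IncompleteSaver, ListSaver, and can_instantiate are hypothetical names introduced only for this example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> None:
        pass


class IncompleteSaver(DataSaverInterface):
    # Hypothetical class that forgets to implement save()
    pass


class ListSaver(DataSaverInterface):
    # Hypothetical in-memory implementation that satisfies the interface
    def __init__(self) -> None:
        self.items: List[Dict[str, Any]] = []

    def save(self, data: Dict[str, Any]) -> None:
        self.items.append(data)


def can_instantiate(cls) -> bool:
    try:
        cls()
        return True
    except TypeError:
        # ABC raises TypeError when abstract methods are left unimplemented
        return False
```

The check happens at instantiation time, not at class definition time, so a missing method surfaces as soon as the object is constructed rather than deep inside a scraping run.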
implementations/keyword_data_enricher.py
import os

from interfaces.data_enricher_interface import DataEnricherInterface


class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        # Lowercase the keyword so a mixed-case IMPORTANT_KEYWORD still matches,
        # since values are lowercased before comparison
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important').lower()

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
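The following sketch shows how the IMPORTANT_KEYWORD environment variable changes the filter. It is a standalone copy of the enricher without the interface import, and it lowercases the keyword so that a value such as "Urgent" still matches. The enrich_with_keyword helper is hypothetical and exists only to set the variable for one call.

```python
import os
from typing import Any, Dict


class KeywordDataEnricher:
    def __init__(self) -> None:
        # Lowercased so a mixed-case IMPORTANT_KEYWORD matches lowercased values
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important').lower()

    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}


def enrich_with_keyword(keyword: str, data: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical helper: sets IMPORTANT_KEYWORD for one call, then restores it
    previous = os.environ.get('IMPORTANT_KEYWORD')
    os.environ['IMPORTANT_KEYWORD'] = keyword
    try:
        return KeywordDataEnricher().enrich(data)
    finally:
        if previous is None:
            del os.environ['IMPORTANT_KEYWORD']
        else:
            os.environ['IMPORTANT_KEYWORD'] = previous
```

Because the keyword is read once in the constructor, changing the environment variable later has no effect on an enricher that already exists.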
web_scraper.py
from interfaces.data_fetcher_interface import DataFetcherInterface
from interfaces.data_parser_interface import DataParserInterface
from interfaces.data_enricher_interface import DataEnricherInterface
from interfaces.data_saver_interface import DataSaverInterface


class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 enricher: DataEnricherInterface, saver: DataSaverInterface):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
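The main changes at this stage are the introduction of an abstract interface for each component, the injection of those components through the WebScraper constructor, and a filter keyword that is read from the IMPORTANT_KEYWORD environment variable instead of being hard-coded.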
These changes have greatly improved the flexibility and extensibility of the system. However, the business logic remains embedded in the DataEnricherInterface and its implementation. The next step is to further separate this business logic and clearly define it as a domain layer.
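One way to see the flexibility that constructor injection provides is to run the scraper entirely offline. The sketch below is an illustration under assumptions: StaticFetcher, InMemorySaver, and build_offline_scraper are hypothetical names, JsonDataParser is an assumed implementation (the article lists json_data_parser.py without showing it), and the classes rely on duck typing rather than subclassing the interfaces so that the example stays self-contained.

```python
import json
from typing import Any, Dict, List


class StaticFetcher:
    """Hypothetical stand-in for an HTTP fetcher: returns a canned response body."""

    def __init__(self, body: str) -> None:
        self.body = body

    def fetch(self, url: str) -> str:
        return self.body


class JsonDataParser:
    """Assumed parser implementation: a thin wrapper around json.loads."""

    def parse(self, raw_data: str) -> Dict[str, Any]:
        return json.loads(raw_data)


class KeywordDataEnricher:
    def __init__(self, keyword: str = "important") -> None:
        self.keyword = keyword.lower()

    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}


class InMemorySaver:
    """Hypothetical stand-in for the SQLite saver: keeps records in a list."""

    def __init__(self) -> None:
        self.saved: List[Dict[str, Any]] = []

    def save(self, data: Dict[str, Any]) -> None:
        self.saved.append(data)


class WebScraper:
    # Same pipeline as web_scraper.py at this stage; collaborators are injected
    def __init__(self, fetcher, parser, enricher, saver):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data


def build_offline_scraper(body: str):
    # Composition happens in one place; WebScraper never names a concrete class
    saver = InMemorySaver()
    scraper = WebScraper(StaticFetcher(body), JsonDataParser(), KeywordDataEnricher(), saver)
    return scraper, saver
```

In a real main.py the same constructor call would presumably receive the HTTP fetcher and SQLite saver from the implementations package instead of these stand-ins.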
In the previous step, the introduction of interfaces increased the flexibility of the system. However, the business logic (in this case, determining whether data is important and filtering it) is still treated as part of the data layer. Following the concepts of domain-driven design, this step treats that business logic as the central concept of the system and implements it as an independent domain layer.
Updated directory structure:
project_root/
├── domain/
│   ├── __init__.py
│   ├── scraped_data.py
│   └── data_enrichment_service.py
├── data/
│   ├── __init__.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── data_fetcher_interface.py
│   │   ├── data_parser_interface.py
│   │   └── data_saver_interface.py
│   └── implementations/
│       ├── __init__.py
│       ├── http_data_fetcher.py
│       ├── json_data_parser.py
│       └── sqlite_data_saver.py
├── application/
│   ├── __init__.py
│   └── web_scraper.py
├── main.py
└── requirements.txt
At this stage, the roles of DataEnricherInterface and KeywordDataEnricher will be moved to the ScrapedData model and DataEnrichmentService at the domain layer. Details of this change are provided below.
Before change (Section 2)
class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass


class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
After modification (Section 3)
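# domain/scraped_data.py
import os
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self) -> bool:
        # Lowercase the keyword because values are lowercased before comparison
        important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important').lower()
        return any(important_keyword in str(v).lower() for v in self.content.values())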
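# domain/data_enrichment_service.py
import os

from domain.scraped_data import ScrapedData


class DataEnrichmentService:
    def __init__(self):
        # Lowercase the keyword because values are lowercased before comparison
        self.important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important').lower()

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important():
            # Keep only the entries whose value contains the keyword
            enriched_content = {k: v for k, v in data.content.items()
                                if self.important_keyword in str(v).lower()}
            return ScrapedData(content=enriched_content, source_url=data.source_url)
        # Data with no matching entry is returned unchanged
        return data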
This change brings the following improvements.
Business logic has been moved to the domain layer, eliminating the need for a DataEnricherInterface.
The KeywordDataEnricher functionality has been merged into the DataEnrichmentService, centralizing the business logic in one place.
The is_important method has been added to the ScrapedData model. This makes the domain model itself responsible for determining the importance of data and makes the domain concept clearer.
DataEnrichmentService now handles ScrapedData objects directly, improving type safety.
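The behavior of the two domain classes can be sketched as follows. To keep the example deterministic, this variant passes the keyword explicitly instead of reading IMPORTANT_KEYWORD from the environment; that parameter is an assumption made for illustration and is not part of the code shown above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self, keyword: str = "important") -> bool:
        # Assumption: keyword is a parameter here rather than an environment lookup
        return any(keyword.lower() in str(v).lower() for v in self.content.values())


class DataEnrichmentService:
    def __init__(self, important_keyword: str = "important") -> None:
        self.important_keyword = important_keyword.lower()

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important(self.important_keyword):
            # Important data: keep only the entries that mention the keyword
            enriched = {k: v for k, v in data.content.items()
                        if self.important_keyword in str(v).lower()}
            return ScrapedData(content=enriched, source_url=data.source_url)
        # Unimportant data is passed through untouched
        return data


if __name__ == "__main__":
    service = DataEnrichmentService()
    hit = ScrapedData({"a": "Important update", "b": "noise"}, "https://example.com/api/data")
    print(service.enrich(hit).content)
```

Note the asymmetry in enrich: data that contains the keyword is reduced to the matching entries, while data without it is returned as the same object and keeps all of its entries. Callers that want to discard unimportant data would need to check is_important themselves.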
The WebScraper class will also be updated to reflect this change.
from data.interfaces.data_fetcher_interface import DataFetcherInterface
from data.interfaces.data_parser_interface import DataParserInterface
from data.interfaces.data_saver_interface import DataSaverInterface
from domain.scraped_data import ScrapedData
from domain.data_enrichment_service import DataEnrichmentService


class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 saver: DataSaverInterface, enrichment_service: DataEnrichmentService):
        self.fetcher = fetcher
        self.parser = parser
        self.saver = saver
        self.enrichment_service = enrichment_service

    def fetch_data(self, url: str) -> ScrapedData:
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        scraped_data = ScrapedData(content=parsed_data, source_url=url)
        enriched_data = self.enrichment_service.enrich(scraped_data)
        self.saver.save(enriched_data)
        return enriched_data
This change completely shifts the business logic from the data layer to the domain layer, giving the system a clearer structure. The removal of the DataEnricherInterface and the introduction of the DataEnrichmentService are not just interface replacements, but fundamental changes in the way business logic is handled.
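The updated WebScraper passes a ScrapedData object to saver.save, whereas the earlier DataSaverInterface declared a dictionary parameter. One possible way to reconcile the two is sketched below: the interface accepts ScrapedData and the SQLite implementation stores the source URL next to the JSON content. The table schema, the db_path parameter, and the read_all and demo helpers are all assumptions made for this sketch; the article does not show sqlite_data_saver.py.

```python
import json
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str


class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: ScrapedData) -> None:
        pass


class SqliteDataSaver(DataSaverInterface):
    def __init__(self, db_path: str = "test.db") -> None:
        self.db_path = db_path

    def save(self, data: ScrapedData) -> None:
        conn = sqlite3.connect(self.db_path)
        try:
            # Assumed schema: one row per scrape, URL plus serialized content
            conn.execute("CREATE TABLE IF NOT EXISTS data (source_url TEXT, json_data TEXT)")
            conn.execute(
                "INSERT INTO data (source_url, json_data) VALUES (?, ?)",
                (data.source_url, json.dumps(data.content)),
            )
            conn.commit()
        finally:
            conn.close()


def read_all(db_path: str) -> List[Tuple[str, str]]:
    # Hypothetical helper used only to inspect what was stored
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT source_url, json_data FROM data").fetchall()
    finally:
        conn.close()


def demo() -> List[Tuple[str, str]]:
    # Writes one record to a throwaway database file and reads it back
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "scraper.db")
        record = ScrapedData({"a": "Important"}, "https://example.com/api/data")
        SqliteDataSaver(db_path).save(record)
        return read_all(db_path)
```

Keeping the serialization inside the saver means the domain model stays free of persistence details, which fits the layering described in this section.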
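This article has demonstrated how to improve code quality and apply design principles through a step-by-step refactoring of the data collection crawler system. The main areas of improvement were the separation of responsibilities into single-purpose classes, the introduction of interfaces with dependency injection, and the extraction of business logic into a domain layer.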
These improvements have greatly enhanced the system's modularity, reusability, testability, maintainability, and scalability. In particular, by applying some concepts of domain-driven design, the business logic became clearer and the structure was more flexible to accommodate future changes in requirements. At the same time, by maintaining the interfaces, we ensured the flexibility to easily change and extend the data layer implementation.
It is important to note that this refactoring process is not a one-time event, but part of a continuous improvement process. Depending on the size and complexity of the project, it is important to adopt design principles and DDD concepts at the appropriate level and to make incremental improvements.
Finally, the approach presented in this article can be applied to a wide variety of software projects, not just data collection crawlers. We encourage you to use them as a reference as you work to improve code quality and design.