Scrapy underlying architecture exploration and source code analysis
Scrapy is an efficient web crawler framework written in Python. It extracts data from web pages quickly and easily, supports a variety of data storage and export formats, and has become the framework of choice for many crawler enthusiasts and developers. Under the hood, Scrapy uses an asynchronous I/O model and a middleware mechanism, which make it both efficient and extensible. In this article, we explore Scrapy's implementation from two angles: its underlying architecture and its source code.
1. Scrapy’s underlying architecture
Scrapy's underlying architecture is divided into five main modules: Engine, Scheduler, Downloader, Spider, and Pipeline. Each has its own responsibility, and they work together to keep the crawling process smooth and efficient.
1. Engine
As the core of the framework, Scrapy's engine coordinates the interaction between the other modules and handles the events and signals that pass between them. When the engine receives the signal to start crawling, it takes a Request object from the scheduler and sends it to the downloader. When the download completes, the downloader returns a Response object to the engine. The engine hands the Response to the Spider for parsing, and any new Request objects the Spider produces are sent to the scheduler. This loop repeats until the scheduler queue is empty, at which point the crawl ends.
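The engine loop described above can be sketched in a few lines of plain Python. This is a simplified, synchronous toy model and not Scrapy's code (the real engine is asynchronous). `FAKE_SITE`, `download`, `parse`, and `crawl` are hypothetical names invented for the illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
FAKE_SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", []),
}


def download(url):
    """Stand-in for the downloader: turn a request (a URL) into a response."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Stand-in for the spider: yield one item, then follow-up requests."""
    yield {"item": response["text"]}
    for link in response["links"]:
        yield {"request": link}


def crawl(start_url):
    """Stand-in for the engine: move objects between the other parts."""
    queue = deque([start_url])  # plays the role of the scheduler queue
    seen = {start_url}          # plays the role of the duplicate filter
    items = []
    while queue:                # the crawl ends when the queue is empty
        response = download(queue.popleft())
        for result in parse(response):
            if "request" in result:
                if result["request"] not in seen:
                    seen.add(result["request"])
                    queue.append(result["request"])
            else:
                items.append(result["item"])
    return items


print(crawl("/"))
```

Each page is downloaded once even though `/b` is linked from two places, because the duplicate check runs before a request enters the queue.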
2. Scheduler
The scheduler stores and manages all Request objects waiting to be crawled and decides the order in which they are issued. It deduplicates Request objects and serves higher-priority requests first. When the engine needs the next Request to crawl, it calls the scheduler's method to obtain it. New Request objects that the Spider produces while parsing a Response are passed by the engine to the scheduler, which stores them in the request queue.
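A minimal sketch of these two responsibilities, deduplication and priority ordering, is shown below. It uses only the standard library; `ToyScheduler` and its fingerprint scheme are assumptions made for illustration and are much simpler than Scrapy's scheduler, although the method names `enqueue_request()` and `next_request()` mirror the ones Scrapy's scheduler exposes.

```python
import hashlib
import heapq
import itertools


class ToyScheduler:
    """Illustrative scheduler: skips duplicates, pops highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Tie-breaker so requests with equal priority come out in FIFO order.
        self._counter = itertools.count()

    @staticmethod
    def fingerprint(method, url):
        # Simplified fingerprint: a hash of the HTTP method and the URL.
        return hashlib.sha1(f"{method.upper()} {url}".encode()).hexdigest()

    def enqueue_request(self, url, method="GET", priority=0):
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False  # duplicate: not queued again
        self._seen.add(fp)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Note that the scheduler does not discard low-priority requests; they simply wait until everything with a higher priority has been served.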
3. Downloader
The downloader turns the Request object passed by the engine into a Response object and returns it to the engine. It fetches the page content from the specified URL by sending an HTTP or HTTPS request. Downloader middleware can hook into this process to add custom handling such as proxies, User-Agent headers, and cookie processing.
4. Spider
The Spider module holds the actual crawling logic. It parses the downloaded page content, encapsulates the parsed results in Item objects, and returns those Item objects to the engine. Using it usually means writing a custom Spider class and overriding a few methods to parse pages, build Item objects, and generate new Request objects.
5. Pipeline
The pipeline module applies a series of processing steps to the Item objects returned by the Spider, such as data cleaning, deduplication, and storage in databases or files. In Scrapy, you can write multiple pipeline classes and chain them in order of priority. When the Spider returns an Item object, it passes through these pipelines in that order.
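The chain can be illustrated with two small pipeline classes. They use the classic `process_item(item, spider)` method shape and need no Scrapy import. The class names, the sample item, and the `run_pipelines` driver are hypothetical; in a real project Scrapy itself walks the chain, and the keys of `ITEM_PIPELINES` are import-path strings rather than classes.

```python
class StripWhitespacePipeline:
    """Cleaning step: trims surrounding whitespace from string fields."""

    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class PriceToFloatPipeline:
    """Conversion step: turns a price such as "$12.50" into a float."""

    def process_item(self, item, spider):
        item = dict(item)
        item["price"] = float(item["price"].lstrip("$"))
        return item


# Same shape as Scrapy's ITEM_PIPELINES setting: lower numbers run first.
ITEM_PIPELINES = {PriceToFloatPipeline: 300, StripWhitespacePipeline: 100}


def run_pipelines(item, spider=None):
    """Toy driver standing in for the work Scrapy does internally."""
    for cls in sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get):
        item = cls().process_item(item, spider)
    return item


print(run_pipelines({"title": "  Book ", "price": " $12.50 "}))
```

The order matters here: the price conversion only succeeds because the whitespace has already been stripped by the pipeline with the lower number.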
2. Scrapy’s source code analysis
The Spider class is the most central class in Scrapy. It is the base class of all custom crawler classes and contains the main methods of the crawling process.
First, we need to define some attributes in our crawler class, such as name, allowed_domains, start_urls, etc. These attributes are used to specify the name of the crawler, the domain names that are allowed to be crawled, and the URL address from which crawling begins.
By overriding the start_requests() method, we can generate the first batch of requests and hand them over to the engine for crawling.
Next, we need to define the parse() method, which parses the downloaded page content: extracting data, generating new Request objects, and so on. parse() is the default callback that Scrapy invokes with each downloaded Response. Inside it, the page is parsed step by step, and the method returns (or yields) Item objects and new Request objects.
In Scrapy, the Item class is used to encapsulate data extracted from web pages. It behaves like a dictionary: fields are declared on the class and then read and written by key. An Item can hold fields of various types, and simple processing logic such as data cleaning or data mapping can be implemented in the crawler. The Item object is eventually returned to the engine and processed by each Pipeline in turn.
The Settings module holds Scrapy's configuration, including the crawler's name, request delay, concurrency level, download timeout, and so on. You can change how Scrapy runs by modifying these options. In crawler code, the settings are available through the settings attribute of the crawler class. Options are stored as key-value pairs, much like a dictionary, and their values can be set in code or read from a configuration file.
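As a rough illustration, a project's settings.py could contain a fragment like the following. The setting names are Scrapy's own; the values and the bot name are placeholders and not recommendations.

```python
# settings.py (fragment); values are illustrative only
BOT_NAME = "demo_bot"       # name of the crawler project
DOWNLOAD_DELAY = 0.5        # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 16    # upper bound on requests in flight at once
DOWNLOAD_TIMEOUT = 30       # seconds before a download is abandoned
```

Inside a spider, such values can be read with typed accessors, for example self.settings.getint("DOWNLOAD_TIMEOUT"), and a single spider can override them through its custom_settings class attribute.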
Downloader middleware in Scrapy can intercept the requests sent to the downloader and the responses it returns, and can modify either one, for example to add a proxy, a User-Agent header, or cookies. Scrapy supports multiple middlewares, which are executed in order of priority. A middleware does its work by implementing the process_request(), process_response(), or process_exception() methods.
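A downloader middleware is an ordinary class, so a minimal one can be sketched without importing Scrapy. The class name, the User-Agent strings, and the `PROXY` attribute below are hypothetical; the method signatures follow the classic `(request, spider)` and `(request, response, spider)` form.

```python
import random


class RandomUserAgentMiddleware:
    # Hypothetical User-Agent strings, for illustration only.
    USER_AGENTS = [
        "demo-crawler/1.0 (+https://example.com/bot)",
        "demo-crawler/1.1 (+https://example.com/bot)",
    ]
    # Hypothetical proxy URL; None disables proxying in this sketch.
    PROXY = None

    def process_request(self, request, spider):
        # Runs on the way to the downloader: adjust the outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        if self.PROXY:
            # Scrapy's built-in proxy support reads this meta key.
            request.meta["proxy"] = self.PROXY
        return None  # None means "continue processing this request"

    def process_response(self, request, response, spider):
        # Runs on the way back to the engine: pass the response through.
        return response
```

To take effect, such a class would be registered in the DOWNLOADER_MIDDLEWARES setting together with an integer that fixes its position in the chain.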
Spider middleware is used to intercept the Spider's input and output. It sits between the engine and the Spider: on the way in, it sees each Response before the Spider parses it, and on the way out, it sees the Item and Request objects the Spider produces. A middleware does its work by implementing the process_spider_input() and process_spider_output() methods.
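The two hooks can be sketched as follows, again as a plain class. `DropShortItemsMiddleware`, its `min_length` attribute, and the `text` field are hypothetical, and the sketch assumes a synchronous spider callback that yields dict items.

```python
class DropShortItemsMiddleware:
    # Hypothetical threshold: items with shorter text are discarded.
    min_length = 5

    def process_spider_input(self, response, spider):
        # Called with each response before the spider's callback runs.
        # Returning None lets processing continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with everything the callback produced (items and requests).
        for obj in result:
            is_item = isinstance(obj, dict)
            if is_item and len(obj.get("text", "")) < self.min_length:
                continue  # filter this item out
            yield obj     # requests and long-enough items pass through
```

Anything that is not a dict item, such as a follow-up Request, is passed through unchanged.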
The Scrapy framework is powerful and adapts to a wide variety of websites. It provides rich functionality and extension interfaces, and is well suited to large-scale, efficient, and stable web data crawling. Scrapy also has shortcomings. It does not execute JavaScript on its own, so pages that are rendered by JavaScript or populated through AJAX requests are not well supported out of the box; these cases need to be handled in combination with other tools.