
A brief introduction to the Python crawler framework Scrapy

不言
2018-10-19 17:04:04

This article gives a brief introduction to Scrapy, the Python crawler framework: its components, how data flows between them, and the steps for building a crawler.

Scrapy Framework

Scrapy is an application framework, written in pure Python, for crawling websites and extracting structured data. It has a wide range of uses.

Because the framework does most of the work, users only need to write a few custom modules to build a crawler that fetches web content and images.

Scrapy uses the Twisted (pronounced ['twɪstɪd]) asynchronous networking framework to handle network communication, which speeds up downloads without requiring us to implement an asynchronous framework ourselves. It also provides various middleware interfaces, so it can be adapted flexibly to different needs.

[Figure: Scrapy architecture diagram. The green lines show the direction of data flow.]

Scrapy Engine: Responsible for communication among the Spider, Item Pipeline, Downloader, and Scheduler, including signals and data transfer.

Scheduler: Accepts the Requests sent by the engine, arranges them in a certain order, puts them in a queue, and returns them to the engine when the engine asks for them.

Downloader: Downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them to the Spider for processing.

Spider: Processes all Responses, parses them to extract the data for the Item fields, and submits the URLs that need to be followed to the engine, which passes them to the Scheduler again.

Item Pipeline: Processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares: Components that can be customized to extend the download functionality.

Spider Middlewares: Components that can be customized to extend and operate on the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
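As a rough illustration of the two middleware types, the sketch below shows one downloader middleware and one spider middleware written as plain Python classes. The hook names process_request and process_spider_output follow Scrapy's documented middleware interface; the class names, the "example-bot/0.1" user agent, and the "title" field are assumptions made up for this example.

```python
class DefaultUserAgentMiddleware:
    """Downloader middleware: sees every Request before it is downloaded."""

    def __init__(self, user_agent="example-bot/0.1"):  # hypothetical value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Fill in a User-Agent header only if the request does not have one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None lets the request continue through the normal flow.
        return None


class RequireFieldsMiddleware:
    """Spider middleware: sees what the Spider yields before the engine does."""

    required_fields = ("title",)  # hypothetical field name

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Dict items missing a required field are dropped; everything
            # else (complete items, follow-up Requests) passes through.
            if isinstance(element, dict) and any(
                not element.get(field) for field in self.required_fields
            ):
                continue
            yield element
```

In a real project, classes like these would be enabled in settings.py through the DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES dictionaries, which map a class path to an order number.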


The Scrapy Operation Process

The code is written and the program starts to run...

Engine: Hi! Spider, which website are you working on?

Spider: The boss wants me to handle xxxx.com.

Engine: Give me the first URL that needs to be processed.

Spider: Here you go, the first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a request here. Please sort it into the queue for me.

Scheduler: OK, processing. Please wait.

Engine: Hi! Scheduler, give me the request you processed.

Scheduler: Here you go, this is the request I have processed.

Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.

Downloader: OK! Here you go, here is the downloaded response. (If the download fails: Sorry, this request failed to download. The engine then tells the Scheduler: this request failed to download, make a note of it and we will download it again later.)

Engine: Hi! Spider, here is the downloaded response, already processed according to the boss's downloader middleware. Please handle it. (Note: by default, responses are handled by the parse() method.)

Spider: (after processing the data, for the URLs that need to be followed) Hi! Engine, I have two results here: this is the URL I need to follow, and this is the Item data I extracted.

Engine: Hi! Pipeline, I have an Item here, please process it for me! Scheduler, this is a URL that needs to be followed, please process it for me. The loop then repeats from the point where the engine hands a request to the Scheduler, until all the information the boss needs has been obtained.

Pipeline and Scheduler: OK, on it!

Note! The whole program stops only when there are no requests left in the Scheduler (that is, Scrapy also re-downloads URLs that failed to download).
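The conversation above can be modeled as a small loop. The following is a toy sketch in plain Python, not Scrapy's actual implementation: FAKE_SITE, Scheduler, download, parse, and run are all hypothetical stand-ins, and the "web" is just an in-memory dictionary. It only aims to show how requests cycle from the Scheduler through the Downloader and Spider until the queue is empty.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, links found on the page).
FAKE_SITE = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/"]),
    "http://example.com/b": ("page b", []),
}


class Scheduler:
    """Queues URLs and skips any URL that was already queued once."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader stand-in: returns a response, or None if the download fails."""
    return FAKE_SITE.get(url)


def parse(url, response):
    """Spider stand-in: yields one item, then the URLs to follow."""
    title, links = response
    yield {"url": url, "title": title}
    yield from links


def run(start_url):
    scheduler, stored = Scheduler(), []
    scheduler.enqueue(start_url)               # Spider hands over the first URL
    while (url := scheduler.next_request()) is not None:
        response = download(url)               # Engine -> Downloader
        if response is None:
            continue                           # a real engine would schedule a retry here
        for result in parse(url, response):    # Engine -> Spider
            if isinstance(result, dict):
                stored.append(result)          # item -> Item Pipeline
            else:
                scheduler.enqueue(result)      # URL -> back to the Scheduler
    return stored                              # the loop ends when the queue is empty
```

Calling run("http://example.com/") visits each of the three fake pages once, because the Scheduler in this sketch ignores URLs it has already queued.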

Building a Scrapy crawler takes 4 steps:

New project (scrapy startproject xxx): Create a new crawler project

Define the target (write items.py): Specify the data you want to crawl

Make a crawler (spiders/xxspider.py): Write the spider that crawls the web pages

Store the content (pipelines.py): Design pipelines to store the crawled content


Statement:
This article is reproduced from segmentfault.com.