A brief introduction to the Python crawler framework Scrapy - Python Tutorial - php.cn

不言

Oct 19, 2018 pm 05:04 PM

This article gives a brief introduction to the Python crawler framework Scrapy: what it is, which components it is built from, how a crawl flows through them, and the steps for building a crawler.

Scrapy Framework

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses, such as data mining, information processing, and historical archiving.

Because the framework does most of the work, users only need to customize a few modules to build a crawler that fetches web content and images.

Scrapy uses the Twisted (pronounced /ˈtwɪstɪd/) asynchronous networking framework to handle network communication. This speeds up downloads without requiring us to implement asynchronous I/O ourselves. Scrapy also exposes a range of middleware interfaces, so it can be adapted flexibly to different needs.
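To see why asynchronous I/O speeds up downloads, here is a minimal sketch. It uses the standard library's asyncio as a stand-in for Twisted (Scrapy itself runs on Twisted, not on this code), and fake_download, its URLs, and the 0.1 second delay are hypothetical placeholders that simulate network latency instead of fetching anything.

```python
import asyncio
import time


async def fake_download(url: str, delay: float = 0.1) -> str:
    """Pretend to fetch a page; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def download_all(urls: list[str]) -> list[str]:
    # gather() starts every download at once and waits for all of them,
    # returning results in the same order as the input URLs.
    return await asyncio.gather(*(fake_download(u) for u in urls))


urls = [f"http://example.com/page/{i}" for i in range(5)]  # hypothetical URLs
start = time.perf_counter()
pages = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start

# Five sequential downloads would take about 0.5s; overlapping the waits
# brings the total close to the duration of a single download.
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

The waits overlap because each simulated download yields control while it is idle, which is the same idea an asynchronous network framework applies to real sockets.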

[Figure: Scrapy architecture diagram; the green lines show the direction of data flow.]

Scrapy Engine: Responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: Accepts the Requests sent by the engine, arranges them into a queue, and returns them to the engine when the engine asks for them.

Downloader: Downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them over to the Spider for processing.

Spider: Processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs that need to be followed to the engine, which passes them to the Scheduler again.

Item Pipeline: Processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares: Components that you can customize to extend the download functionality.

Spider Middlewares: Components that you can customize to extend and operate on the communication between the engine and the Spider (for example, Responses entering the Spider and Requests leaving it).
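As an illustration of what "customizing the download function" can look like, here is a minimal sketch of a downloader middleware. It follows the process_request / process_response method signatures long documented by Scrapy, but the class name DefaultUserAgentMiddleware, the "my-crawler/0.1" string, and the FakeRequest stand-in are assumptions made up for this example so that it runs without Scrapy installed.

```python
class DefaultUserAgentMiddleware:
    """Downloader middleware sketch: add a User-Agent header if missing."""

    def __init__(self, user_agent: str = "my-crawler/0.1"):  # hypothetical value
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing the request normally.
        return None

    def process_response(self, request, response, spider):
        # Called for each response on its way back towards the engine.
        # Returning it unchanged simply passes it along.
        return response


class FakeRequest:
    """Stand-in for a Scrapy request, used only to exercise the sketch."""

    def __init__(self, url, headers=None):
        self.url = url
        self.headers = dict(headers or {})


mw = DefaultUserAgentMiddleware()
req = FakeRequest("http://example.com/")
mw.process_request(req, spider=None)
print(req.headers)

# A header that is already set is left untouched.
custom = FakeRequest("http://example.com/", headers={"User-Agent": "other/1.0"})
mw.process_request(custom, spider=None)
print(custom.headers)
```

In a real project a class like this would typically be enabled through the DOWNLOADER_MIDDLEWARES setting; the exact module path depends on your project layout.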

Scrapy operation process

The code is written and the program starts to run...

1. Engine: Hi! Spider, which website are you working on?

2. Spider: The boss wants me to handle xxxx.com.

3. Engine: Give me the first URL that needs to be processed.

4. Spider: Here you go, the first URL is xxxxxxx.com.

5. Engine: Hi! Scheduler, I have a request here. Please sort it into the queue for me.

6. Scheduler: OK, processing. Please wait.

7. Engine: Hi! Scheduler, give me the request you have processed.

8. Scheduler: Here you go, this is the request I have processed.

9. Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.

10. Downloader: OK! Here you go, here is the downloaded response. (If it fails: Sorry, the download of this request failed. The engine then tells the scheduler: the download of this request failed, record it and we will download it again later.)

11. Engine: Hi! Spider, here is the downloaded response, already processed according to the boss's downloader middleware. Please handle it yourself. (Note: by default, responses are handled by the parse() function.)

12. Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two results here: these are the URLs I need to follow up, and this is the Item data I extracted.

13. Engine: Hi! Pipeline, I have an item here, please handle it for me. Scheduler! Here are URLs that need to be followed, please handle them for me. Then the loop starts again from step 4 and continues until all the information the boss needs has been obtained.

14. Pipeline and Scheduler: OK, doing it now!

Note: the whole program stops only when there are no requests left in the scheduler. (In other words, Scrapy will also re-download URLs whose download failed.)
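The conversation above can be sketched as a toy loop in plain Python. This is not Scrapy's engine: FAKE_SITE, download, parse, and run are hypothetical names, the "website" is an in-memory dictionary, and retrying failed downloads is left out. It only shows how requests and items circulate until the scheduler's queue is empty.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page).
FAKE_SITE = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider callback: yield one item, then the URLs to follow."""
    yield {"type": "item", "url": response["url"], "text": response["text"]}
    for link in response["links"]:
        yield {"type": "request", "url": link}


def run(start_url):
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # URLs that have already been queued
    stored = []                     # stands in for the Item Pipeline's storage
    while scheduler:                # the crawl stops when the queue is empty
        url = scheduler.popleft()       # engine asks the scheduler for a request
        response = download(url)        # engine hands it to the downloader
        for result in parse(response):  # engine hands the response to the spider
            if result["type"] == "request":
                if result["url"] not in seen:
                    seen.add(result["url"])
                    scheduler.append(result["url"])  # follow-up goes back to the scheduler
            else:
                stored.append(result)                # item goes to the pipeline
    return stored


items = run("http://example.com/")
print([item["url"] for item in items])
```

Each page is visited once even though page b is linked twice, because the sketch remembers which URLs it has already queued.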

Making a Scrapy crawler takes 4 steps:

Create a project (scrapy startproject xxx): create a new crawler project.

Define the target (write items.py): specify the data you want to crawl.

Write the spider (spiders/xxspider.py): write the spider that crawls the web pages.

Store the content (pipelines.py): design pipelines to store the crawled content.
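As a rough sketch of the items.py and pipelines.py steps, the following defines an item and a pipeline that writes items to a JSON Lines file. It assumes a Scrapy version that accepts dataclass items (scrapy.Item with scrapy.Field is the classic alternative), and it uses the open_spider / close_spider / process_item hooks with the signatures long documented by Scrapy. The names PageItem and JsonLinesPipeline, the fields, and the output path are hypothetical, and the code runs without Scrapy because a pipeline is an ordinary Python class.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class PageItem:
    """items.py: the fields we want to collect (hypothetical example fields)."""
    url: str
    title: str


class JsonLinesPipeline:
    """pipelines.py: write each item as one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical default output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # asdict() works here because the item is a dataclass instance.
        self.file.write(json.dumps(asdict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline


# Exercise the pipeline by hand, the way the framework would drive it.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item(PageItem(url="http://example.com/", title="Example"), spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    print(f.read())
```

In a real project the pipeline would typically be enabled through the ITEM_PIPELINES setting, and the items would come from a spider's parse() callback instead of being created by hand.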

Statement

This article is reproduced from segmentfault思否. If there is any infringement, please contact admin@php.cn to have it removed.
