
This article gives a brief introduction to Scrapy, the Python crawler framework. It should be a useful reference for readers who need it.

Scrapy Framework

Scrapy is an application framework written in pure Python for crawling website data and extracting structured data. It has a wide range of uses.

With the power of the framework, users only need to customize a few modules to implement a crawler that scrapes web content and images, which is very convenient.

Scrapy uses the Twisted (pronounced /ˈtwɪstɪd/) asynchronous networking framework to handle network communication, which speeds up downloads without requiring us to implement an asynchronous framework ourselves. It also provides various middleware interfaces, so it can flexibly accommodate all kinds of requirements.

Scrapy architecture diagram (the green line is the data flow direction):


Scrapy Engine: Responsible for communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: Responsible for accepting Request objects sent by the engine, arranging them into a queue in a certain order, and returning them to the engine when the engine asks for them.

Downloader: Responsible for downloading all Requests sent by the Scrapy Engine and returning the Responses it obtains to the engine, which hands them over to the Spider for processing.

Spider: Responsible for processing all Responses, analyzing them and extracting data, obtaining the data required for the Item fields, and submitting URLs that need follow-up to the engine, which enters them into the Scheduler again.

Item Pipeline: Responsible for processing the Items obtained from the Spider and performing post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares: Components that you can customize to extend the download functionality.

Spider Middlewares: Functional components that you can customize to extend the communication between the engine and the Spider (such as Responses entering the Spider, and Requests leaving the Spider).
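These pluggable components are wired up in a project's settings.py. As a minimal sketch (the `myproject.…` class paths below are hypothetical placeholders, not part of any real project), enabling one pipeline and one of each middleware looks like this:

```python
# settings.py (fragment) -- ITEM_PIPELINES, DOWNLOADER_MIDDLEWARES and
# SPIDER_MIDDLEWARES are real Scrapy setting names; the dotted class
# paths are made-up placeholders for your own components.
# The integer values set the ordering of components.

ITEM_PIPELINES = {
    "myproject.pipelines.MyItemPipeline": 300,
}

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MySpiderMiddleware": 543,
}
```

Components you don't list here simply aren't used; Scrapy also ships built-in middlewares that are enabled by default.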



Scrapy's operation process

The code is written and the program starts to run...

Engine: Hi! Spider, which website are you working on?

Spider: The boss wants me to handle xxxx.com.

Engine: Give me the first URL that needs to be processed.

Spider: Here you go, the first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a Request here; please help me sort it into the queue.

Scheduler: OK, processing. Please wait.

Engine: Hi! Scheduler, give me the request you processed.

Scheduler: Here you go, this is the Request I have processed.

Engine: Hi! Downloader, please download this Request for me, according to the boss's downloader middleware settings.

Downloader: OK! Here you go, here's the download. (If it fails: Sorry, this Request failed to download. The engine then tells the Scheduler: this Request failed to download; record it and we'll download it again later.)

Engine: Hi! Spider, here's something that has been downloaded and processed according to the boss's downloader middleware. Handle it yourself. (Note! By default, the Responses here are handled by the def parse() function.)

Spider: (after processing the data, for URLs that need follow-up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.

Engine: Hi! Pipeline, I have an Item here; please help me deal with it! Scheduler! This is a URL that needs follow-up; please help me deal with it. Then the loop starts again from step 4, until all the information the boss needs has been obtained.

Pipeline and Scheduler: OK, doing it now!

Notice! The whole program stops only when there are no Requests left in the Scheduler. (That is, Scrapy will also re-download URLs that failed to download.)
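The dialogue above is just the engine's event loop. As a rough illustration (this is not Scrapy's actual code; the URLs, the fake "internet" dict, and all function names are made up for the sketch), the loop can be modeled in plain Python with a queue, a stub downloader, and a parse callback:

```python
from collections import deque

# A toy model of Scrapy's loop: the Spider seeds a start URL, the
# Scheduler queues Requests, the Downloader fetches them, parse()
# yields Items and follow-up URLs, and everything stops once the
# queue is empty.

PAGES = {  # fake "internet": url -> (body, follow-up urls)
    "http://xxxx.com/": ("start page", ["http://xxxx.com/a", "http://xxxx.com/b"]),
    "http://xxxx.com/a": ("page a", []),
    "http://xxxx.com/b": ("page b", []),
}

def download(url):
    # The real Downloader does this over the network, asynchronously.
    return PAGES[url]

def parse(response):
    body, links = response
    yield {"item": body}       # Item -> handed to the Item Pipeline
    for link in links:         # follow-up URL -> back to the Scheduler
        yield {"request": link}

def run(start_url):
    scheduler, items = deque([start_url]), []
    while scheduler:                            # stop when no Requests remain
        url = scheduler.popleft()
        for result in parse(download(url)):
            if "item" in result:
                items.append(result["item"])    # the Item Pipeline step
            else:
                scheduler.append(result["request"])
    return items

collected = run("http://xxxx.com/")
```

Running `run("http://xxxx.com/")` crawls the start page, follows the two links it yields, and collects three items; the loop exits exactly when the queue is drained, mirroring the notice above.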

There are 4 steps required to make a Scrapy crawler:

New project (scrapy startproject xxx): Create a new crawler project

Define the goal (write items.py): define the targets you want to crawl

Make a crawler (spiders/xxspider.py): Make a crawler to start crawling web pages

Storage content (pipelines.py): Design pipelines to store crawled content
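As an illustration of the last step, a pipeline is just a class exposing a process_item() method, the same method name Scrapy calls on each Item. The sketch below mimics that interface in plain Python so it runs without Scrapy installed; the DropItem class here is a stand-in for scrapy.exceptions.DropItem, and the field names are invented for the example:

```python
# pipelines.py-style sketch: dedupe Items by URL and "store" the rest.

class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupAndStorePipeline:
    def __init__(self):
        self.seen = set()
        self.stored = []        # a real pipeline might write to a DB or file

    def process_item(self, item, spider=None):
        if item["url"] in self.seen:           # filtering step
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen.add(item["url"])
        self.stored.append(item)               # storage step
        return item                            # pass on to the next pipeline

pipeline = DedupAndStorePipeline()
pipeline.process_item({"url": "http://xxxx.com/a", "title": "A"})
pipeline.process_item({"url": "http://xxxx.com/b", "title": "B"})
try:
    pipeline.process_item({"url": "http://xxxx.com/a", "title": "A again"})
except DropItem:
    pass  # duplicates are discarded rather than stored
```

Returning the item from process_item() is what lets several pipelines be chained in the order configured in settings.py.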

The above is the detailed content of A brief introduction to the Python crawler framework Scrapy. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced from SegmentFault (思否). If there is any infringement, please contact admin@php.cn to have it deleted.