
Tutorial on using Python crawler framework Scrapy

不言 · 2018-10-19 16:02:17

This article is a tutorial on using the Python crawler framework Scrapy. It should be a useful reference for anyone who needs it, and I hope you find it helpful.

Hello everyone. In this article we will take a look at Scrapy, a powerful and easy-to-use asynchronous crawler framework for Python. Let's start with its installation.

Installation of Scrapy

Installing Scrapy can be quite troublesome; for many people who want to use Scrapy, the installation alone makes them give up halfway. Here I will share my own installation process, together with installation methods collected from around the web, and I hope everyone can get it installed smoothly.

Windows installation

Before we begin, make sure Python is installed; this article uses Python 3.5 as an example. Scrapy has many dependencies, so let's install them one by one.

First, run pip -V to check whether pip is installed correctly. If it is, we proceed to the next step;

pip install wheel. wheel is a package we introduced in a previous article; after installing it, we can install software distributed as wheel (.whl) files;

lxml installation. A previous article covered its installation, so let's go over it again here. The whl file address: here. Find the file matching your version and download it. Then locate the file, right-click it, open Properties, switch to the Security tab, and copy the object name (its full path). Open a command prompt (cmd) as administrator and run pip install <path-to-the-whl-file>;

pyOpenSSL: the whl file address is here. Download it; the whl file is installed the same way as above;

Twisted: this framework is an asynchronous networking library and is the core of Scrapy. The whl file address is here;

pywin32: this library provides Python access to the Win32 API. Download address: here; choose the version that matches your Python;

If all the libraries above are installed, then we can install Scrapy itself: pip install scrapy

Isn't that very troublesome? If you would rather not bother, there is also a very convenient way to install it on Windows: use Anaconda, which we mentioned before. You can look up how to install Anaconda yourself, or find it in earlier articles. With Anaconda, installing Scrapy takes only one line:

conda install scrapy

Linux installation

Linux system installation is simpler:

sudo apt-get install build-essential python3-dev libssl-dev libffi-dev libxml2 libxml2-dev libxslt1-dev zlib1g-dev

This installs the build dependencies; after that, install Scrapy itself with pip install scrapy.

Mac OS installation

We need to install some C dependencies first: xcode-select --install

A dialog will prompt you to install the command-line developer tools; click Install. Once the installation is complete, the dependencies are in place.

Then we install Scrapy directly with pip: pip install scrapy

With that, the installation of the Scrapy library is basically taken care of.

Basic use of Scrapy

The address of Scrapy's Chinese documentation: here

Scrapy is an application framework written for crawling website data and extracting structured data. It can be used for a range of purposes, including data mining, information processing, and storing historical data.

Its basic project workflow is:

Create a Scrapy project

Define the extracted Item

Write a spider to crawl the website and extract the Item

Write Item Pipeline to store the extracted Item (i.e. data)

In general, our crawling process is as follows (a rough sketch of this flow appears after the list):

Fetch the index page: request the index page's URL, get its source code, and pass it on for analysis;

Get the content and the next-page link: analyze the source code, extract the index page's data, and get the link to the next page for the next round of crawling;

Paginated crawling: request the next page, analyze its content, and extract the links to request from it;

Save the results: save the crawled results in a specific format, such as text, or store them in a database.
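Below is a rough sketch of this overall flow. The site, selectors, and class name here are hypothetical placeholders, not the real Zhihu Daily example used later; it is only meant to show how the four steps map onto a spider.

from scrapy import Spider, Request

class IndexSpider(Spider):
    # hypothetical spider illustrating the index -> next page -> save flow
    name = "index_demo"
    start_urls = ["https://example.com/index"]

    def parse(self, response):
        # steps 1 and 2: analyze the index page and extract its data
        for title in response.css("h2.title::text").extract():
            yield {"title": title}
        # step 3: follow the next-page link, if there is one
        next_page = response.css("a.next::attr(href)").extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)
        # step 4: saving is handled later by pipelines or the -o option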

Let’s see how to use it step by step.

Create Project

Before you start scraping, you must create a new Scrapy project. Enter the directory where you plan to store the code and run the following command (take Zhihu Daily as an example):

scrapy startproject zhihurb

This command will create a zhihurb directory with the following contents:

zhihurb/
    scrapy.cfg
    zhihurb/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file
zhihurb/: the project's Python module; you will add your code here later
zhihurb/items.py: the project's item definitions
zhihurb/pipelines.py: the project's pipeline file
zhihurb/settings.py: the project's settings file
zhihurb/spiders/: the directory where the spider code is placed

Define Item

In this step we define the data we want to obtain, such as the URLs on the site, the content of an article, its author, and so on. These definitions live in the items.py file.

import scrapy

class ZhihuItem(scrapy.Item):
    name = scrapy.Field()
    article = scrapy.Field()

Writing Spider

This is the step we are most familiar with: writing the crawler. The Scrapy framework frees us from worrying about how the crawling is carried out; we only need to write the crawling logic.

First we create our spider file in the spiders/ folder, for example spider.py. Before writing the crawler, we need to define a few things. Let's take Zhihu Daily as an example: https://daily.zhihu.com/

from scrapy import Spider

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

What have we defined here? First, we imported Scrapy's Spider component. Then we created a spider class, and inside the class we defined our spider's name: zhihu (note: the spider name must be unique; it cannot duplicate the name of any other spider). We also defined a domain scope and a list of start URLs, which means there can be more than one start URL.

Then we define a parse function:

def parse(self, response):
    print(response.text)

We simply print the response to see what this parse function receives.
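Putting the pieces together, the parse method goes inside the spider class. A minimal sketch of the whole spiders/spider.py file would look like this:

from scrapy import Spider

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

    def parse(self, response):
        # for now, just print the downloaded page source
        print(response.text)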

Running the Spider

scrapy crawl zhihu

Since Scrapy does not support running inside an IDE, we have to run the command from the command line, making sure we have first cd'd into the project directory. Then run the command above: crawl tells Scrapy to run a spider, and zhihu is the name of the spider we defined.

Looking at the output, we first see some spider-related log lines; the log contains the initial URLs defined in start_urls, corresponding one-to-one with those in the spider. After that we can see the page's source code printed out. We seem to have done almost nothing, yet we already get the page source; this is one of the conveniences of Scrapy.

Extracting Data

Next we can parse the source code with a parsing tool and get the data.

Scrapy has built-in CSS and XPath selectors. We could use BeautifulSoup instead, but its drawback is that it is slow, which does not fit Scrapy's style, so I recommend using CSS or XPath.

I have not written about XPath or CSS selector usage before, but they are not difficult; if you are familiar with your browser's developer tools, you can pick them up easily.

Let's take extracting the article URLs from Zhihu Daily as an example:

from scrapy import Request

def parse(self, response):
    urls = response.xpath('//p[@class="box"]/a/@href').extract()
    for url in urls:
        yield Request(url, callback=self.parse_url)

Here we use XPath to extract all the URLs (extract() returns the full list of matches, while extract_first() returns only the first one). Then, using the yield keyword, we hand each URL to a Request whose callback is the next function, which will parse that URL.
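As a quick comparison of the two selector styles, here is a method sketch you could drop into a spider. The markup it targets is hypothetical, not the real Zhihu Daily page; the point is only that the same data can usually be reached with either XPath or CSS:

def parse(self, response):
    # extract_first() returns the first match, extract() returns a list of all matches
    title_by_xpath = response.xpath('//h1[@class="title"]/text()').extract_first()
    title_by_css = response.css('h1.title::text').extract_first()
    all_links = response.css('a::attr(href)').extract()
    self.logger.info('title: %s, %d links found', title_by_xpath, len(all_links))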

Using Item

Detailed component usage is left for the next chapter. Here, suppose we have parsed out the article content and title, and we want to save the extracted data into an Item container.

An Item object is essentially a custom Python dictionary. You can use standard dictionary syntax to get the value of each of its fields (the fields are the attributes we assigned with Field earlier).
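For example, a small sketch of that dictionary-style access, using the ZhihuItem we defined earlier (the values are placeholders):

from zhihurb.items import ZhihuItem

item = ZhihuItem()
item['name'] = 'Some title'            # set a field with dict syntax
print(item['name'])                    # read it back the same way
print(item.get('article', 'no text'))  # .get() with a default also works
print(dict(item))                      # an Item converts to a plain dict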

Suppose our next parse function has extracted the data:

def parse_url(self, response):
    # name = xxxx
    # article = xxxx

    # save the data into the item
    item = ZhihuItem()
    item['name'] = name
    item['article'] = article

    # return the item
    yield item

Saving the Crawled Data

Here we operate on the data in the pipeline file, pipelines.py. For example, we might want to keep only the first 5 characters of each article title and then save it to a text file, or we might want to save the data to a database; all of that is done in the pipeline file. We will cover it in detail later.
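As a preview, here is a minimal pipeline sketch for exactly that example: keep only the first 5 characters of the title and append the record to a text file. The class name and the output file name are assumptions, and the pipeline still has to be enabled under ITEM_PIPELINES in settings.py:

# pipelines.py - a minimal sketch, assuming item['name'] holds the article title
class ZhihurbPipeline(object):
    def process_item(self, item, spider):
        # keep only the first 5 characters of the title
        item['name'] = item['name'][:5]
        # append the trimmed title to a text file
        with open('articles.txt', 'a', encoding='utf-8') as f:
            f.write(item['name'] + '\n')
        return item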

The simplest way to store the data, though, is with a command-line option:

scrapy crawl zhihu -o items.json

This command saves our data to a JSON file in the project's root directory. We can also save it in other formats, such as csv or pickle, simply by changing the extension at the end of the command.


Statement: This article is reproduced from segmentfault.com. If there is any infringement, please contact admin@php.cn for deletion.