
Tutorial on using Python crawler framework Scrapy

不言 · 2018-10-19 16:02:17

This article is a tutorial on using the Python crawler framework Scrapy. It should be a useful reference for anyone who needs it, and I hope you find it helpful.

Hello everyone. In this article we will take a look at Scrapy, a powerful and easy-to-use asynchronous crawler framework for Python. Let's start with its installation.

Installation of Scrapy

Installing Scrapy can be quite troublesome; for many people who want to use Scrapy, the installation alone makes them give up halfway. Here I will share my own installation process, together with installation methods collected from around the web, and I hope everyone can get it installed smoothly.

Windows installation

Before we begin, make sure Python is installed; this article uses Python 3.5 as an example. Scrapy has many dependencies, so let's install them one by one.

First, run pip -V to check whether pip is installed correctly. If it is, we proceed to the next step;

pip install wheel. wheel is a package we introduced in a previous article; after installing it, we can install software distributed as wheel (.whl) files;

lxml installation. A previous article covered its installation, so let's go over it again here. The whl file address: here. Find the file matching your version and download it. Then locate the file, right-click it, open Properties, switch to the Security tab, and copy the object name (its full path). Open a command prompt (cmd) as administrator and run pip install <path-to-the-whl-file>;

pyOpenSSL: the whl file address is here. Download it; the whl file is installed the same way as above;

Twisted: this framework is an asynchronous networking library and is the core of Scrapy. The whl file address is here;

pywin32: this library provides Python access to the Win32 API. Download address: here; choose the version that matches your Python;

If all the libraries above are installed, then we can install Scrapy itself: pip install scrapy

Isn't that very troublesome? If you would rather not bother, there is also a very convenient way to install it on Windows: use Anaconda, which we mentioned before. You can look up how to install Anaconda yourself, or find it in earlier articles. With Anaconda, installing Scrapy takes only one line:

conda install scrapy

Linux installation

Linux system installation is simpler:

sudo apt-get install build-essential python3-dev libssl-dev libffi-dev libxml2 libxml2-dev libxslt1-dev zlib1g-dev

This installs the build dependencies; after that, install Scrapy itself with pip install scrapy.

Mac OS installation

We need to install some C dependencies first: xcode-select --install

A dialog will prompt you to install the command-line developer tools; click Install. Once the installation is complete, the dependencies are in place.

Then we install Scrapy directly with pip: pip install scrapy

With that, the installation of the Scrapy library is basically taken care of.

Basic use of Scrapy

The address of Scrapy's Chinese documentation: here

Scrapy is an application framework written for crawling website data and extracting structured data. It can be used for a range of purposes, including data mining, information processing, and storing historical data.

Its basic project workflow is:

Create a Scrapy project

Define the extracted Item

Write a spider to crawl the website and extract the Item

Write Item Pipeline to store the extracted Item (i.e. data)

In general, our crawling process is as follows (a rough sketch of this flow appears after the list):

Fetch the index page: request the index page's URL, get its source code, and pass it on for analysis;

Get the content and the next-page link: analyze the source code, extract the index page's data, and get the link to the next page for the next round of crawling;

Paginated crawling: request the next page, analyze its content, and extract the links to request from it;

Save the results: save the crawled results in a specific format, such as text, or store them in a database.
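Below is a rough sketch of this overall flow. The site, selectors, and class name here are hypothetical placeholders, not the real Zhihu Daily example used later; it is only meant to show how the four steps map onto a spider.

from scrapy import Spider, Request

class IndexSpider(Spider):
    # hypothetical spider illustrating the index -> next page -> save flow
    name = "index_demo"
    start_urls = ["https://example.com/index"]

    def parse(self, response):
        # steps 1 and 2: analyze the index page and extract its data
        for title in response.css("h2.title::text").extract():
            yield {"title": title}
        # step 3: follow the next-page link, if there is one
        next_page = response.css("a.next::attr(href)").extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)
        # step 4: saving is handled later by pipelines or the -o option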

Let’s see how to use it step by step.

Create Project

Before you start scraping, you must create a new Scrapy project. Enter the directory where you plan to store the code and run the following command (take Zhihu Daily as an example):

scrapy startproject zhihurb

This command will create a zhihurb directory with the following contents:

zhihurb/
    scrapy.cfg
    zhihurb/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file
zhihurb/: the project's Python module; you will add your code here later
zhihurb/items.py: the project's item definitions
zhihurb/pipelines.py: the project's pipeline file
zhihurb/settings.py: the project's settings file
zhihurb/spiders/: the directory where the spider code is placed

Define Item

In this step we define the data we want to obtain, such as the URLs on the site, the content of an article, its author, and so on. These definitions live in the items.py file.

import scrapy

class ZhihuItem(scrapy.Item):
    name = scrapy.Field()
    article = scrapy.Field()

Writing Spider

This is the step we are most familiar with: writing the crawler. The Scrapy framework frees us from worrying about how the crawling is carried out; we only need to write the crawling logic.

First we create our spider file in the spiders/ folder, for example spider.py. Before writing the crawler, we need to define a few things. Let's take Zhihu Daily as an example: https://daily.zhihu.com/

from scrapy import Spider

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

What have we defined here? First, we imported Scrapy's Spider component. Then we created a spider class, and inside the class we defined our spider's name: zhihu (note: the spider name must be unique; it cannot duplicate the name of any other spider). We also defined a domain scope and a list of start URLs, which means there can be more than one start URL.

Then we define a parse function:

def parse(self, response):
    print(response.text)

We simply print the response to see what this parse function receives.
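Putting the pieces together, the parse method goes inside the spider class. A minimal sketch of the whole spiders/spider.py file would look like this:

from scrapy import Spider

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

    def parse(self, response):
        # for now, just print the downloaded page source
        print(response.text)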

Running the Spider

scrapy crawl zhihu

Since Scrapy does not support running inside an IDE, we have to run the command from the command line, making sure we have first cd'd into the project directory. Then run the command above: crawl tells Scrapy to run a spider, and zhihu is the name of the spider we defined.

Looking at the output, we first see some spider-related log lines; the log contains the initial URLs defined in start_urls, corresponding one-to-one with those in the spider. After that we can see the page's source code printed out. We seem to have done almost nothing, yet we already get the page source; this is one of the conveniences of Scrapy.

Extracting Data

Next we can parse the source code with a parsing tool and get the data.

Scrapy has built-in CSS and XPath selectors. We could use BeautifulSoup instead, but its drawback is that it is slow, which does not fit Scrapy's style, so I recommend using CSS or XPath.

I have not written about XPath or CSS selector usage before, but they are not difficult; if you are familiar with your browser's developer tools, you can pick them up easily.

Let's take extracting the article URLs from Zhihu Daily as an example:

from scrapy import Request

def parse(self, response):
    urls = response.xpath('//p[@class="box"]/a/@href').extract()
    for url in urls:
        yield Request(url, callback=self.parse_url)

Here we use XPath to extract all the URLs (extract() returns the full list of matches, while extract_first() returns only the first one). Then, using the yield keyword, we hand each URL to a Request whose callback is the next function, which will parse that URL.
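As a quick comparison of the two selector styles, here is a method sketch you could drop into a spider. The markup it targets is hypothetical, not the real Zhihu Daily page; the point is only that the same data can usually be reached with either XPath or CSS:

def parse(self, response):
    # extract_first() returns the first match, extract() returns a list of all matches
    title_by_xpath = response.xpath('//h1[@class="title"]/text()').extract_first()
    title_by_css = response.css('h1.title::text').extract_first()
    all_links = response.css('a::attr(href)').extract()
    self.logger.info('title: %s, %d links found', title_by_xpath, len(all_links))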

Using Item

Detailed component usage is left for the next chapter. Here, suppose we have parsed out the article content and title, and we want to save the extracted data into an Item container.

An Item object is essentially a custom Python dictionary. You can use standard dictionary syntax to get the value of each of its fields (the fields are the attributes we assigned with Field earlier).
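For example, a small sketch of that dictionary-style access, using the ZhihuItem we defined earlier (the values are placeholders):

from zhihurb.items import ZhihuItem

item = ZhihuItem()
item['name'] = 'Some title'            # set a field with dict syntax
print(item['name'])                    # read it back the same way
print(item.get('article', 'no text'))  # .get() with a default also works
print(dict(item))                      # an Item converts to a plain dict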

Suppose our next parse function has extracted the data:

def parse_url(self, response):
    # name = xxxx
    # article = xxxx

    # save the data into the item
    item = ZhihuItem()
    item['name'] = name
    item['article'] = article

    # return the item
    yield item

Saving the Crawled Data

Here we operate on the data in the pipeline file, pipelines.py. For example, we might want to keep only the first 5 characters of each article title and then save it to a text file, or we might want to save the data to a database; all of that is done in the pipeline file. We will cover it in detail later.
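As a preview, here is a minimal pipeline sketch for exactly that example: keep only the first 5 characters of the title and append the record to a text file. The class name and the output file name are assumptions, and the pipeline still has to be enabled under ITEM_PIPELINES in settings.py:

# pipelines.py - a minimal sketch, assuming item['name'] holds the article title
class ZhihurbPipeline(object):
    def process_item(self, item, spider):
        # keep only the first 5 characters of the title
        item['name'] = item['name'][:5]
        # append the trimmed title to a text file
        with open('articles.txt', 'a', encoding='utf-8') as f:
            f.write(item['name'] + '\n')
        return item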

The simplest way to store the data, though, is with a command-line option:

scrapy crawl zhihu -o items.json

This command saves our data to a JSON file in the project's root directory. We can also save it in other formats, such as csv or pickle, simply by changing the extension at the end of the command.


Statement: This article is reproduced from segmentfault.com. If there is any infringement, please contact admin@php.cn for deletion.