
Practical use of crawler framework Scrapy to capture recruitment information in batches

高洛峰 · Original · 2016-10-17 13:49:34

A web crawler is a program that crawls data across the Internet, or in a specific direction. That description is not very precise; more professionally, it fetches the HTML of particular websites' pages. Since a website has many pages and we cannot know all of their URLs in advance, how to make sure we capture every HTML page of the site is a question worth studying. The general approach is to define an entry page; a page usually contains the URLs of other pages, so those URLs are extracted from the current page and added to the crawler's queue, and the same operation is repeated recursively on each new page. In effect this is just a depth-first or breadth-first traversal, as sketched below.
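To make the idea concrete, here is a minimal, framework-free breadth-first crawl sketch (Python 3, standard library only; a crude regex stands in for real HTML parsing, and a real crawler such as Scrapy also handles robots.txt, politeness delays, error handling, etc.):

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href="(http[^"]+)"')   # crude link extraction, for illustration only

def crawl(entry_url, max_pages=50):
    queue = deque([entry_url])   # crawl queue seeded with the entry page
    seen = {entry_url}           # remember visited URLs to avoid loops
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = urlopen(url).read().decode("utf-8", "ignore")
        pages[url] = html
        for link in HREF_RE.findall(html):    # URLs found on the current page
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                queue.append(link)            # breadth-first: append to the tail of the queue
    return pages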

Scrapy is a crawler framework based on Twisted and implemented in pure Python. Users only need to customize a few modules to implement a crawler that grabs web content and all kinds of images. It is very convenient.

Scrapy uses Twisted, an asynchronous networking library, to handle network communication. It has a clear architecture and contains various middleware interfaces, so it can flexibly meet all sorts of needs. The overall architecture is shown in the figure below:

[Figure: Scrapy overall architecture]

The green lines show the data flow. Starting from the initial URL, the Scheduler hands it to the Downloader to be downloaded, and the downloaded response is passed to the Spider for analysis. The Spider produces two kinds of results: links that need to be crawled further, such as the "next page" link mentioned earlier, which are sent back to the Scheduler; and data that needs to be saved, which is sent to the Item Pipeline, where post-processing (detailed analysis, filtering, storage, etc.) takes place. In addition, various middleware can be installed along the data-flow channels to perform any necessary processing.


I assume you already have Scrapy installed. If you don't have it installed, you can refer to this article.

In this article, we will learn how to use Scrapy to build a crawler and crawl content from a specified website. The basic steps are:

1. Create a new Scrapy Project

2. Define the Items you need to extract from the web pages

3. Implement a Spider class that crawls URLs and extracts Items

4. Implement an Item Pipeline class that stores the extracted Items

I will use the Tencent recruitment official website as an example.

Github source code: https://github.com/maxliaops/scrapy-itzhaopin


Goal: Grab job recruitment information from Tencent recruitment official website and save it in JSON format.

New project

First, create a new project for our crawler. Enter a directory (any directory where we keep our code) and execute:

scrapy startproject itzhaopin

The trailing itzhaopin is the project name. This command will create a new directory itzhaopin in the current directory, with the following structure:

.
├── itzhaopin
│   ├── itzhaopin
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg

scrapy.cfg: project configuration file

items.py: definition of the data structures to be extracted

pipelines.py: pipeline definitions, used to further process the extracted Items, e.g. to save them

settings.py: crawler configuration file

spiders: directory where the spiders live


Define Item

Define the data we want to crawl in items.py:

from scrapy.item import Item, Field
class TencentItem(Item):
    name = Field()                # job title
    catalog = Field()             # job category
    workLocation = Field()        # work location
    recruitNumber = Field()       # number of openings
    detailLink = Field()          # link to the job detail page
    publishTime = Field()         # publish date
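For reference, a Scrapy Item behaves much like a dictionary restricted to the fields declared above, which is how the spider below fills it in:

item = TencentItem()
item['name'] = u'TEG13-Backend Development Engineer'  # only declared fields are accepted
print(dict(item))   # {'name': u'TEG13-Backend Development Engineer'}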

Implement Spider

A Spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three required members:

name: the identifier of this spider

start_urls: a list of URLs from which the spider starts crawling

parse(): a method that is called when a page from start_urls has been downloaded; it parses the page content and returns either the next requests to crawl or a list of extracted items

So create a new spider in the spiders directory, tencent_spider.py:

import re
import json
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from itzhaopin.items import *
from itzhaopin.misc.log import *
class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php"
    ]
    rules = [  # define the rules for which URLs to crawl
        Rule(sle(allow=("/position.php\?&start=\d{,4}#a")), follow=True, callback='parse_item')
    ]
    def parse_item(self, response):  # extract data into Items, mainly using XPath and CSS selectors
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)
        sites_even = sel.css('table.tablelist tr.even')
        for site in sites_even:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()
            items.append(item)
            #print repr(item).decode("unicode-escape") + '\n'
        sites_odd = sel.css('table.tablelist tr.odd')
        for site in sites_odd:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()
            items.append(item)
            #print repr(item).decode("unicode-escape") + '\n'
        info('parsed ' + str(response))
        return items
    def _process_request(self, request):
        info('process ' + str(request))
        return request
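Note that the imports above target the Scrapy releases current when this article was written. On recent Scrapy versions (roughly 1.0 and later) the same spider would use the renamed modules; a sketch of the equivalent imports:

from scrapy.spiders import CrawlSpider, Rule             # was scrapy.contrib.spiders
from scrapy.linkextractors import LinkExtractor as sle   # replaces SgmlLinkExtractor
# urljoin_rfc is gone; use response.urljoin(relative_url) to build absolute links instead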

Implement Pipeline

The pipeline is used to save the Items returned by the Spider; they can be written to a file, a database, and so on.

The pipeline has only one method that must be implemented: process_item. For example, let's save the Items to a file in JSON format:

pipelines.py

from scrapy import signals
import json
import codecs
class JsonWithEncodingTencentPipeline(object):
    def __init__(self):
        self.file = codecs.open('tencent.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
    def spider_closed(self, spider):
        self.file.close()
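For the pipeline to actually run, it has to be enabled in settings.py. A minimal sketch (older Scrapy versions accept a plain list of class paths, newer ones expect a dict mapping the class path to an order value):

# settings.py
ITEM_PIPELINES = {
    'itzhaopin.pipelines.JsonWithEncodingTencentPipeline': 300,
}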

With this, we have completed a basic crawler. You can enter the following command to start the spider:

scrapy crawl tencent

After the crawler is finished running, a file named tencent.json will be generated in the current directory, which saves job recruitment information in JSON format.

Part of the content is as follows:

{"recruitNumber": ["1"], "name": ["SD5-Senior Mobile Game Planning (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id =15626&keywords=&tid=0&lid=0", "publishTime": ["2014-04-25"], "catalog": ["Product/Project Category"], "workLocation": ["Shenzhen"]}

{ "recruitNumber": ["1"], "name": ["TEG13-Backend Development Engineer (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15666&keywords= &tid=0&lid=0", "publishTime": ["2014-04-25"], "catalog": ["Technology"], "workLocation": ["Shenzhen"]}

{"recruitNumber": [ "2"], "name": ["TEG12-Data Center Senior Manager (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15698&keywords=&tid=0&lid= 0", "publishTime": ["2014-04-25"], "catalog": ["Technical Category"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["1"] , "name": ["GY1-WeChat Pay Brand Planning Manager (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15710&keywords=&tid=0&lid=0", "publishTime": ["2014-04-25"], "catalog": ["Market Class"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["2"], "name ": ["SNG06-Backend Development Engineer (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15499&keywords=&tid=0&lid=0", "publishTime": [ "2014-04-25"], "catalog": ["Technology"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["2"], "name": ["OMG01 -Tencent Fashion Video Planning Editor (Beijing)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15694&keywords=&tid=0&lid=0", "publishTime": ["2014- 04-25"], "catalog": ["Content Editing Class"], "workLocation": ["Beijing"]}

{"recruitNumber": ["1"], "name": ["HY08-QT Client Windows Development Engineer (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=11378&keywords=&tid=0&lid=0", "publishTime": ["2014-04 -25"], "catalog": ["Technology"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["5"], "name": ["HY1-Mobile Game Test Manager (Shanghai)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=15607&keywords=&tid=0&lid=0", "publishTime": ["2014-04-25"] , "catalog": ["Technical"], "workLocation": ["Shanghai"]}

{"recruitNumber": ["1"], "name": ["HY6-Internet Cafe Platform Senior Product Manager (Shenzhen )"], "detailLink": "http://hr.tencent.com/position_detail.php?id=10974&keywords=&tid=0&lid=0", "publishTime": ["2014-04-25"], "catalog ": ["Product/Project Category"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["4"], "name": ["TEG14-Cloud Storage R&D Engineer (Shenzhen)" ], "detailLink": "http://hr.tencent.com/position_detail.php?id=15168&keywords=&tid=0&lid=0", "publishTime": ["2014-04-24"], "catalog": ["Technical"], "workLocation": ["Shenzhen"]}

{"recruitNumber": ["1"], "name": ["CB-Compensation Manager (Shenzhen)"], "detailLink": "http://hr.tencent.com/position_detail.php?id=2309&keywords=&tid=0&lid=0", "publishTime": ["2013-11-28"], "catalog": ["Functional Class"] , "workLocation": ["Shenzhen"]}

