
Scrapy crawler practice: How to crawl the Chinese Academy of Social Sciences document database data?

王林 · Original
2023-06-22 08:36:06

With the development of the Internet, more and more information is published in digital form, and the large volumes of data held on websites are becoming increasingly valuable. Crawling this data makes analysis and further processing far more convenient. Scrapy is one of the most commonly used crawler frameworks, and this article explains how to crawl the Chinese Academy of Social Sciences document database with a Scrapy crawler.

1. Install scrapy

Scrapy is an open-source web crawler framework written in Python that can be used to crawl websites and extract structured data. Before we begin, we need to install Scrapy. The installation command is as follows:

pip install scrapy
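If the installation succeeds, you can check it by printing the installed version (the exact version number will depend on your environment):

scrapy version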

2. Write the crawler code

Next, we need to create a Scrapy project and write the crawler code. First, create a new Scrapy project from the terminal:

scrapy startproject cssrc

Then, enter the project directory and create a new spider:

cd cssrc
scrapy genspider cssrc_spider cssrc.ac.cn
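These commands generate Scrapy's standard project skeleton, which looks roughly like the following (a typical layout; minor details may vary between Scrapy versions):

cssrc/
    scrapy.cfg            # deployment configuration
    cssrc/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            cssrc_spider.py   # the spider created by genspider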

In the spider file, we need to set a few parameters. Specifically, we set the start_urls attribute to define the URLs we want to crawl, and implement the parse method to process the website's responses. The generated skeleton looks like this:

# -*- coding: utf-8 -*-
import scrapy


class CssrcSpiderSpider(scrapy.Spider):
    name = 'cssrc_spider'
    allowed_domains = ['cssrc.ac.cn']
    start_urls = ['http://www.cssrc.ac.cn']

    def parse(self, response):
        pass
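Before writing the extraction logic, it is convenient to inspect the target pages interactively and work out the right XPath expressions. Scrapy ships with an interactive shell for exactly this purpose; a quick sketch (the URL here is simply the site's homepage, and the XPath is the one used later in this article):

scrapy shell "http://www.cssrc.ac.cn"
>>> response.xpath('//div[@class="info_content"]/div')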

With the basic settings in place, we need to write the code that extracts the data from the website. Specifically, we need to locate where the target data lives and pull it out in code. In this example, we navigate to the document database's search page and extract the corresponding records. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy


class CssrcSpiderSpider(scrapy.Spider):
    name = 'cssrc_spider'
    allowed_domains = ['cssrc.ac.cn']
    start_urls = ['http://www.cssrc.ac.cn']

    def parse(self, response):
        url = 'http://cssrc.ac.cn/report-v1/search.jsp'   # URL of the document database search page
        yield scrapy.Request(url, callback=self.parse_search)  # request the search page

    def parse_search(self, response):
        # Submit the search form as a POST request and handle the response
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__search_source__': 'T',   # search type: documents
                'fldsort': '0',   # sort by relevance
                'Title': '',   # title
                'Author': '',   # first author
                'Author2': '',   # second author
                'Organ': '',   # institution
                'Keyword': '',   # keywords
                'Cls': '',   # classification number
                '___action___': 'search'   # action type: search
            },
            callback=self.parse_result   # callback that handles the search results
        )

    def parse_result(self, response):
        # Locate each entry in the result list via XPath and extract its document information
        result_list = response.xpath('//div[@class="info_content"]/div')
        for res in result_list:
            title = res.xpath('a[@class="title"]/text()').extract_first(default='').strip()   # document title
            authors = res.xpath('div[@class="yiyu"]/text()').extract_first(default='').strip()   # authors
            date = res.xpath('div[@class="date"]/text()').extract_first(default='').strip()   # publication date
            url = res.xpath('a[@class="title"]/@href').extract_first()   # URL of the document detail page
            yield {
                'title': title,
                'authors': authors,
                'date': date,
                'url': url
            }
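Because the site content and the extracted fields are in Chinese, it is worth reviewing a couple of project settings before running the crawler. The following is a minimal sketch of settings.py adjustments: FEED_EXPORT_ENCODING keeps Chinese text readable in the JSON output instead of \uXXXX escapes, and DOWNLOAD_DELAY throttles the request rate; whether ROBOTSTXT_OBEY can stay at its default depends on the site's robots.txt and on whether you are permitted to crawl these pages:

# settings.py (excerpt)

# Write non-ASCII characters (e.g. Chinese) to the exported feed as UTF-8
FEED_EXPORT_ENCODING = 'utf-8'

# Wait between requests so the crawler does not overload the site
DOWNLOAD_DELAY = 1

# Generated projects set this to True by default; only change it if the site's
# robots.txt allows the pages you need and your use of the data is permitted
ROBOTSTXT_OBEY = True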

3. Run the crawler

After writing the code, we can run the crawler from the command line to obtain the data. Specifically, we can run the Scrapy program with the following command:

scrapy crawl cssrc_spider -o cssrc.json

Here, cssrc_spider is the spider name we set earlier, and cssrc.json is the name of the output data file. After the command is executed, the crawler runs automatically and writes the extracted data to that file.
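Once the crawl finishes, the exported file can be inspected with a few lines of Python. A minimal sketch, assuming the file cssrc.json was produced by the command above:

import json

# Load the items exported by "scrapy crawl cssrc_spider -o cssrc.json"
with open('cssrc.json', encoding='utf-8') as f:
    items = json.load(f)

print(f'{len(items)} documents scraped')
for item in items[:5]:
    print(item['title'], item['date'], item['url'])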

4. Summary

This article introduced how to use the Scrapy framework to crawl data from the Chinese Academy of Social Sciences document database. Along the way, we covered the basic workflow of a crawler and how to use Scrapy for crawling: creating a project, submitting a search form, extracting fields with XPath, and exporting the results to a JSON file. I hope this article is helpful and serves as a useful reference for implementing crawlers for other websites.

