
Scrapy crawler practice: How to crawl the Chinese Academy of Social Sciences document database data?

王林 · Original · 2023-06-22

With the development of the Internet, more and more information is being digitized, and the large amounts of data published on websites are becoming increasingly valuable. Crawling that data makes analysis and processing much more convenient. Scrapy is one of the most commonly used crawler frameworks, and this article shows how to use a Scrapy crawler to collect data from the document database of the Chinese Academy of Social Sciences.

1. Install scrapy

Scrapy is an open-source, Python-based web crawling framework that can be used to crawl websites and extract structured data. Before we begin, we need to install it:

pip install scrapy
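If the installation completes without errors, the scrapy command-line tool becomes available. You can confirm this by printing its version:

scrapy version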

2. Write the crawler code

Next, we need to create a scrapy project and write the crawler code. First, use the terminal to create a new scrapy project:

scrapy startproject cssrc

Then, enter the project directory and create a new spider:

cd cssrc
scrapy genspider cssrc_spider cssrc.ac.cn
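After these two commands, the project typically has the following layout (the exact files can vary slightly between Scrapy versions); the generated spider lives in spiders/cssrc_spider.py:

cssrc/
├── scrapy.cfg
└── cssrc/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── cssrc_spider.py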

In the spider file, we need to set a few parameters. Specifically, we set start_urls to define the URLs we want to crawl and implement the parse method to process the website's responses. The generated spider looks like this:

# -*- coding: utf-8 -*-
import scrapy


class CssrcSpiderSpider(scrapy.Spider):
    name = 'cssrc_spider'
    allowed_domains = ['cssrc.ac.cn']
    start_urls = ['http://www.cssrc.ac.cn']

    def parse(self, response):
        pass

With the basic settings in place, we need to write the code that extracts data from the website. Specifically, we need to locate the target data and extract it in the spider's callbacks. In this example, we need to reach the search page of the document database, submit a search, and extract the resulting records. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy


class CssrcSpiderSpider(scrapy.Spider):
    name = 'cssrc_spider'
    allowed_domains = ['cssrc.ac.cn']
    start_urls = ['http://www.cssrc.ac.cn']

    def parse(self, response):
        url = 'http://cssrc.ac.cn/report-v1/search.jsp'   # URL of the document database search page
        yield scrapy.Request(url, callback=self.parse_search)  # send the request

    def parse_search(self, response):
        # submit the search form as a POST request and handle the response
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__search_source__': 'T',   # search type: documents
                'fldsort': '0',   # sort by relevance
                'Title': '',   # title
                'Author': '',   # first author
                'Author2': '',   # second author
                'Organ': '',   # institution
                'Keyword': '',   # keywords
                'Cls': '',   # classification number
                '___action___': 'search'   # action type: search
            },
            callback=self.parse_result   # callback that handles the search results
        )

    def parse_result(self, response):
        # locate each entry in the result list via XPath and extract its metadata
        result_list = response.xpath('//div[@class="info_content"]/div')
        for res in result_list:
            title = res.xpath('a[@class="title"]/text()').extract_first().strip()   # document title
            authors = res.xpath('div[@class="yiyu"]/text()').extract_first().strip()   # authors
            date = res.xpath('div[@class="date"]/text()').extract_first().strip()   # publication date
            url = res.xpath('a[@class="title"]/@href').extract_first()   # URL of the document detail page
            yield {
                'title': title,
                'authors': authors,
                'date': date,
                'url': url
            }
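Note that extract_first() returns None when a node is not found, so the .strip() calls above will raise an AttributeError on entries with missing fields. The actual page structure of the site may also differ from the XPath expressions assumed here, so treat them as a sketch. A slightly more defensive drop-in replacement for parse_result could look like this:

    def parse_result(self, response):
        # same XPath assumptions as above, but tolerate entries with missing fields
        result_list = response.xpath('//div[@class="info_content"]/div')
        for res in result_list:
            title = res.xpath('a[@class="title"]/text()').extract_first(default='').strip()
            authors = res.xpath('div[@class="yiyu"]/text()').extract_first(default='').strip()
            date = res.xpath('div[@class="date"]/text()').extract_first(default='').strip()
            url = res.xpath('a[@class="title"]/@href').extract_first()
            yield {
                'title': title,
                'authors': authors,
                'date': date,
                # turn a relative detail-page link into an absolute URL
                'url': response.urljoin(url) if url else None,
            }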

3. Run the crawler

After writing the code, we can run the crawler from the command line to collect the data. Specifically, we run the scrapy program with the following command:

scrapy crawl cssrc_spider -o cssrc.json

Here, cssrc_spider is the spider name we set earlier, and cssrc.json is the name of the output data file. After running the command, the crawler runs automatically and writes the scraped items to that file.
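By default, Scrapy's JSON feed exporter escapes non-ASCII characters, so Chinese text in cssrc.json will show up as \uXXXX sequences. If you want readable UTF-8 output, one option is to add the following line to the project's settings.py:

# settings.py — write feed exports (e.g. the JSON output) as UTF-8 instead of ASCII escapes
FEED_EXPORT_ENCODING = 'utf-8'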

4. Summary

This article introduced how to use the scrapy framework to crawl data from the document database of the Chinese Academy of Social Sciences. Along the way, we covered the basic workflow of a crawler: sending requests, submitting a search form with FormRequest, extracting data with XPath, and exporting the results to a file. I hope this article helps you and serves as a useful reference for writing crawlers for other websites.

