
Scrapy in action: crawling Baidu news data

WBOY | Original | 2023-06-23


With the development of the Internet, the primary way people obtain information has shifted from traditional media to the web, and news is no exception. Researchers and analysts often need large amounts of data for their work, so this article introduces how to use Scrapy to crawl Baidu News data.

Scrapy is an open-source Python crawler framework that can crawl website data quickly and efficiently. It provides powerful page parsing and crawling capabilities, along with good scalability and a high degree of customization.

Step 1: Install Scrapy

Before you start, you need to install Scrapy and BeautifulSoup (the spider below uses BeautifulSoup for HTML parsing). Installation can be completed with the following commands:

pip install scrapy
pip install beautifulsoup4
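As a quick sanity check that the install succeeded, you can ask Scrapy for its version from the command line (the exact version printed depends on your environment):

scrapy version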

Step 2: Create a Scrapy project

Create a Scrapy project through the following command:

scrapy startproject baiduNews

After the command executes, a folder named baiduNews is created in the current directory, containing the initial structure of a Scrapy project.
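The generated layout looks roughly like this (file names come from Scrapy's default project template; minor details vary between Scrapy versions):

baiduNews/
    scrapy.cfg            # deployment configuration
    baiduNews/            # the project's Python package
        __init__.py
        items.py          # Item definitions (Step 4)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # where Spider classes live (Step 3)
            __init__.py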

Step 3: Write Spider

In Scrapy, a Spider is the class that defines how a site is crawled and how data is extracted from its pages. We need to write a Spider to obtain data from the Baidu News website. Note that scrapy startproject has already created a spiders folder inside the baiduNews package; create a Python file in it (for example, baidu_spider.py) following the spider template.

import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    start_urls = [
        "http://news.baidu.com/"
    ]

    def parse(self, response):
        pass

In the above code, we first import the Scrapy library and create a class named BaiduSpider. In the class, we define the variable start_urls, a list containing the Baidu News URL; Scrapy requests each of these URLs when the spider starts. The parse method is the callback Scrapy invokes with each downloaded response, and it is where data extraction happens; for now it is an empty stub. Next, we fill it in to extract the news data.

import scrapy
from baiduNews.items import BaidunewsItem
from bs4 import BeautifulSoup

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    start_urls = [
        "http://news.baidu.com/"
    ]

    def parse(self, response):
        # Parse the downloaded page with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")

        # Headline containers on the Baidu News front page
        # (class name observed when this article was written; it may change)
        results = soup.find_all("div", class_="hdline_article_tit")
        for res in results:
            link = res.find("a")
            # Skip entries without a usable link
            if link is None or not link.get("href"):
                continue
            item = BaidunewsItem()
            item["title"] = link.get_text(strip=True)
            item["url"] = link["href"].strip()
            item["source"] = "百度新闻"  # "Baidu News"
            yield item

In the above code, we first parse the page with BeautifulSoup and then find all div elements with the class hdline_article_tit, which wrap the headlines on the Baidu News front page (class names like this can change when the site is redesigned). For each headline we create a BaidunewsItem object, fill in its title, URL, and source, and hand it back to Scrapy with the yield statement.
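As an aside, Scrapy ships with its own CSS and XPath selectors, so the same extraction can be written without BeautifulSoup at all. A minimal sketch of an equivalent parse method, assuming the same hdline_article_tit markup:

    def parse(self, response):
        # response.css uses Scrapy's built-in selectors; no external parser needed
        for link in response.css("div.hdline_article_tit a"):
            item = BaidunewsItem()
            item["title"] = link.css("::text").get(default="").strip()
            item["url"] = link.attrib.get("href", "").strip()
            item["source"] = "百度新闻"
            yield item

Using the built-in selectors drops the beautifulsoup4 dependency entirely; we stick with BeautifulSoup here to match the installation step above.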

Step 4: Define Item

In Scrapy, an Item defines the structure of the crawled data. We define our Item in the items.py file inside the baiduNews package.

import scrapy

class BaidunewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    source = scrapy.Field()
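Items behave much like Python dictionaries, which is why the spider can assign fields with item["title"] = .... A quick illustration (the values here are placeholders):

item = BaidunewsItem()
item["title"] = "Example headline"
item["url"] = "http://news.baidu.com/example"
item["source"] = "百度新闻"
print(item["title"])  # fields read back like dict keys
print(dict(item))     # an Item converts to a plain dict

Assigning a field that was not declared (say, item["date"]) raises a KeyError, which is how Items enforce the declared structure.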

Step 5: Start Spider and output data

We only need to run the following command from the project's root directory to start the Spider and export the data:

scrapy crawl baidu -o baiduNews.csv

After the command executes, a file named baiduNews.csv is created in the project root directory, containing all the crawled news data.
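The export format is inferred from the file extension, so other feed formats need no extra code. A couple of variants (note that -o appends to an existing file, while Scrapy 2.x adds -O to overwrite instead):

scrapy crawl baidu -o baiduNews.json   # JSON array
scrapy crawl baidu -o baiduNews.jl     # JSON Lines, one item per line
scrapy crawl baidu -O baiduNews.csv    # overwrite instead of append (Scrapy 2.x)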

Summary

With Scrapy, we can quickly and efficiently obtain Baidu News data and save it locally. Scrapy scales well and supports output in multiple data formats. This article only covers a simple application of Scrapy; the framework has many more powerful features waiting to be explored.

