Scrapy in Action: Crawling Baidu News Data
As the Internet has developed, it has displaced traditional media as the main channel through which people get their news. Researchers and analysts, in turn, need large amounts of data for their work. This article therefore introduces how to use Scrapy to crawl news data from Baidu News.
Scrapy is an open-source Python crawler framework that can fetch website data quickly and efficiently. It provides powerful page parsing and crawling features, along with good extensibility and a high degree of customization.
Step 1: Install Scrapy
Before you start, you need to install Scrapy and a few other libraries. This can be done with the following commands:
pip install scrapy
pip install requests
pip install bs4
Step 2: Create a Scrapy project
Create a Scrapy project through the following command:
scrapy startproject baiduNews
After the command finishes, a folder named baiduNews is created in the current directory, containing the initial structure of a Scrapy project.
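For reference, the generated layout follows Scrapy's standard project template (the exact set of files can vary slightly between Scrapy versions):

baiduNews/
    scrapy.cfg            # deployment configuration
    baiduNews/            # the project's Python package
        __init__.py
        items.py          # item definitions (used in Step 4)
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider modules go here (used in Step 3)
            __init__.py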
Step 3: Write Spider
In Scrapy, a Spider is the component that defines how a site is crawled and how its pages are parsed. We need to write one to fetch data from the Baidu News site. Inside the spiders folder that startproject created (baiduNews/baiduNews/spiders), create a Python file, for example baidu_spider.py, starting from the spider skeleton below.
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    start_urls = [
        "http://news.baidu.com/"
    ]

    def parse(self, response):
        pass
In the code above, we import the Scrapy library and create a class named BaiduSpider. Its start_urls attribute is a list containing the Baidu News URL to crawl. The parse method is the callback Scrapy invokes with each downloaded response, and it is where the actual data extraction happens. For now it is just a stub; next, we fill it in to extract the news data.
import scrapy
from bs4 import BeautifulSoup

from baiduNews.items import BaidunewsItem


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    start_urls = [
        "http://news.baidu.com/"
    ]

    def parse(self, response):
        # Parse the downloaded page with BeautifulSoup.
        soup = BeautifulSoup(response.body, "html.parser")
        # Headline containers on the Baidu News front page carry this class.
        results = soup.find_all("div", class_="hdline_article_tit")
        for res in results:
            item = BaidunewsItem()
            item["title"] = res.a.string.strip()
            item["url"] = res.a.get("href").strip()
            item["source"] = "百度新闻"  # "Baidu News"
            yield item
In the code above, we first parse the page with BeautifulSoup, then find all div elements with the class hdline_article_tit, which wrap the headlines on the Baidu News front page. For each one we create a BaidunewsItem, fill in its title, URL, and source, and hand it back to Scrapy with a yield statement.
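As a side note, BeautifulSoup is not strictly required here: Scrapy ships with its own CSS selectors. A minimal equivalent parse method, assuming the same hdline_article_tit class on the page, might look like this:

    def parse(self, response):
        # Select headline containers with Scrapy's built-in CSS selectors.
        for res in response.css("div.hdline_article_tit"):
            item = BaidunewsItem()
            # ::text and ::attr() extract text content and attribute values;
            # get() returns None instead of raising if nothing matches.
            item["title"] = (res.css("a::text").get() or "").strip()
            item["url"] = (res.css("a::attr(href)").get() or "").strip()
            item["source"] = "百度新闻"  # "Baidu News"
            yield item

This drops the bs4 dependency entirely and is a little more forgiving when a headline is missing text or an href.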
Step 4: Define Item
In Scrapy, an Item defines the structure of the scraped data. We need to declare one in the project's items.py file.
import scrapy


class BaidunewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    source = scrapy.Field()
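Items behave much like dictionaries, which is why the spider above can assign fields with item["title"] = .... As a quick illustration (the values here are made up):

item = BaidunewsItem(title="Example headline", url="http://news.baidu.com/example", source="百度新闻")
print(item["title"])   # fields are read back with dict-style access
print(dict(item))      # and an Item converts cleanly to a plain dict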
Step 5: Start Spider and output data
Starting the Spider and exporting the data takes a single command:
scrapy crawl baidu -o baiduNews.csv
After the command finishes, a file named baiduNews.csv is created in the project root directory, containing all of the crawled news data.
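The export format is inferred from the file extension, so the same spider can emit any of the other formats supported by Scrapy's feed exports, for example:

scrapy crawl baidu -o baiduNews.json   # a JSON array
scrapy crawl baidu -o baiduNews.jl     # JSON Lines, one item per line
scrapy crawl baidu -o baiduNews.xml    # XML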
Summary
With Scrapy, we can quickly and efficiently fetch Baidu news data and save it locally. Scrapy is highly extensible and supports output in multiple data formats. This article covers only a simple application of Scrapy, which has many more powerful features worth exploring.