Scrapy framework practice: crawling Jianshu website data
Scrapy is an open source Python crawler framework for extracting data from websites. In this article, we will introduce the Scrapy framework and use it to crawl data from the Jianshu website.
Scrapy can be installed using package managers such as pip or conda. Here, we use pip to install Scrapy. Enter the following command in the command line:
pip install scrapy
After the installation is complete, you can use the following command to check whether Scrapy has been successfully installed:
scrapy version
If you see output similar to "Scrapy x.x.x - no active project", Scrapy has been installed successfully.
Before we start using Scrapy, we need to create a Scrapy project. Enter the following command at the command line:
scrapy startproject jianshu
This will create a Scrapy project named "jianshu" in the current directory.
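The command generates Scrapy's standard project layout, which looks like this:

jianshu/
    scrapy.cfg            # deploy configuration file
    jianshu/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py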
In Scrapy, a spider (crawler) is the component that defines how to crawl a website and extract data from its pages. Before writing one, we use the Scrapy shell to analyze the Jianshu site.
Enter the following command at the command line:
scrapy shell "https://www.jianshu.com"
This will launch the Scrapy shell, where we can inspect the page source and elements of the Jianshu website in order to build selectors for our crawler.
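Inside the shell, a few quick checks help confirm that the page was fetched correctly before we start writing selectors:

response.status                                # HTTP status code of the fetched page
response.css('title::text').extract_first()    # text of the page's <title> tag
view(response)                                 # open the downloaded page in your browser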
For example, we can use the following selector to extract the article title:
response.css('h1.title::text').extract_first()
We can use the following selector to extract the article author:
response.css('a.name::text').extract_first()
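Note that extract_first() returns only the first matching node. To collect every match on the page as a list, use extract() instead (newer Scrapy versions also offer the equivalent get() and getall()):

response.css('a.name::text').extract()    # list of all author names on the page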
After testing the selectors in the Scrapy shell, we can generate a new Python file for our crawler. From inside the project directory, enter the following command at the command line:
scrapy genspider jianshu_spider jianshu.com
This will create a Scrapy spider named "jianshu_spider" under the spiders directory. We can add the selectors we tested in the Scrapy shell to the spider's .py file and specify the data to extract.
For example, the following code extracts the titles and authors of all articles on the home page of the Jianshu website:
import scrapy

class JianshuSpider(scrapy.Spider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    def parse(self, response):
        # each article on the home page sits in an <li> with a data-note-id attribute
        for article in response.css('li[data-note-id]'):
            yield {
                'title': article.css('a.title::text').extract_first(),
                'author': article.css('a.name::text').extract_first(),
            }
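In practice, Jianshu may block requests that use Scrapy's default User-Agent or that arrive too quickly. If the spider returns no items, a common first step (a sketch; the exact values are your choice) is to adjust a few options in the project's settings.py:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like User-Agent string
ROBOTSTXT_OBEY = True    # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1       # wait one second between requests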
Now we run the Scrapy crawler from the command line and output the results to a JSON file. Enter the following command at the command line:
scrapy crawl jianshu_spider -o articles.json
This command will run our crawler and save the output data to a JSON file called "articles.json".
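The -o exporter writes the items as a single JSON array, so the file can be consumed directly from Python; for example:

import json

# load the items that the crawl wrote to articles.json
with open('articles.json', encoding='utf-8') as f:
    articles = json.load(f)

for article in articles:
    print(article['title'], '-', article['author'])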
In this article, we introduced the Scrapy framework and used it to crawl data from the Jianshu website. Scrapy makes extracting data from websites straightforward, and thanks to its built-in concurrency and extensibility it scales to large data-extraction applications.