Using a Scrapy crawler to collect Qianku.com's picture data and rank popular shares
With the growth of the Internet, demand for images has steadily increased, and picture websites have flourished. Qianku.com is a platform specializing in high-definition pictures and material resources: it hosts a large number of exquisite picture materials that users can download for free and use in place of commercial art resources. Downloading these images manually, however, is time-consuming and inefficient. This article therefore shows how to use a Scrapy crawler to collect picture data from Qianku.com and rank the most-shared images.
1. Install Scrapy
Before installing Scrapy, make sure a Python environment is available. Scrapy can then be installed with the pip install scrapy command.
2. Create a Scrapy project
Open a command-line terminal, change to the directory where you want to create the project, and run scrapy startproject qkspider to create a Scrapy project named "qkspider" in that directory.
3. Create a crawler
Enter the project directory and run scrapy genspider qk qkpic.com to create a crawler named "qk" under the spiders folder.
4. Write code
1. Modify the settings.py file
First, open the settings.py file in the qkspider directory and add the following code to it:
```python
ITEM_PIPELINES = {
    'qkspider.pipelines.QkspiderPipeline': 100,
}
```
This enables the pipeline that stores the scraped picture data in the database.
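The value 100 is a priority: when several item pipelines are enabled, Scrapy runs each item through them in ascending order of this number. A minimal sketch of that ordering rule (the DedupePipeline name is invented purely for illustration):

```python
# Hypothetical settings sketch: the second pipeline is invented here only
# to show that the number is a priority, and lower numbers run first.
ITEM_PIPELINES = {
    'qkspider.pipelines.QkspiderPipeline': 100,  # stores items in MongoDB
    'qkspider.pipelines.DedupePipeline': 50,     # hypothetical: would run first
}

# Scrapy orders enabled pipelines by ascending priority value:
ordered = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(ordered)
```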
2. Modify the pipelines.py file
Next, we need to open the pipelines.py file in the qkspider directory and add the following code in it:
```python
import pymongo

class QkspiderPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("mongodb://localhost:27017/")
        db = client['qkdb']
        self.collection = db['qkpic']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```
This pipeline stores the scraped picture data in MongoDB.
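In real use the pipeline needs a running MongoDB server; the sketch below shows the same process_item behavior against a stand-in collection so it can be run without a database (the FakeCollection class and sample item are illustrative):

```python
# A minimal stand-in for a pymongo collection, recording inserted documents.
class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)

class QkspiderPipeline:
    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Scrapy items behave like dicts, so dict(item) is what gets stored.
        self.collection.insert_one(dict(item))
        return item

collection = FakeCollection()
pipeline = QkspiderPipeline(collection)
item = {'title': 'sunset', 'img_url': 'http://www.qkpic.com/x.jpg', 'share_num': 12}
returned = pipeline.process_item(item, spider=None)
print(collection.docs[0]['title'])  # sunset
```

Returning the item from process_item matters: it lets any later pipeline (higher priority number) continue processing the same item.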
3. Modify the items.py file
Open the items.py file in the qkspider directory and add the following code in it:
```python
import scrapy

class QkspiderItem(scrapy.Item):
    title = scrapy.Field()
    img_url = scrapy.Field()
    share_num = scrapy.Field()
```
This defines the fields to be collected for each picture.
4. Modify the qk.py file
Open the qk.py file under the spiders folder and add the following code in it:
```python
import scrapy
from qkspider.items import QkspiderItem

class QkSpider(scrapy.Spider):
    name = "qk"
    allowed_domains = ["qkpic.com"]
    start_urls = ["http://www.qkpic.com/"]

    def parse(self, response):
        items = []
        pic_lists = response.xpath('//div[@class="index_mianpic"]/ul/li')
        for pic in pic_lists:
            item = QkspiderItem()
            item['title'] = pic.xpath('./a/@title').extract_first()
            item['img_url'] = pic.xpath('./a/img/@src').extract_first()
            item['share_num'] = int(pic.xpath('./span/em/text()').extract_first())
            items.append(item)
        return items
```
This defines the rules for crawling the picture data from the Qianku website; the items are then stored in MongoDB by the pipeline. For each picture we extract the title, the image URL, and the share count.
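The extraction logic can be illustrated stand-alone. Scrapy's selectors use full XPath via parsel, but the same pattern can be sketched with the standard library's limited XPath support, using a made-up markup fragment that mirrors the structure the spider's XPath expects (the real Qianku page layout is an assumption here):

```python
import xml.etree.ElementTree as ET

# Illustrative markup mimicking the div.index_mianpic list the spider targets.
html = """
<div class="index_mianpic">
  <ul>
    <li><a title="mountain" href="/p/1"><img src="/img/1.jpg"/></a><span><em>34</em></span></li>
    <li><a title="ocean" href="/p/2"><img src="/img/2.jpg"/></a><span><em>18</em></span></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('./ul/li'):
    a = li.find('./a')
    items.append({
        'title': a.get('title'),                     # like ./a/@title
        'img_url': a.find('./img').get('src'),       # like ./a/img/@src
        'share_num': int(li.find('./span/em').text), # like ./span/em/text()
    })

print(items[0])  # {'title': 'mountain', 'img_url': '/img/1.jpg', 'share_num': 34}
```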
5. Run the crawler
Now we can run the crawler from the command-line terminal. In the qkspider directory, run scrapy crawl qk to start the spider, which crawls the picture data from the Qianku website and stores it in MongoDB.
6. Achieve popular sharing ranking
To obtain the popular sharing ranking of the Qianku website, we also need to crawl the popular-list page. Modify the qk.py file as follows:
```python
class QkSpider(scrapy.Spider):
    name = "qk"
    allowed_domains = ["qkpic.com"]
    start_urls = ["http://www.qkpic.com/", "http://www.qkpic.com/top/"]

    def parse(self, response):
        if response.url.startswith('http://www.qkpic.com/top/'):
            items = self.parse_rank_list(response)
        else:
            items = self.parse_pic_info(response)
        return items

    # Crawl the popular ranking list
    def parse_rank_list(self, response):
        items = []
        pic_lists = response.xpath('//div[@class="topcont"]/ul/li')
        for pic in pic_lists:
            item = QkspiderItem()
            item['title'] = pic.xpath('./a/@title').extract_first()
            item['img_url'] = pic.xpath('./a/img/@src').extract_first()
            item['share_num'] = int(pic.xpath('./div[1]/i[2]/text()').extract_first())
            items.append(item)
        return items
```
In this code, start_urls now contains both the homepage and the popular-list page, and a new parse_rank_list method handles the latter; the homepage parsing logic from the previous step is assumed to be moved into a parse_pic_info method.
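Once the share counts are collected, producing the ranking itself is a sort by share_num in descending order. A minimal sketch (the sample items are made up; real ones come from the spider or from MongoDB):

```python
# Sample scraped items; in practice these come from the spider or the qkpic
# collection in MongoDB.
items = [
    {'title': 'forest', 'share_num': 7},
    {'title': 'city night', 'share_num': 42},
    {'title': 'beach', 'share_num': 19},
]

# Rank by share count, most shared first.
ranking = sorted(items, key=lambda it: it['share_num'], reverse=True)
for rank, it in enumerate(ranking, start=1):
    print(f"{rank}. {it['title']} ({it['share_num']} shares)")
```

With the data already in MongoDB, the same ordering can instead be produced server-side, e.g. collection.find().sort('share_num', pymongo.DESCENDING).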
7. Summary
This article introduced how to use the Scrapy crawler framework to crawl picture data from the Qianku website. We defined the data fields to collect and used MongoDB to store the results. In addition, we extended the crawler to fetch the site's popular sharing ranking list.
The above is the detailed content of using a Scrapy crawler to collect Qianku.com's picture data and rank popular shares. For more information, please follow other related articles on the PHP Chinese website!