Crawling and analyzing WeChat public account articles with Scrapy
WeChat has been one of the most popular social media applications in recent years, and the public accounts (official accounts) operated on it play an important role. Public accounts are an ocean of information and knowledge, because each account can publish articles, image-and-text messages and other content. This information can be put to use in many fields, such as media reporting and academic research.
This article introduces how to use the Scrapy framework to crawl and analyze WeChat public account articles. Scrapy is a Python web crawling framework built for extracting structured data from websites, and it is highly customizable and efficient.
To crawl with Scrapy, you first need to install Scrapy and the other dependencies. They can be installed with pip:
pip install scrapy
pip install pymongo
pip install mysql-connector-python
After installing Scrapy, we need to use the Scrapy command line tool to create the project. The command is as follows:
scrapy startproject wechat
After executing this command, Scrapy will create a project named "wechat" and create many files and directories in the project directory.
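For reference, the generated layout typically looks like this (the exact files may vary slightly between Scrapy versions):

wechat/
    scrapy.cfg            # deployment configuration
    wechat/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (we will add MongoPipeline here)
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py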
Before we start crawling, we need to first understand the URL format of the WeChat public account article page. The URL of a typical WeChat public account article page looks like this:
https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect
Here, __biz is the ID of the public account, mid is the ID of the article, idx is the article's position within its push, sn is the article's signature, and chksm is a content checksum. The __biz value (called biz_id below) is the unique identifier of a public account, so if we want to crawl all of an account's articles, we need to find its biz_id and use it to build the URLs.
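As a quick illustration (a minimal sketch; the XXX values are simply the placeholders from the example URL above), these parameters can be pulled out of an article URL with Python's standard urllib.parse module:

from urllib.parse import urlparse, parse_qs

article_url = 'https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect'

# parse_qs returns a dict mapping each query parameter to a list of values
params = parse_qs(urlparse(article_url).query)
biz_id = params['__biz'][0]   # public account ID
mid = params['mid'][0]        # article ID
idx = params['idx'][0]        # position of the article within its push
print(biz_id, mid, idx)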
First, we need a list of the public account IDs whose articles we want to crawl. The IDs can be collected in various ways; here we use a list with a few test IDs as an example:
biz_ids = ['MzU5MjcwMzA4MA==', 'MzI4MzMwNDgwMQ==', 'MzAxMTcyMzg2MA==']
Next, we write a Spider that crawls all the articles of a given public account. We pass the Spider's name and the account's biz_id in as arguments, so the same Spider can handle different public accounts.
import scrapy


class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        # Keep the account ID so it can be stored with every item
        self.biz_id = biz_id
        # biz_id values already end with '==', so they are inserted as-is
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        # Queue a request for every article linked from the profile page
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)
        # Follow the "next page" link, if there is one
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)

    def parse_article(self, response):
        # Extract the article's URL and title
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': response.url, 'title': title.extract_first().strip(), 'biz_id': self.biz_id}
The Spider uses the given public account ID to open the account's profile page and then follows the pagination link recursively, extracting the URL of every article it finds. The parse_article method extracts each article's URL and title for later processing. The Spider itself is not complex, but crawling can be slow because every profile page and every article page has to be fetched.
Finally, we start the Spider by entering the following command in the terminal:
scrapy crawl wechat -a biz_id=XXXXXXXX
Similarly, we can crawl several public accounts at once by passing all of their IDs on the command line, separated by commas (the Spider needs a small tweak to split this argument; see the sketch after the command):
scrapy crawl wechat -a biz_id=ID1,ID2,ID3
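The Spider shown earlier builds only a single start URL, so handling a comma-separated list needs a small change in __init__. A minimal sketch of the adjusted constructor (the parse and parse_article methods stay the same as above; to keep a per-article biz_id in the items, it could be recovered from response.url inside parse_article):

import scrapy


class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        # Split the -a biz_id=ID1,ID2,ID3 argument and build one start URL per ID
        ids = [b.strip() for b in (biz_id or '').split(',') if b.strip()]
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(b)
            for b in ids
        ]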
After crawling the articles, we need to save each article's title and URL to a database (such as MongoDB or MySQL). Here, we use the pymongo library to store the crawled data.
import pymongo


class MongoPipeline(object):
    collection_name = 'wechat'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert every crawled item into the 'wechat' collection
        self.db[self.collection_name].insert_one(dict(item))
        return item
In this Pipeline, we use MongoDB as the backend to store the data. The class can be adapted as needed to use other database systems; a MySQL-based variant is sketched below.
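For example, a MySQL-based Pipeline could look roughly like the following. This is only a sketch: the table name, columns, and connection parameters are assumptions, and it uses the mysql-connector-python package installed at the beginning.

import mysql.connector


class MySQLPipeline(object):
    def open_spider(self, spider):
        # Connection parameters would normally be read from settings.py
        self.conn = mysql.connector.connect(
            host='localhost', user='root', password='password', database='wechat'
        )
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            'CREATE TABLE IF NOT EXISTS articles ('
            'id INT AUTO_INCREMENT PRIMARY KEY, url VARCHAR(512), title VARCHAR(512))'
        )

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Store the crawled URL and title in the articles table
        self.cursor.execute(
            'INSERT INTO articles (url, title) VALUES (%s, %s)',
            (item['url'], item['title'])
        )
        self.conn.commit()
        return item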
Next, we need to configure database-related parameters in the settings.py file:
MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'wechat'

ITEM_PIPELINES = {'wechat.pipelines.MongoPipeline': 300}
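While editing settings.py, it can also be worth throttling the crawl so requests are not sent too aggressively. A minimal example using standard Scrapy settings (the values are only suggestions):

DOWNLOAD_DELAY = 2                  # wait 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request to the domain at a time
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like User-Agent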
Finally, we update the Spider so that the items it yields are stored in MongoDB through the Pipeline:
class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]

    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        self.biz_id = biz_id
        self.start_urls = [
            'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}#wechat_redirect'.format(biz_id)
        ]

    def parse(self, response):
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            # Mark requests whose items should be stored
            yield scrapy.Request(url, callback=self.parse_article, meta={'pipeline': 1})
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)

    def parse_article(self, response):
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        item = {
            'url': response.url,
            'title': title.extract_first().strip(),
            'biz_id': self.biz_id,
        }
        # Only yield the item when the request was marked; Scrapy passes every
        # yielded item to MongoPipeline because it is registered in ITEM_PIPELINES
        if response.meta.get('pipeline'):
            yield item
In the code above, response.meta.get('pipeline') only reads back the flag we attached to the request; it does not return a Pipeline object. The actual storage is performed by MongoPipeline, which Scrapy applies to every yielded item because it is registered in ITEM_PIPELINES. The flag is attached when the article request is scheduled in parse():

yield scrapy.Request(url, callback=self.parse_article, meta={'pipeline': 1})
Finally, we use pymongo and pandas to analyze and visualize the data.
Here we will extract the data we crawled from MongoDB and save it to a CSV file. Subsequently, we can use pandas to process and visualize the CSV file.
The following is the implementation process:
import pandas as pd
import matplotlib.pyplot as plt
from pymongo import MongoClient

# Pull the crawled articles out of MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['wechat']
articles = db['wechat']

# Load the documents into a DataFrame and export them to CSV
doc = list(articles.find())
df = pd.DataFrame(doc)
df.to_csv('wechat.csv', encoding='utf-8')

# Plot the number of crawled articles per public account
# (the biz_id field is stored with each item by the Spider above)
df.groupby('biz_id')['title'].count().plot(kind='bar')
plt.show()
In the code above, we use pymongo and pandas to export the crawled data from MongoDB to a CSV file, and then use pandas' plotting support to chart the number of articles crawled from each public account.
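As a further (hypothetical) analysis step, the exported CSV can also be used to look at what the articles are about, for example by counting the most frequent words in the titles. A minimal sketch follows; whitespace splitting is only a placeholder, and for Chinese titles a tokenizer such as jieba would give better results.

from collections import Counter

import pandas as pd

# Assumes wechat.csv was produced by the export step above and has a 'title' column
df = pd.read_csv('wechat.csv')

word_counts = Counter()
for title in df['title'].dropna():
    word_counts.update(title.split())

print(word_counts.most_common(20))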