Home  >  Article  >  Backend Development  >  Scrapy implements crawling and analysis of WeChat public account articles

Scrapy implements crawling and analysis of WeChat public account articles

WBOY
WBOYOriginal
2023-06-22 09:41:241845browse

Scrapy implements crawling and analysis of WeChat public account articles

WeChat is a very popular social media application in recent years, and the public accounts operated in it also play a very important role. As we all know, WeChat public accounts are an ocean of information and knowledge, because each public account can publish articles, graphic messages and other information. This information can be widely used in many fields, such as media reports, academic research, etc.

So, this article will introduce how to use the Scrapy framework to crawl and analyze WeChat public account articles. Scrapy is a Python web crawler framework whose main function is data mining and information search. Therefore, Scrapy is very customizable and efficient.

  1. Install Scrapy and create a project

To use the Scrapy framework for crawling, you first need to install Scrapy and other dependencies. You can use the pip command to install. The installation process is as follows:

pip install scrapy
pip install pymongo
pip install mysql-connector-python

After installing Scrapy, we need to use the Scrapy command line tool to create the project. The command is as follows:

scrapy startproject wechat

After executing this command, Scrapy will create a project named "wechat" and create many files and directories in the project directory.

  1. Implement crawling of WeChat public account articles

Before we start crawling, we need to first understand the URL format of the WeChat public account article page. The URL of a typical WeChat public account article page looks like this:

https://mp.weixin.qq.com/s?__biz=XXX&mid=XXX&idx=1&sn=XXX&chksm=XXX#wechat_redirect

Among them, __biz represents the ID of the WeChat public account, mid represents the ID of the article, idx represents the serial number of the article, sn represents the signature of the article, and chksm represents Content verification. Therefore, if we want to crawl all articles of a certain official account, we need to find the ID of the official account and use it to build the URL. Among them, biz_id is the unique identifier of the official account.

First of all, we need to prepare a list containing many official account IDs, because we want to crawl the articles of these official accounts. The collection of IDs can be achieved through various means. Here, we use a list containing several test IDs as an example:

biz_ids = ['MzU5MjcwMzA4MA==', 'MzI4MzMwNDgwMQ==', 'MzAxMTcyMzg2MA==']

Next, we need to write a Spider to crawl all articles of a certain public account. Here, we pass the name and ID of the official account to Spider so that we can handle different official account IDs.

import scrapy
import re

class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]
    
    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        self.start_urls = ['https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}==#wechat_redirect'.format(biz_id)]

    def parse(self, response):
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)
        
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)
    
    def parse_article(self, response):
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': url, 'title': title.extract_first().strip()}

The main function of Spider is to use the given official account ID to access the official account homepage, and then recursively traverse each page to extract the URLs of all articles. In addition, the parse_article method is used to extract the URL and title of the article for subsequent processing. Overall, this spider is not very complex, but the extraction speed is slow.

Finally, we need to enter the following command in Terminal to start Spider:

scrapy crawl wechat -a biz_id=XXXXXXXX

Similarly, we can also crawl multiple official accounts, we only need to specify the names of all official accounts in the command Just the ID:

scrapy crawl wechat -a biz_id=ID1,ID2,ID3
  1. Storing article data

After crawling the article, we need to save the title and URL of the article to a database (such as MongoDB, MySQL, etc.). Here, we will use the pymongo library to save the crawled data.

import pymongo

class MongoPipeline(object):
    collection_name = 'wechat'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

In this Pipeline, we use MongoDB as the backend to store data. This class can be modified as needed to use other database systems.

Next, we need to configure database-related parameters in the settings.py file:

MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'wechat'
ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300}

Finally, we call Pipeline in Spider to store data in MongoDB:

class WeChatSpider(scrapy.Spider):
    name = "wechat"
    allowed_domains = ["mp.weixin.qq.com"]
    
    def __init__(self, name=None, biz_id=None):
        super().__init__(name=name)
        self.start_urls = ['https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}==#wechat_redirect'.format(biz_id)]

    def parse(self, response):
        article_urls = response.xpath('//h4[1]/a/@href')
        for url in article_urls.extract():
            yield scrapy.Request(url, callback=self.parse_article)
        
        next_page = response.xpath('//a[@id="js_next"]/@href')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0].extract()), callback=self.parse)
            
    def parse_article(self, response):
        url = response.url
        title = response.xpath('//h2[@class="rich_media_title"]/text()')
        yield {'url': url, 'title': title.extract_first().strip()}

        pipeline = response.meta.get('pipeline')
        if pipeline:
            item = dict()
            item['url'] = url
            item['title'] = title.extract_first().strip()
            yield item

In the above code, response.meta.get('pipeline') is used to obtain the Pipeline object we set in Spider. Therefore, just add the following code to the Spider code to support Pipeline:

yield scrapy.Request(url, callback=self.parse_article, meta={'pipeline': 1})
  1. Data Analysis

Finally, we will use libraries such as Scrapy and pandas to Implement data analysis and visualization.

Here we will extract the data we crawled from MongoDB and save it to a CSV file. Subsequently, we can use pandas to process and visualize the CSV file.

The following is the implementation process:

import pandas as pd
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['wechat']
articles = db['wechat']

cursor = articles.find()
doc = list(cursor)

df = pd.DataFrame(doc)
df.to_csv('wechat.csv', encoding='utf-8')

df.groupby('biz_id')['title'].count().plot(kind='bar')

In the above code, we use the MongoDB and Pandas libraries to save the crawled data into the data folder of the CSV file. Subsequently, we used the powerful data analysis function of Pandas to visually display the number of articles of each public account.

The above is the detailed content of Scrapy implements crawling and analysis of WeChat public account articles. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn