Crawl WeChat public account articles and save them as PDF files (Python method)
This is my first blog post. It covers crawling articles from a WeChat public account and saving them locally as PDF files.
```
pip install wechatsogou --upgrade
```
wechatsogou is a WeChat public account crawler interface based on Sogou's WeChat search. Usage is as follows:
```python
import wechatsogou

# captcha_break_time is the number of retries allowed when the captcha
# is entered incorrectly; the default is 1
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)

# Name of the public account
gzh_name = ''

# Return information about the account's 10 most recent articles as a dict
data = ws_api.get_gzh_article_by_history(gzh_name)
```
The structure of the returned data:
```python
{
    'gzh': {
        'wechat_name': '',         # name
        'wechat_id': '',           # WeChat id
        'introduction': '',        # introduction
        'authentication': '',      # authentication
        'headimage': ''            # avatar
    },
    'article': [
        {
            'send_id': int,        # mass-send id; not unique, because one mass send
                                   # can contain several messages sharing the same id
            'datetime': int,       # 10-digit timestamp of the mass send
            'type': '',            # message type; always 49, meaning image-and-text
                                   # (the in-app history page has other types, but the
                                   # web "last 10 messages" page only has 49)
            'main': int,           # whether this is the first message of a mass send, 1 or 0
            'title': '',           # article title
            'abstract': '',        # abstract
            'fileid': int,         #
            'content_url': '',     # article link
            'source_url': '',      # "Read the original" link
            'cover': '',           # cover image
            'author': '',          # author
            'copyright_stat': int  # article type, e.g. original
        },
        ...
    ]
}
```
Two pieces of information are needed here: the article title and the article URL.
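Extracting those two fields from the returned dict can be sketched as below. Note that `sample_data` here is a hand-made stand-in shaped like the documented structure, not real API output:

```python
# Hand-made sample mimicking the documented return value of
# get_gzh_article_by_history (only the fields we need are filled in)
sample_data = {
    'gzh': {'wechat_name': 'example'},
    'article': [
        {'title': 'First post', 'content_url': 'https://mp.weixin.qq.com/s/abc'},
        {'title': 'Second post', 'content_url': 'https://mp.weixin.qq.com/s/def'},
    ],
}

# Pull out (title, url) pairs from the 'article' list
articles = [(a['title'], a['content_url']) for a in sample_data['article']]
print(articles[0])  # ('First post', 'https://mp.weixin.qq.com/s/abc')
```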
After getting the article URL, you can convert the HTML page into a PDF file based on that URL. The conversion relies on the wkhtmltopdf tool, which must be installed separately.

Download address: https://wkhtmltopdf.org/downloads.html

Then install the pdfkit Python wrapper:

```
pip install pdfkit
```
```python
import pdfkit

# Generate a PDF from a URL
pdfkit.from_url('http://baidu.com', 'out.pdf')

# Generate a PDF from an HTML file
pdfkit.from_file('test.html', 'out.pdf')

# Generate a PDF from an HTML string
pdfkit.from_string('Hello!', 'out.pdf')
```
If you generate the PDF directly from the article URL obtained above, the images in the article will not appear in the PDF file.
Solution:
```python
# This method processes the HTML for the given article URL so that images display
content_info = ws_api.get_article_content(url)

# The HTML fragment (incomplete: head, body, etc. tags still need to be added)
html_code = content_info['content_html']
```
Then build the complete HTML document around html_code and call pdfkit.from_string() to generate the PDF file. This time the images in the article are displayed in the PDF.
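The wrapping step can be sketched as a small helper. The `wrap_article` name and the placeholder `content_html` string are mine; in the real flow the fragment comes from `get_article_content`:

```python
def wrap_article(title: str, content_html: str) -> str:
    """Embed an article HTML fragment in a complete HTML document."""
    return f'''<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>{title}</title></head>
<body>
<h2 style="text-align: center;font-weight: 400;">{title}</h2>
{content_html}
</body>
</html>'''

html = wrap_article('Demo title', '<p>article body</p>')
print('<title>Demo title</title>' in html)  # True
```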
```python
import os
import datetime

import pdfkit
import wechatsogou

# Initialize the API
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)

def url2pdf(url, title, targetPath):
    '''
    Generate a PDF file with pdfkit.
    :param url: article URL
    :param title: article title
    :param targetPath: directory in which to store the PDF file
    '''
    try:
        content_info = ws_api.get_article_content(url)
    except Exception:
        return False
    # Processed HTML
    html = f'''
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>{title}</title>
    </head>
    <body>
    <h2 style="text-align: center;font-weight: 400;">{title}</h2>
    {content_info['content_html']}
    </body>
    </html>
    '''
    try:
        pdfkit.from_string(html, targetPath + os.path.sep + f'{title}.pdf')
    except Exception:
        # Some article titles contain special characters
        # and cannot be used as file names
        filename = datetime.datetime.now().strftime('%Y%m%d%H%M%S') + '.pdf'
        pdfkit.from_string(html, targetPath + os.path.sep + filename)

if __name__ == '__main__':
    # Name of the public account to crawl
    gzh_name = ''
    targetPath = os.getcwd() + os.path.sep + gzh_name
    # Create the target folder if it does not exist
    if not os.path.exists(targetPath):
        os.makedirs(targetPath)
    # Return info about the account's 10 most recent articles as a dict
    data = ws_api.get_gzh_article_by_history(gzh_name)
    article_list = data['article']
    for article in article_list:
        url = article['content_url']
        title = article['title']
        url2pdf(url, title, targetPath)
```
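Instead of falling back to a timestamp when a title cannot be used as a file name, you could sanitize the title up front. A minimal sketch; the `sanitize_filename` name, the character class, and the replacement character are my choices, not part of the original script:

```python
import re

def sanitize_filename(title: str, replacement: str = '_') -> str:
    """Replace characters that are illegal in Windows/Unix file names."""
    return re.sub(r'[\\/:*?"<>|]', replacement, title).strip()

print(sanitize_filename('Q&A: how? to/save "files"'))  # Q&A_ how_ to_save _files_
```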
The above is the detailed content of "Crawl WeChat public account articles and save them as PDF files (Python method)", from the PHP Chinese website.