Capturing HTML web pages and saving them as PDF files with Python

不言 (original article)
2018-05-08 11:55:38

This article shows how to use Python to capture HTML web pages and save them as PDF files. Through a worked example, it covers installing the PyPDF2 module, converting the captured HTML pages to PDF, and merging the resulting PDF files with PyPDF2.

The example is shared here for reference. The details are as follows:

1. Preface

Today I will show how to capture HTML web pages and save them as PDF. Without further ado, let's go straight to the tutorial.

2. Preparation

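1. Installing and using PyPDF2 (used to merge PDF files):

PyPDF2 version: 1.25.1 (the PdfFileMerger class used below belongs to this 1.x API; PyPDF2 3.0 removed it in favor of PdfMerger)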

Installation:

pip install PyPDF2

Usage example:

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
# Open both inputs in binary mode and keep them open until the output is written
with open("hql_1_20.pdf", "rb") as input1, open("hql_21_40.pdf", "rb") as input2:
    merger.append(input1)
    merger.append(input2)
    # Write to an output PDF document
    with open("hql_all.pdf", "wb") as output:
        merger.write(output)

2. requests and beautifulsoup are the two workhorses of web scraping: requests handles the network requests, and beautifulsoup parses and manipulates the HTML data. With these two, the job gets done quickly; a crawler framework such as scrapy would be overkill for a program this small. Since the HTML files are to be converted to PDF, a library for that is needed as well. wkhtmltopdf is a very useful cross-platform tool that converts HTML to PDF, and pdfkit is the Python wrapper for wkhtmltopdf. First install the following dependencies:

pip install requests
pip install beautifulsoup4
pip install pdfkit

3. Install wkhtmltopdf

On Windows, download the stable version of wkhtmltopdf from http://wkhtmltopdf.org/downloads.html and install it. After installation, add the directory containing the executable to the system PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and fails with the error "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf # ubuntu
$ sudo yum install wkhtmltopdf     # centos
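
Whether pdfkit will be able to find the executable can also be checked from Python before any conversion is attempted. The following is only a minimal sketch and not part of the original article: the helper names find_wkhtmltopdf and make_pdfkit_config are made up for illustration, and the second helper assumes pdfkit's configuration(wkhtmltopdf=...) call for pointing pdfkit at an explicit executable path.

```python
import shutil


def find_wkhtmltopdf(executable="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def make_pdfkit_config(path=None):
    """Build a pdfkit configuration for an explicit wkhtmltopdf path.

    pdfkit is imported here so that the PATH check above works without it.
    """
    import pdfkit  # third-party: pip install pdfkit

    path = path or find_wkhtmltopdf()
    if path is None:
        raise RuntimeError("wkhtmltopdf was not found on PATH")
    return pdfkit.configuration(wkhtmltopdf=path)


if __name__ == "__main__":
    location = find_wkhtmltopdf()
    if location is None:
        print("wkhtmltopdf is not on PATH; pdfkit would fail to find it")
    else:
        print("wkhtmltopdf found at", location)
```

The object returned by make_pdfkit_config can be passed to pdfkit.from_file through its configuration argument, which avoids editing PATH at all.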

3. Data preparation

1. Get the url of each article

import requests
from bs4 import BeautifulSoup


def get_url_list():
    """
    Get the list of URLs of all articles from the table of contents
    :return: list of article URLs
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    # The menu is the second element with the class "uk-nav uk-nav-side"
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls
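
The function above builds each absolute URL by string concatenation, which works as long as every href starts with a slash. As a hedged alternative, the standard library's urljoin resolves relative and already-absolute links alike. The helper name to_absolute and the sample hrefs below are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.liaoxuefeng.com"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the base URL, skipping empty values."""
    return [urljoin(base, href) for href in hrefs if href]


if __name__ == "__main__":
    # Hypothetical hrefs, shaped like the ones read from the menu's <a> tags
    sample = ["/wiki/001", "http://example.com/other", None]
    for url in to_absolute(sample):
        print(url)
```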

2. Use each article's URL to fetch its content, wrap it in a template, and save it as an HTML file

html template:

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""

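The template is filled in with str.format, which replaces the {content} placeholder. One detail worth keeping in mind: any other literal braces in the template, for example in an inline CSS rule, have to be doubled, or format will treat them as placeholders. A small sketch, assuming a hypothetical styled variant of the template and a made-up helper named render:

```python
# Hypothetical variant of the template with an inline style rule.
# Literal braces are doubled so that str.format leaves them alone.
styled_template = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <style>body {{ font-family: sans-serif; }}</style>
</head>
<body>
{content}
</body>
</html>
"""


def render(content, template=styled_template):
    """Insert an HTML fragment into the page template."""
    return template.format(content=content)


if __name__ == "__main__":
    print(render("<h1>Title</h1><p>Article text</p>"))
```

Braces inside the inserted content itself need no escaping, because format only interprets the template string.
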
Save:

import logging
import re

import requests
from bs4 import BeautifulSoup


def parse_url_to_html(url, name):
    """
    Parse the URL and save the article content as an HTML file
    :param url: URL to parse
    :param name: file name of the saved HTML
    :return: name of the saved HTML file
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Article body
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()
        # Add the title at the beginning of the body, centered
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        html = str(body)
        # Turn relative src paths of img tags in the body into absolute paths
        pattern = "(<img .*?src=\")(.*?)(\")"

        def func(m):
            # group(2) is the src value; group(3) is only the closing quote
            if not m.group(2).startswith("http"):
                return m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
            return m.group(0)

        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception:
        logging.error("Parse error", exc_info=True)
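
The image rewriting step can be tried in isolation, without any network access. The sketch below is an illustration rather than the article's exact code: the function name absolutize_img_src is made up, it swaps the manual prefix check for urljoin, and it assumes that src attributes are written with double quotes.

```python
import re
from urllib.parse import urljoin

# Three groups: everything up to the opening quote, the src value, the closing quote
IMG_SRC = re.compile(r'(<img\s[^>]*?src=")([^"]*)(")')


def absolutize_img_src(html, base="http://www.liaoxuefeng.com"):
    """Rewrite relative img src attributes so they point at the base site."""
    def fix(match):
        prefix, src, quote = match.groups()
        # urljoin leaves absolute URLs untouched and resolves relative ones
        return prefix + urljoin(base, src) + quote

    return IMG_SRC.sub(fix, html)


if __name__ == "__main__":
    # Hypothetical fragment with one relative and one absolute image
    fragment = '<p><img alt="a" src="/files/a.png"><img src="http://cdn.example.com/b.png"></p>'
    print(absolutize_img_src(fragment))
```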

3. Convert html to pdf

import pdfkit


def save_pdf(htmls, file_name):
    """
    Convert HTML files to a PDF file
    :param htmls: HTML file name, or a list of HTML file names
    :param file_name: name of the PDF file
    :return:
    """
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls, file_name, options=options)
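
Because pdfkit.from_file also accepts a list of input files, several HTML files can in principle be converted into one PDF in a single call. The sketch below illustrates that variant under stated assumptions: the helper names build_options and htmls_to_single_pdf are hypothetical, and the options are a trimmed-down version of the ones above.

```python
def build_options(page_size="Letter", margin="0.75in", outline_depth=10):
    """Return a wkhtmltopdf options dict with the same margin on all sides."""
    options = {
        "page-size": page_size,
        "encoding": "UTF-8",
        "outline-depth": outline_depth,
    }
    for side in ("top", "right", "bottom", "left"):
        options["margin-" + side] = margin
    return options


def htmls_to_single_pdf(html_files, pdf_name, options=None):
    """Convert several HTML files into one PDF with a single pdfkit call."""
    import pdfkit  # third-party; also needs the wkhtmltopdf executable

    pdfkit.from_file(list(html_files), pdf_name, options=options or build_options())


if __name__ == "__main__":
    print(build_options(margin="1in"))
```

A call such as htmls_to_single_pdf(["0.html", "1.html"], "all.pdf") (file names made up here) would then replace the per-file conversion. The article instead converts each file separately and merges the results, as the next step shows.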

4. Merge the converted single PDFs into one PDF

merger = PdfFileMerger()
for i, pdf in enumerate(pdfs):
    merger.append(open(pdf, 'rb'))
    print(u"Merged PDF no. " + str(i) + ': ' + pdf)
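
The loop above never closes the files it opens. One way to tidy that up is sketched below, with assumptions stated plainly: the function name merge_pdfs and its merger_factory parameter are made up for illustration, and the default assumes the PyPDF2 1.x PdfFileMerger used in this article. contextlib.ExitStack keeps every input open until the merged file has been written and then closes them all.

```python
from contextlib import ExitStack


def merge_pdfs(pdf_paths, output_path, merger_factory=None):
    """Append every PDF in order and write the result to output_path."""
    if merger_factory is None:
        # Assumes PyPDF2 1.x, as used in this article
        from PyPDF2 import PdfFileMerger
        merger_factory = PdfFileMerger

    merger = merger_factory()
    with ExitStack() as stack:
        for path in pdf_paths:
            # ExitStack closes every input once the with block is left
            merger.append(stack.enter_context(open(path, "rb")))
        with open(output_path, "wb") as output:
            merger.write(output)
    return output_path
```

The merger_factory parameter exists only so that a different merger class, or a stand-in for testing, can be supplied without changing the function.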

Full source code:

# coding=utf-8
import os
import re
import time
import logging
import pdfkit
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger
html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""


def parse_url_to_html(url, name):
    """
    Parse the URL and save the article content as an HTML file
    :param url: URL to parse
    :param name: file name of the saved HTML
    :return: name of the saved HTML file
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Article body
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()
        # Add the title at the beginning of the body, centered
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        html = str(body)
        # Turn relative src paths of img tags in the body into absolute paths
        pattern = "(<img .*?src=\")(.*?)(\")"

        def func(m):
            # group(2) is the src value; group(3) is only the closing quote
            if not m.group(2).startswith("http"):
                return m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
            return m.group(0)

        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception:
        logging.error("Parse error", exc_info=True)


def get_url_list():
    """
    Get the list of URLs of all articles from the table of contents
    :return: list of article URLs
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    # The menu is the second element with the class "uk-nav uk-nav-side"
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls


def save_pdf(htmls, file_name):
    """
    Convert HTML files to a PDF file
    :param htmls: HTML file name, or a list of HTML file names
    :param file_name: name of the PDF file
    :return:
    """
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls, file_name, options=options)


def main():
    start = time.time()
    file_name = u"liaoxuefeng_Python3_tutorial"
    urls = get_url_list()
    htmls = []
    pdfs = []
    for index, url in enumerate(urls):
        # parse_url_to_html returns None when a page could not be parsed
        html = parse_url_to_html(url, str(index) + ".html")
        if html is None:
            continue
        pdf = file_name + str(index) + '.pdf'
        save_pdf(html, pdf)
        htmls.append(html)
        pdfs.append(pdf)
        print(u"Converted HTML no. " + str(index))
    merger = PdfFileMerger()
    handles = []
    for i, pdf in enumerate(pdfs):
        handle = open(pdf, 'rb')
        handles.append(handle)
        merger.append(handle)
        print(u"Merged PDF no. " + str(i) + ': ' + pdf)
    with open(u"廖雪峰Python_all.pdf", "wb") as output:
        merger.write(output)
    # Close the inputs before deleting them
    for handle in handles:
        handle.close()
    print(u"PDF written successfully!")
    for html in htmls:
        os.remove(html)
        print(u"Deleted temporary file " + html)
    for pdf in pdfs:
        os.remove(pdf)
        print(u"Deleted temporary file " + pdf)
    total_time = time.time() - start
    print(u"Total time: %f seconds" % total_time)


if __name__ == '__main__':
    main()
