Convert data captured by python crawler into PDF-Python Tutorial-php.cn

Convert data captured by python crawler into PDF

Y2J

May 08, 2017 pm 04:56 PM

This article shares the method and code for using a Python crawler to convert "Liao Xuefeng's Python Tutorial" into a PDF. Readers who need it can use it as a reference.

Python is arguably the most convenient language for writing a crawler. The Python community offers a dazzling number of crawler tools, and with libraries that can be used directly, you can write a crawler in minutes. Today's goal is to write a crawler that downloads Liao Xuefeng's Python tutorial and turns it into a PDF e-book for offline reading.

Before writing the crawler, let's first analyze the page structure of the website. The left side of the page is the tutorial's table of contents, and each URL in it corresponds to an article on the right. The upper right shows the article's title, and below it is the article body. The body is what we care about: the data we want to crawl is the body of every page. Below the body is the user comment area, which is of no use to us and can be ignored.

Tool preparation

Once the basic structure of the website is clear, you can prepare the packages the crawler depends on. requests and beautifulsoup are the two workhorses of crawling: requests handles the network requests, and beautifulsoup manipulates the HTML data. With these two we can work quickly, and there is no need for a crawler framework like Scrapy; using one for a small program is like killing a chicken with a sledgehammer. In addition, since we are converting HTML files to PDF, we need library support for that as well. wkhtmltopdf is a very good tool that converts HTML to PDF on multiple platforms, and pdfkit is the Python wrapper for wkhtmltopdf. First install the following dependency packages:

pip install requests
pip install beautifulsoup4
pip install html5lib
pip install pdfkit

Install wkhtmltopdf

On Windows, download the stable version from the wkhtmltopdf official website and install it. After installation, add the program's executable path to the system PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and the error "No wkhtmltopdf executable found" appears. On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf # ubuntu
$ sudo yum install wkhtmltopdf   # centos
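
Before running the crawler, it can help to confirm that the wkhtmltopdf executable is actually reachable. The following is a minimal sketch that uses only the standard library; the helper name find_executable is an assumption made for illustration and is not part of pdfkit.

```python
import shutil


def find_executable(name="wkhtmltopdf"):
    """Return the full path of `name` if it is on PATH, otherwise None.

    `find_executable` is a hypothetical helper, not a pdfkit function.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("wkhtmltopdf was not found on PATH; pdfkit will not be able to run it")
    else:
        print("wkhtmltopdf found at", path)
```

If the binary is installed somewhere outside PATH, pdfkit can also be given its location explicitly by building a pdfkit.configuration(wkhtmltopdf=...) object and passing it through the configuration argument of the conversion functions.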

Crawler implementation

With everything ready, you can start coding, but it is worth sorting out the approach first. The purpose of the program is to save the HTML body of every URL locally, and then use pdfkit to convert these files into a single PDF file. Let's split the task: first save the HTML body of one URL locally, then find all the URLs and perform the same operation on each.

Open the page in Chrome and press F12 to find the tag that wraps the article body; it is the element with the class x-wiki-content. After loading the whole page with requests, you can use beautifulsoup to work with the HTML DOM and extract the body.

The implementation is as follows: use the soup.find_all function to find the body tag, and then save its content to the a.html file.

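import requests
from bs4 import BeautifulSoup

def parse_url_to_html(url):
  # Download the page and parse it with the html5lib parser
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html5lib")
  # The article body is the first element with class "x-wiki-content"
  body = soup.find_all(class_="x-wiki-content")[0]
  html = str(body)
  # str(body) is text, so open the file in text mode with an explicit encoding
  with open("a.html", "w", encoding="utf-8") as f:
    f.write(html)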

The second step is to parse out all the URLs on the left side of the page. Use the same method to find the tag of the left-hand menu.

The implementation logic: two elements on the page carry the class attribute uk-nav uk-nav-side, and the real table of contents is the second one. Once all the URLs have been obtained, the URL-to-HTML function written in the first step can be applied to each of them.

def get_url_list():
  """
  Get the list of all URLs in the table of contents
  """
  response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
  soup = BeautifulSoup(response.content, "html5lib")
  # The second "uk-nav uk-nav-side" element is the real table of contents
  menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
  urls = []
  for li in menu_tag.find_all("li"):
    # The href values are site-relative, so prepend the host
    url = "http://www.liaoxuefeng.com" + li.a.get('href')
    urls.append(url)
  return urls
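
The two steps can be tied together by looping over the URL list and writing one HTML file per article. The sketch below is an assumption about how that glue might look, not the author's code: save_pages and its fetch_body parameter are hypothetical names, and fetch_body stands for any function that takes a URL and returns the article body as an HTML string (for example, a variant of parse_url_to_html that returns html instead of writing a.html).

```python
import os


def save_pages(urls, fetch_body, out_dir="."):
    """Write one numbered HTML file per URL and return the file names in order.

    `fetch_body` is a hypothetical callable: URL in, HTML string out.
    """
    os.makedirs(out_dir, exist_ok=True)
    file_names = []
    for index, url in enumerate(urls):
        html = fetch_body(url)
        # Numbered names keep the chapters in table-of-contents order
        file_name = os.path.join(out_dir, "{}.html".format(index))
        with open(file_name, "w", encoding="utf-8") as f:
            f.write(html)
        file_names.append(file_name)
    return file_names


if __name__ == "__main__":
    import tempfile

    # A stand-in fetcher so the sketch runs without network access
    def fake_fetch(url):
        return "<div>content of {}</div>".format(url)

    with tempfile.TemporaryDirectory() as tmp:
        print(save_pages(["http://example.com/a", "http://example.com/b"], fake_fetch, tmp))
```

The returned list of file names is the kind of value that can then be handed to the PDF conversion step.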

The last step is to convert the HTML into a PDF file. This is very simple, because pdfkit encapsulates all the logic; you only need to call the function pdfkit.from_file.

import pdfkit

def save_pdf(htmls, file_name):
  """
  Convert all HTML files into a single PDF file
  """
  options = {
    'page-size': 'Letter',
    'encoding': "UTF-8",
    'custom-header': [
      ('Accept-Encoding', 'gzip')
    ]
  }
  # htmls is a list of HTML file paths; file_name is the output PDF path
  pdfkit.from_file(htmls, file_name, options=options)

Execute the save_pdf function, and the e-book PDF file will be generated.
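
One detail worth handling at this step is removing the intermediate HTML files once the PDF has been written. The following is a minimal sketch under stated assumptions, not the author's implementation: convert_and_clean is a hypothetical helper, and its convert parameter exists only so the cleanup logic can be tried without wkhtmltopdf installed. By default it falls back to pdfkit.from_file.

```python
import os


def convert_and_clean(html_files, output, convert=None):
    """Convert `html_files` to one PDF, then delete the temporary HTML files.

    Hypothetical helper. `convert` defaults to pdfkit.from_file and is a
    parameter only so the cleanup logic can be tried without wkhtmltopdf.
    """
    if convert is None:
        import pdfkit  # imported here so the module loads even without pdfkit
        convert = pdfkit.from_file
    try:
        convert(html_files, output)
    finally:
        # Clean up even if the conversion raised an exception
        for name in html_files:
            if os.path.exists(name):
                os.remove(name)
```

Putting the deletion in a finally block means a failed conversion does not leave stray HTML files behind; drop it if you would rather keep the files for debugging.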

Summary

The code adds up to fewer than 50 lines in total. However, the code given above omits some details: how to get the title of each article; the img tags in the body use relative paths, which must be changed to absolute paths for the pictures to display correctly in the PDF; and the temporary HTML files that were saved must be deleted. These details are all posted on GitHub.
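
The relative-path detail can be illustrated with a short sketch. This is an assumption about one possible approach, not the code from the author's repository: absolutize_img_src is a hypothetical helper, and it uses a regular expression plus urllib.parse.urljoin from the standard library rather than beautifulsoup, so it only handles src attributes that are wrapped in single or double quotes.

```python
import re
from urllib.parse import urljoin


def absolutize_img_src(html, base_url):
    """Rewrite the src attribute of every <img> tag to an absolute URL.

    Hypothetical helper; a regex keeps the sketch dependency-free, and it
    only handles double- or single-quoted src attributes.
    """
    pattern = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)

    def replace(match):
        prefix, quote, src = match.groups()
        # urljoin leaves URLs that are already absolute unchanged
        return prefix + quote + urljoin(base_url, src) + quote

    return pattern.sub(replace, html)


if __name__ == "__main__":
    sample = '<p><img src="/files/pic.png"></p>'
    print(absolutize_img_src(sample, "http://www.liaoxuefeng.com/wiki/"))
```

Applied to the body HTML before it is written to disk, a step like this lets wkhtmltopdf fetch the images when it renders the page.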
