How to split PDF documents using the PyPDF2 module in Python

WBOY

May 09, 2023 03:34 PM


Install PyPDF2 module

# The import name is case-sensitive: the y is lowercase and the other letters are uppercase (pip itself ignores case)

pip3 install PyPDF2
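To confirm that the installation worked before going further, a small check like the following may help. It is only a sketch: the helper name is_installed is made up for illustration, and it relies on nothing beyond the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with exactly this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The name must be spelled exactly as it will appear in the import statement.
print("PyPDF2 installed:", is_installed("PyPDF2"))
```

If this prints False, run the pip3 command again in the same Python environment that will run the script.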

After installation, create a dedicated folder for this project on the local hard disk. The path used here is F:\Python\PyPDF2: inside the existing Python folder on the F drive, I created a folder named after the module, which keeps this project separate from the others.

Create files and prepare PDF documents

Find a fairly large PDF document to practice on. I downloaded the Django documentation from the official Django website; at more than 1,900 pages, it is more than large enough. You can download it from the official website yourself, or reply 'pdf' on my official account to get the download link. Then create a project file named PDFCF.py.

Start writing

The program starts with two comment lines. The first specifies the program that runs this file. The second describes what the file does. Their purpose is not obvious yet, but it becomes clear once you know how to run programs quickly from batch files, so I will not go into detail here.

#! python
# PDFCF.py - PDF file splitting program

The idea of document splitting

The number of parts is not fixed. What is fixed is the number of pages in each part, and the number of parts is calculated from it. With that idea in place, the next step is to write down the formula.

Number of parts = total pages in the document / pages in each split PDF

For example:

Suppose we want to split a 35-page PDF into new documents of 10 pages each. The number of parts is calculated as follows:

3.5 = 35 / 10

Pay attention here. What does the fractional part 0.5 mean? In this example, it means that 5 pages are left over after splitting off 3 full parts. Whenever there is a remainder, whatever its size, round the result up by 1 so that the whole document is covered. The result is that each of the first 3 documents has 10 pages and the fourth document holds the last 5 pages. If the total divides evenly, the quotient is the number of parts.

Python split calculation formula:

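if 35 % 10:                # is there a remainder?
    parts = 35 // 10 + 1   # integer quotient plus 1
else:
    parts = 35 // 10       # divides evenly: the quotient is the number of parts

# The same logic written on one line; parts is 4 here
parts = 35 // 10 + 1 if 35 % 10 else 35 // 10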
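The same calculation can be wrapped in a function so that it works for any page count, not just 35 and 10. This is a minimal sketch; the name count_parts and its parameters are chosen for illustration only.

```python
import math


def count_parts(total_pages, pages_per_part):
    """Number of output documents when each part holds at most pages_per_part pages."""
    if pages_per_part <= 0:
        raise ValueError("pages_per_part must be positive")
    quotient, remainder = divmod(total_pages, pages_per_part)
    # Any remainder, however small, needs one extra document.
    return quotient + 1 if remainder else quotient


print(count_parts(35, 10))  # 4
print(count_parts(30, 10))  # 3
# Rounding the true quotient up gives the same answer.
print(count_parts(35, 10) == math.ceil(35 / 10))  # True
```

The divmod version stays in integer arithmetic, which avoids any floating-point rounding for very large page counts.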

How to split it specifically?

Let’s take this 35-page document split as an example:

Loop over every page with for num in range(35) to get the data of each page, then split according to the page range of each part:

The first document is from 0--10, excluding 10
The second document is from 10--20, excluding 20
The third document is from 20--30, excluding 30
The fourth document is from 30--35, excluding 35
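The four ranges listed above can also be generated rather than written out by hand. The sketch below assumes 0-based page indexes with an excluded end, as in the list; part_ranges is a hypothetical helper name.

```python
def part_ranges(total_pages, pages_per_part):
    """Return (start, stop) pairs; start is included and stop is excluded."""
    return [
        (start, min(start + pages_per_part, total_pages))
        for start in range(0, total_pages, pages_per_part)
    ]


for number, (start, stop) in enumerate(part_ranges(35, 10), start=1):
    print(f"document {number}: from {start}--{stop}, excluding {stop}")
```

Using min() caps the last range at the total page count, so the final document simply ends up shorter.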

We can see a pattern. The first number of each range is the number of pages per document multiplied by the part's sequence number minus 1. The second number seems to have no pattern at first, but there is one if we look carefully. If we number the parts in order, 1--4 in this example, the second number is the current part number multiplied by the number of pages per document (fixed at 10 here).

But a traversal that starts from 0 makes num unusable as the part number. So we number the parts from 1 instead: range(1, 4). Because range does not include its end value, the last part would be missing, so we add 1 to the end value and get

for num in range(1, 4 + 1)
The first document is from 10*(1-1)--10*1, excluding 10
The second document is from 10*(2-1)--10*2, excluding 20
The third document is from 10*(3-1)--10*3, excluding 30
The fourth document is from 10*(4-1)--35, excluding 35

The specific traversal code is as follows:

for num in range(1, 4 + 1):          # part numbers 1 to 4
    for i in range(10 * (num - 1), 10 * num if num != 4 else 35):
        pass                         # i is the index of a page in part num

Note: When the traversal reaches num = 4 (the number of the last document), the end of the page range is simply the total number of pages, 35, and the traversal ends there. Why is the end value 35 here rather than 35 + 1? Because page numbers are counted from 0, so the pages run from 0 to 34 and there is no need to add 1.
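To see which pages each part number receives, the nested traversal can be generalized as below. This is a sketch only: pages_by_part is a made-up name, and the hard-coded 4 and 35 from the example are replaced by values computed from the arguments.

```python
def pages_by_part(total_pages, pages_per_part):
    """Map each 1-based part number to the 0-based page indexes it receives."""
    parts = -(-total_pages // pages_per_part)   # ceiling division
    layout = {}
    for num in range(1, parts + 1):             # part numbers 1..parts
        start = pages_per_part * (num - 1)
        # The last part stops at the total page count instead of a full multiple.
        stop = pages_per_part * num if num != parts else total_pages
        layout[num] = list(range(start, stop))
    return layout


layout = pages_by_part(35, 10)
print(len(layout))   # 4
print(layout[4])     # [30, 31, 32, 33, 34]
```

Every page index from 0 to 34 appears exactly once across the four lists, which is a quick way to check that no page was dropped or duplicated.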

Complete splitting program:

import PyPDF2
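For orientation, here is one way a splitting program along these lines could look. It is a hedged sketch, not the author's original listing: it assumes a recent PyPDF2 release that provides PdfReader and PdfWriter (the 3.x class names), and the function name split_pdf, the out_dir parameter, and the output file naming are all illustrative choices.

```python
from pathlib import Path


def split_pdf(source, pages_per_part, out_dir="."):
    """Write `source` out as numbered PDFs of at most `pages_per_part` pages each."""
    if pages_per_part <= 0:
        raise ValueError("pages_per_part must be a positive integer")

    # Imported inside the function so this file still loads where PyPDF2 is absent.
    # Assumption: a PyPDF2 release that provides PdfReader and PdfWriter (3.x).
    from PyPDF2 import PdfReader, PdfWriter

    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(str(source))
    total = len(reader.pages)
    written = []
    for number, start in enumerate(range(0, total, pages_per_part), start=1):
        writer = PdfWriter()
        stop = min(start + pages_per_part, total)   # the last part may be shorter
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        target = out_dir / f"{source.stem}_{number}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


# Hypothetical usage: split_pdf("django.pdf", 10, out_dir="parts")
```

Older PyPDF2 releases used different class and method names, so the calls may need adjusting to match the installed version.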

Note: I personally feel that the splitting idea above is a bit convoluted. If you have a thorough understanding of slicing and step size in Python lists, it does not need to be so complicated. You only need to generate one large list of all the page numbers, split it into several small lists by slicing, and then split the document according to each small list. The page range of each split PDF runs from the first number of its small list to the last number plus 1. I have also posted the code I wrote with the list method for your reference.

Splitting the PDF with the list-splitting method:

#! python
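The list idea described in the note can be sketched as follows. This is an illustration of the slicing technique under the same 35-page, 10-pages-per-part assumptions, not the author's original listing; chunk_page_numbers is a hypothetical name.

```python
def chunk_page_numbers(total_pages, n):
    """Split [0, 1, ..., total_pages - 1] into consecutive lists of at most n numbers."""
    pages = list(range(total_pages))
    # Slicing past the end of a list is safe, so the last chunk is simply shorter.
    return [pages[i:i + n] for i in range(0, total_pages, n)]


for chunk in chunk_page_numbers(35, 10):
    # Each part covers range(first number, last number + 1).
    print(chunk[0], chunk[-1] + 1)
```

Each small list can then be used directly as the set of page indexes to copy into one output document.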

How to use?

Inside the project folder, hold down the Shift key and right-click, choose the option to open a command window here, type PDFCF.py, and press Enter. Change the value of n to suit your needs.
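If editing the file each time is inconvenient, n could instead be read from the command line. The sketch below assumes that n is the number of pages per part; the argument names and the default of 10 are illustrative, and django.pdf is only a placeholder file name.

```python
import argparse


def parse_args(argv=None):
    """Parse the PDF path and the pages-per-part value n from the command line."""
    parser = argparse.ArgumentParser(description="Split a PDF into parts of n pages each.")
    parser.add_argument("source", help="path of the PDF to split")
    parser.add_argument("-n", type=int, default=10, help="pages per output document")
    return parser.parse_args(argv)


# Passing a list simulates typing: PDFCF.py django.pdf -n 50
args = parse_args(["django.pdf", "-n", "50"])
print(args.source, args.n)  # django.pdf 50
```

Calling parse_args() with no argument reads the real command line instead.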


Statement

This article is reproduced from 亿速云. If there is any infringement, please contact admin@php.cn to have it removed.
