Home  >  Article  >  Backend Development  >  Summarize several methods of operating PDF with Python

Summarize several methods of operating PDF with Python

coldplay.xixi
coldplay.xixiforward
2020-10-08 17:50:244048browse

python tutorial column today summarizes several ways to use Python to operate PDF.

Summarize several methods of operating PDF with Python

01

Preface

Hello everyone, the case about Python operating PDF has been written before I have experienced a?PDF Batch Merge. The original intention of this case is just to provide you with a convenient script, and there is not much explanation of the principle. It involves a very practical module for PDF processing, PyPDF2. This article will analyze this module carefully, mainly involving

  • os Comprehensive application of the module
  • glob Comprehensive application of the module
  • PyPDF2 Module operation

02

Basic operation

PyPDF2 The code to import the module is usually:

from PyPDF2 import PdfFileReader, PdfFileWriter复制代码

Two methods are imported here:

  • PdfFileReader can be understood as a reader
  • PdfFileWriter can be understood For the writer

Next, we will further understand the wonders of these two tools through several cases. The sample file used is the pdf of 5 invoices

Summarize several methods of operating PDF with Python

Each invoice PDF consists of two pages:

03

MERGE

One job is to merge 5 invoice pdfs into 10 pages. How should the reader and writer cooperate here?

The logic is as follows:

  • The reader reads all pdfs once
  • The reader hands over the read content to the writer
  • The writer uniformly outputs to a new pdf

There is another important knowledge point here: the reader can only hand over the read content to the writer page by page.

Therefore, steps 1 and 2 in the logic are actually not independent steps , but after the reader reads a pdf, it cycles through all pages of the pdf Once, page by page is handed over to the writer. Finally, wait until all reading work is completed before outputting.

Looking at the code can make the idea clearer:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxxxxx'
pdf_writer = PdfFileWriter()

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

with open(path + r'\合并PDF\merge.pdf', 'wb') as out:
    pdf_writer.write(out)复制代码

Since all the content needs to be handed over to the same writer and finally output together, the initialization of the writer must be within the loop body External.

If it is inside the loop body, it will become Every time a pdf is accessed, a new writer is generated, so that each reader is handed over to the writer The content will be repeatedly overwritten, and our merging requirements cannot be achieved!

The code at the beginning of the loop body:

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))复制代码

The purpose is to read a new one each time it loops The pdf file is handed over to the reader for subsequent operations. In fact, this way of writing is not very recommended. Since the naming of each PDF happens to be very regular, you can directly specify the numbers for looping. A better way is to use the glob module:

import glob
for file in glob.glob(path + '/*.pdf'):
    pdf_reader = PdfFileReader(path)复制代码

pdf_reader.getNumPages(): in the code can get the number of pages in the reader, combined with range can traverse all pages of the reader.

pdf_writer.addPage(pdf_reader.getPage(page)) can hand over the current page to the writer.

Finally, use with to create a new pdf and output it through the pdf_writer.write(out) method of the writer.

04

Split

If you understand the cooperation of readers and writers in the merge operation, then splitting is easy to understand Okay, here we take splitting INV1.pdf into two separate pdf documents as an example. Let’s also walk through the logic first:

  • Reader reads PDF Document
  • The reader hands it over to the writer page by page
  • The writer immediately outputs every page it gets

Through this code logic, we also It can be understood that the initialization and output positions of the writer must be within the loop body of each page of the PDF reading loop, not outside the loop

The code is very simple:

from PyPDF2 import PdfFileReader, PdfFileWriter
path = r'C:\Users\xxx'
pdf_reader = PdfFileReader(path + '\INV1.pdf')

for page in range(pdf_reader.getNumPages()):
    # 遍历到每一页挨个生成写入器
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page))
    # 写入器被添加一页后立即输出产生pdf
    with open(path + '\INV1-{}.pdf'.format(page + 1), 'wb') as out:
        pdf_writer.write(out)复制代码

05

Watermark

This time the work is to add the following image as a watermark to INV1.pdf

Summarize several methods of operating PDF with Python

The first step is preparation. Insert the picture that needs to be used as a watermark into Word, adjust the appropriate position and save it as a PDF file. Then you can code. You need to use the copy module additionally. See the figure below for a detailed explanation:

Summarize several methods of operating PDF with Python

is to combine the reader and writer Initialize and read the watermarked PDF page first for later use. The core code is a little difficult to understand:

Summarize several methods of operating PDF with Python

Adding watermarks is essentially merging the watermarked PDF page with each page that needs to be watermarked

Because the PDF that needs to be watermarked There may be many pages, but the watermarked PDF only has one page, so if the watermarked PDFs are merged directly, it can be abstractly understood as after adding the first page, the watermarked PDF page will be gone.

Therefore, cannot be merged directly , but the watermarked PDF page must be continuously copy out into a new page for later use new_page, and then use it .mergePage The method completes the merge with each page, and hands the merged page to the writer for final unified output!

About the use of .mergePage:Appears on the page below.mergePage (Appears on the page above), the final effect is as shown below:

06

Encryption

Encryption is very simple, just remember: "Encryption is for writer encryption"

So you only need to call pdf_writer.encrypt (password)# after the relevant operation is completed.

## Take the encryption of a single PDF as an example:

Summarize several methods of operating PDF with Python
is written at the end

except for

PDF Merging, splitting, encryption, and watermarking, we can also usePython combined with Excel and Word to achieve more automation requirements, these are left to the readers to develop by themselves. Python resource sharing Junyang 1075110200, which contains installation packages, PDFs, and learning videos. This is a gathering place for Python learners, both zero-based and advanced are welcome

Finally, I hope everyone can understand Python office automation One core is

batch operation-free your hands and automate complex work!

More related free learning recommendations:

python tutorial (video)

The above is the detailed content of Summarize several methods of operating PDF with Python. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:juejin.im. If there is any infringement, please contact admin@php.cn delete