How to extract PDF text in Python

(*-*)浩 (original), 2019-06-29 11:42:21

This article shows you how to use Python to extract the text content of many PDF files in batches.

First, we import a couple of modules for file operations.

import glob
import os

There are two folders in the demo directory, namely pdf and newpdf.

We set the path that holds the PDF files to the pdf folder.

pdf_path = "pdf/"

We want the paths of all the PDF files in that folder. With glob, this takes a single call.

pdfs = glob.glob(os.path.join(pdf_path, "*.pdf"))

Let's check that the PDF file paths we obtained are correct.

pdfs
['pdf/复杂系统仿真的微博客虚假信息扩散模型研究.pdf',
'pdf/面向影子分析的社交媒体竞争情报搜集.pdf',
'pdf/面向人机协同的移动互联网政务门户探析.pdf']

The paths are correct.
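
As a small variation on the glob call above, the sketch below wraps the lookup in a function and sorts the result, since glob does not guarantee any particular order. The function name list_pdfs is only an illustrative choice and is not part of the original code.

```python
import glob
import os


def list_pdfs(folder):
    """Return the paths of all .pdf files directly inside folder, sorted by name."""
    pattern = os.path.join(folder, "*.pdf")
    # glob returns matches in arbitrary order, so sort for reproducible runs.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in list_pdfs("pdf"):
        print(path)
```

Sorting makes the meaning of pdfs[0] stable from one run to the next, which helps when a single file is used as a test case.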

Next, we use pdfminer to extract the content of the PDF files. To do so, we import the function extract_pdf_content from the helper Python file pdf_extractor.py.

from pdf_extractor import extract_pdf_content
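
The article does not show pdf_extractor.py itself. As a rough idea of what such a helper could contain, here is a minimal sketch that assumes the pdfminer.six package is installed and delegates to its high-level extract_text function. This is an assumption about the helper's contents, not the author's actual file.

```python
import os


def extract_pdf_content(pdf_path):
    """Return the text of the PDF at pdf_path as one string (hypothetical helper)."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)
    # Imported here so that a missing file is reported before pdfminer is needed.
    # Assumes pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)
```

Saved as pdf_extractor.py next to the script, a file like this would make the import above work.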

Using this function, we extract the content of the first article in the PDF file list and save the text in the content variable.

content = extract_pdf_content(pdfs[0])
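
Since the goal is batch extraction, the same call can be repeated for every path in pdfs. The sketch below is one way to do that. It assumes the extracted text should be written as .txt files into the newpdf folder, which the article mentions but never uses, so that destination is a guess. The function name extract_all and its parameters are illustrative.

```python
import os


def extract_all(pdf_paths, extractor, out_dir):
    """Run extractor on each PDF path and save the text as a .txt file in out_dir.

    Returns the list of text file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for pdf_path in pdf_paths:
        text = extractor(pdf_path)
        # Reuse the PDF's base name, swapping the extension for .txt.
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, stem + ".txt")
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(txt_path)
    return written
```

With the names used in this article, the call would be extract_all(pdfs, extract_pdf_content, "newpdf"). Passing the extractor in as an argument keeps the loop independent of any particular PDF library.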

The extraction is not perfect: headers, footers, and other stray information are mixed into the text. For many text analysis tasks, however, this does not matter.
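
If the headers and footers do get in the way, one common heuristic is to drop lines that repeat on most pages. The sketch below assumes that pages in the extracted text are separated by form-feed characters ("\f"), which pdfminer.six's text output normally places between pages; whether extract_pdf_content preserves them is not known from the article. The function name drop_repeated_lines and the 0.5 threshold are illustrative choices.

```python
from collections import Counter


def drop_repeated_lines(text, min_share=0.5):
    """Remove lines that occur on more than min_share of the pages."""
    pages = [p for p in text.split("\f") if p.strip()]
    if len(pages) < 2:
        return text  # too few pages to tell what repeats
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in counts.items() if n / len(pages) > min_share}
    cleaned_pages = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned_pages.append("\n".join(kept))
    return "\f".join(cleaned_pages)
```

This only catches lines that are identical from page to page, such as a journal name. Page numbers change on every page and would need separate handling.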
