How to extract PDF text in Python
This article shows you how to use Python to extract the text content of many PDF files in batches.
First, we import some modules for file operations.
import glob
import os
There are two folders in the demo directory, namely pdf and newpdf.
The PDF files are stored in the pdf folder, so we set that folder as the path.
pdf_path = "pdf/"
We want the paths of all the PDF files in that folder. With glob, this takes a single call.
pdfs = glob.glob(os.path.join(pdf_path, "*.pdf"))
Let's check whether the PDF file paths we obtained are correct.
pdfs
['pdf/复杂系统仿真的微博客虚假信息扩散模型研究.pdf', 'pdf/面向影子分析的社交媒体竞争情报搜集.pdf', 'pdf/面向人机协同的移动互联网政务门户探析.pdf']
The paths are correct.
Next, we use pdfminer to extract the content of the PDF files. To do so, we import the function extract_pdf_content from the helper Python file pdf_extractor.py.
from pdf_extractor import extract_pdf_content
Using this function, we extract the content of the first PDF in the list and save the text in the content variable.
content = extract_pdf_content(pdfs[0])
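The helper file pdf_extractor.py is not shown in this article. As a rough idea of what such a helper could look like, here is a minimal sketch that assumes the pdfminer.six package is installed and uses its extract_text function. The function body below is an assumption for illustration, not the original helper.

```python
from pathlib import Path


def extract_pdf_content(pdf_path):
    """Return the text of one PDF file as a single string.

    Hypothetical stand-in for the helper in pdf_extractor.py,
    which the article does not show.
    """
    path = Path(pdf_path)
    # Fail early with a clear error if the path does not point to a file.
    if not path.is_file():
        raise FileNotFoundError(str(path))
    # Imported here so this module can be loaded before pdfminer.six is installed.
    from pdfminer.high_level import extract_text
    # extract_text reads every page and returns the text as one string.
    return extract_text(str(path))
```

Saved as pdf_extractor.py next to the script, this would allow the import and the call above to run as written.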
The extraction is not perfect: headers, footers and other page information are mixed into the text. For many text analysis tasks, however, this does not matter.
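The steps above process only the first file, while the goal is to handle many PDF files in batches. One possible way to do that is sketched below. The function name extract_all and its extractor parameter are hypothetical and chosen for illustration; any function that takes a path and returns text can be passed in.

```python
import glob
import os


def extract_all(pdf_dir, extractor):
    """Map each PDF path found in pdf_dir to the text returned by extractor."""
    texts = {}
    # Sort the paths so the files are processed in a predictable order.
    for path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        try:
            texts[path] = extractor(path)
        except Exception as err:
            # One unreadable PDF should not stop the whole batch.
            print("skipped {}: {}".format(path, err))
    return texts
```

With the names used in this article, the call would be texts = extract_all(pdf_path, extract_pdf_content), after which texts[pdfs[0]] holds the same text as content.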