Example of using Python to output PDF to TXT
This article shows how to use Python and the pdfminer package to extract the text of a PDF file into a TXT file. I hope it is a useful reference.
A classmate asked me about this a week ago. I was taking part in a Huawei competition at the time, so I only looked into it after the competition ended. The commonly recommended tool is the pdfminer package, so I installed it. The installation was very simple:
sudo pip install pdfminer
The installation finished without errors. I had not studied the pdfminer library in depth and did not know how to call it, so I started searching on Baidu...
Official documentation: http://www.unixuser.org/~euske/python/pdfminer/index.html
Features listed in the documentation:
Written entirely in Python (for Python 2.4 or newer).
Parses, analyzes, and converts PDF documents.
PDF-1.7 specification support (almost).
Support for Chinese, Japanese, and Korean languages and vertical writing scripts.
Support for various font types (Type1, TrueType, Type3, and CID).
Basic encryption (RC4) support.
PDF to HTML conversion.
Outline (TOC) extraction.
Tagged content extraction.
Reconstruction of the original layout by grouping text chunks.
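As a small illustration of the outline (TOC) feature in the list above, here is a rough sketch rather than a definitive implementation. It assumes that `PDFDocument.get_outlines()` yields `(level, title, dest, a, se)` tuples and raises `PDFNoOutlines` when the file has no outline. The helper names `format_outline` and `read_outline` are hypothetical and were made up for this example, and `test.pdf` is only a placeholder file name.

```python
def format_outline(entries, indent='  '):
    """Turn (level, title) pairs into indented lines, one per outline entry."""
    # Top-level entries are expected at level 1, so they get no indentation;
    # max() keeps the indentation non-negative if a level of 0 ever shows up.
    return [indent * max(level - 1, 0) + title for level, title in entries]


def read_outline(pdf_path):
    """Return (level, title) pairs from the PDF outline, or [] if there is none."""
    # Imported here so that format_outline can be used without pdfminer installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines

    with open(pdf_path, 'rb') as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # Each outline item is (level, title, dest, action, structure element).
            return [(level, title)
                    for level, title, _dest, _a, _se in document.get_outlines()]
        except PDFNoOutlines:
            return []


# Example usage (requires pdfminer and an existing test.pdf):
# for line in format_outline(read_outline('test.pdf')):
#     print(line)
```

Keeping the formatting separate from the PDF reading means the indentation logic can be checked with plain tuples, without needing a PDF file.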
Some basic classes
PDFParser: fetches data from a file
PDFDocument: stores the parsed data and works together with PDFParser
PDFPageInterpreter: processes the page contents
PDFDevice: translates the processed contents into the format you need
PDFResourceManager: stores shared resources such as fonts or images
Simple implementation
Read test.pdf and write the extracted text to output.txt:
# -*- coding: utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# Open the PDF in binary mode and the output file as UTF-8 text (Python 3)
with open('test.pdf', 'rb') as fp, open('output.txt', 'w', encoding='utf-8') as out:
    # Create a PDF parser for the file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the document allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to store shared resources
    rsrcmgr = PDFResourceManager()
    # Set the parameters for layout analysis
    laparams = LAParams()
    # Create a PDF device object that aggregates each page into a layout
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            # Keep only horizontal text boxes and write their text
            if isinstance(x, LTTextBoxHorizontal):
                out.write(x.get_text() + '\n')
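For comparison, here is a shorter sketch of the same PDF-to-TXT conversion. It is a different technique from the class-based code above: it assumes the pdfminer.six fork is installed (pip install pdfminer.six) and uses that fork's `pdfminer.high_level.extract_text` function, which, as far as I know, the original pdfminer package does not provide. The names `txt_path_for` and `pdf_to_txt` are hypothetical helpers made up for this example.

```python
import os


def txt_path_for(pdf_path):
    """Return pdf_path with its extension replaced by .txt."""
    root, _ext = os.path.splitext(pdf_path)
    return root + '.txt'


def pdf_to_txt(pdf_path, txt_path=None):
    """Extract all text from pdf_path and write it to txt_path as UTF-8."""
    # Assumption: pdfminer.six is installed; imported here so that
    # txt_path_for can be used even when it is not.
    from pdfminer.high_level import extract_text

    if txt_path is None:
        txt_path = txt_path_for(pdf_path)
    text = extract_text(pdf_path)
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return txt_path


# Example usage (requires pdfminer.six and an existing test.pdf):
# pdf_to_txt('test.pdf', 'output.txt')
```

The high-level call runs its own layout analysis, so the output may not match the class-based version line for line. That version remains the better choice when you need to filter specific layout objects such as LTTextBoxHorizontal.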