
Example of using Python to convert PDF to txt

不言 (original)
2018-04-23 15:16:59

The following example shows how to use Python to convert a PDF file to a txt file. I hope it serves as a useful reference.

A classmate asked me about this a week ago. I was taking part in a Huawei competition at the time, so I only looked into it once the competition was over. I was told I would need the pdfminer package, so I installed it. Installation was simple:

sudo pip install pdfminer
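As a quick sanity check after installing, something like the following can confirm that the package is importable. This is only a minimal sketch; the helper name module_available is made up for illustration and is not part of pdfminer.

```python
import importlib


def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "pdfminer" is the import name of the package installed above.
    print("pdfminer importable:", module_available("pdfminer"))
```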

The installation finished without errors. As for how to call it, I had not studied the pdfminer library in any depth, so I turned to Baidu...

Official documentation: http://www.unixuser.org/~euske/python/pdfminer/index.html

The documentation lists the following features:

Written entirely in Python (works with version 2.4 or newer).

Parse, analyze, and convert PDF documents.

PDF-1.7 specification support. (Almost)

Support for Chinese, Japanese and Korean languages and for vertical writing scripts.

Support for various font types (Type1, TrueType, Type3, and CID).

Basic encryption (RC4) support.

PDF to HTML conversion.

Extraction of outline (TOC).

Tagged content extraction.

Rebuild the original layout by grouping text blocks.
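As an illustration of the outline (TOC) feature in the list above, here is a minimal sketch. It assumes the get_outlines() method and the PDFNoOutlines exception from pdfminer.pdfdocument; the helper names read_outline and format_outline are hypothetical, and test.pdf is only a placeholder file name.

```python
def format_outline(entries):
    """Turn (level, title) pairs into indented lines, one per outline entry."""
    return ["  " * max(level - 1, 0) + title for level, title in entries]


def read_outline(pdf_path):
    """Return (level, title) pairs from the PDF outline, or [] if it has none."""
    # Imported lazily so format_outline can be used without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # Each outline item is a 5-tuple; only level and title are kept.
            return [(item[0], item[1]) for item in document.get_outlines()]
        except PDFNoOutlines:
            return []


if __name__ == "__main__":
    # Placeholder path: point this at a PDF that has bookmarks.
    for line in format_outline(read_outline("test.pdf")):
        print(line)
```

Keeping the formatting separate from the PDF access means the indentation logic can be checked without a PDF at hand.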

Some basic classes

PDFParser: reads data from the file.

PDFDocument: stores the parsed data; it works together with PDFParser.

PDFPageInterpreter: processes the content of each page.

PDFDevice: translates the processed content into the format you need.

PDFResourceManager: stores shared resources such as fonts or images.

Simple implementation

The script below reads test.pdf and writes the extracted text to output.txt:

# -*- coding: utf-8 -*-
# On Python 3, the maintained fork pdfminer.six is the usual choice.
import io

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser

# The output file is opened once, in write mode, so that rerunning the
# script does not append duplicate text. io.open handles the UTF-8
# encoding on both Python 2 and Python 3.
with open('test.pdf', 'rb') as fp, io.open('output.txt', 'w', encoding='utf-8') as out:
    # Create a PDF parser for the input file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the file allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to hold shared resources
    rsrcmgr = PDFResourceManager()
    # Set the parameters for layout analysis
    laparams = LAParams()
    # Create a PDF device object that aggregates each page into a layout
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                out.write(x.get_text() + u'\n')
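For converting more than one file, the same steps could be wrapped in a function. The following is a minimal sketch under a few assumptions: the names pdf_to_txt and txt_path_for are hypothetical, the output file is assumed to sit next to the input with a .txt extension, and the extractability check is left out for brevity.

```python
import io
import os


def txt_path_for(pdf_path):
    """Return the output path: same name as the PDF, with a .txt extension."""
    root, _ext = os.path.splitext(pdf_path)
    return root + ".txt"


def pdf_to_txt(pdf_path, txt_path=None):
    """Write the text of every horizontal text box in pdf_path to txt_path."""
    # Imported here so txt_path_for works even without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBoxHorizontal
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    txt_path = txt_path or txt_path_for(pdf_path)
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp, \
            io.open(txt_path, "w", encoding="utf-8") as out:
        document = PDFDocument(PDFParser(fp))
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            for element in device.get_result():
                if isinstance(element, LTTextBoxHorizontal):
                    out.write(element.get_text())
    return txt_path


if __name__ == "__main__":
    # Placeholder file name, matching the script above.
    print(pdf_to_txt("test.pdf"))
```

Calling pdf_to_txt("test.pdf") would write test.txt next to the input; pass a second argument to choose a different output path.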
