Python method to parse and read the contents of PDF files
This article mainly introduces the method of Python parsing and reading the content of PDF files. It also describes the relevant operating techniques of Python2.7 in win32 and win64 environments to read PDF in the form of examples. Friends in need can refer to the following
The examples in this article describe how Python parses and reads the contents of PDF files. Share it with everyone for your reference, the details are as follows:
1. Problem description
Use python to read the PDF text content.
2. Effect
# #3. Running environment
python2.74. Libraries that need to be installed
pip install pdfminer
5. Implementation source code
Code 1 (win64)
# coding=utf-8 import sys reload(sys) sys.setdefaultencoding('utf-8') import time time1=time.time() import os.path from pdfminer.pdfparser import PDFParser,PDFDocument from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import PDFPageAggregator from pdfminer.layout import LTTextBoxHorizontal,LAParams from pdfminer.pdfinterp import PDFTextExtractionNotAllowed result=[] class CPdf2TxtManager(): def __init__(self): ''''' Constructor ''' def changePdfToText(self, filePath): file = open(path, 'rb') # 以二进制读模式打开 #用文件对象来创建一个pdf文档分析器 praser = PDFParser(file) # 创建一个PDF文档 doc = PDFDocument() # 连接分析器 与文档对象 praser.set_document(doc) doc.set_parser(praser) # 提供初始化密码 # 如果没有密码 就创建一个空的字符串 doc.initialize() # 检测文档是否提供txt转换,不提供就忽略 if not doc.is_extractable: raise PDFTextExtractionNotAllowed # 创建PDf 资源管理器 来管理共享资源 rsrcmgr = PDFResourceManager() # 创建一个PDF设备对象 laparams = LAParams() device = PDFPageAggregator(rsrcmgr, laparams=laparams) # 创建一个PDF解释器对象 interpreter = PDFPageInterpreter(rsrcmgr, device) pdfStr = '' # 循环遍历列表,每次处理一个page的内容 for page in doc.get_pages(): # doc.get_pages() 获取page列表 interpreter.process_page(page) # 接受该页面的LTPage对象 layout = device.get_result() for x in layout: if hasattr(x, "get_text"): # print x.get_text() result.append(x.get_text()) fileNames = os.path.splitext(filePath) with open(fileNames[0] + '.txt','wb') as f: results = x.get_text() print(results) f.write(results + '\n') if __name__ == '__main__': ''''' 解析pdf 文本,保存到txt文件中 ''' path = u'C:/data3.pdf' pdf2TxtManager = CPdf2TxtManager() pdf2TxtManager.changePdfToText(path) # print result[0] time2 = time.time() print u'ok,解析pdf结束!' print u'总共耗时:' + str(time2 - time1) + 's'
Code 2 (win32)
# coding=utf-8 import sys reload(sys) sys.setdefaultencoding('utf-8') import time time1=time.time() import os.path from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import PDFPageAggregator from pdfminer.layout import LAParams from pdfminer.pdfpage import PDFTextExtractionNotAllowed from pdfminer.pdfparser import PDFParser from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfpage import PDFPage result=[] class CPdf2TxtManager(): def __init__(self): ''''' Constructor ''' def changePdfToText(self, filePath): file = open(path, 'rb') # 以二进制读模式打开 #用文件对象来创建一个pdf文档分析器 praser = PDFParser(file) # 创建一个PDF文档 doc = PDFDocument(praser) # 检测文档是否提供txt转换,不提供就忽略 if not doc.is_extractable: raise PDFTextExtractionNotAllowed # 创建PDf 资源管理器 来管理共享资源 rsrcmgr = PDFResourceManager() # 创建一个PDF设备对象 laparams = LAParams() device = PDFPageAggregator(rsrcmgr, laparams=laparams) # 创建一个PDF解释器对象 interpreter = PDFPageInterpreter(rsrcmgr, device) pdfStr = '' # 循环遍历列表,每次处理一个page的内容 for page in PDFPage.create_pages(doc): # doc.get_pages() 获取page列表 interpreter.process_page(page) # 接受该页面的LTPage对象 layout = device.get_result() for x in layout: if hasattr(x, "get_text"): # print x.get_text() result.append(x.get_text()) fileNames = os.path.splitext(filePath) with open(fileNames[0] + '.txt','wb') as f: results = x.get_text() print(results) f.write(results + '\n') if __name__ == '__main__': ''''' 解析pdf 文本,保存到txt文件中 ''' path = u'C:/36.pdf' pdf2TxtManager = CPdf2TxtManager() pdf2TxtManager.changePdfToText(path) # print result[0] time2 = time.time() print u'ok,解析pdf结束!' print u'总共耗时:' + str(time2 - time1) + 's'Related recommendations:
Python implements the method of grabbing HTML web pages and saving them as PDF files
The above is the detailed content of Python method to parse and read the contents of PDF files. For more information, please follow other related articles on the PHP Chinese website!

The basic syntax for Python list slicing is list[start:stop:step]. 1.start is the first element index included, 2.stop is the first element index excluded, and 3.step determines the step size between elements. Slices are not only used to extract data, but also to modify and invert lists.

Listsoutperformarraysin:1)dynamicsizingandfrequentinsertions/deletions,2)storingheterogeneousdata,and3)memoryefficiencyforsparsedata,butmayhaveslightperformancecostsincertainoperations.

ToconvertaPythonarraytoalist,usethelist()constructororageneratorexpression.1)Importthearraymoduleandcreateanarray.2)Uselist(arr)or[xforxinarr]toconvertittoalist,consideringperformanceandmemoryefficiencyforlargedatasets.

ChoosearraysoverlistsinPythonforbetterperformanceandmemoryefficiencyinspecificscenarios.1)Largenumericaldatasets:Arraysreducememoryusage.2)Performance-criticaloperations:Arraysofferspeedboostsfortaskslikeappendingorsearching.3)Typesafety:Arraysenforc

In Python, you can use for loops, enumerate and list comprehensions to traverse lists; in Java, you can use traditional for loops and enhanced for loops to traverse arrays. 1. Python list traversal methods include: for loop, enumerate and list comprehension. 2. Java array traversal methods include: traditional for loop and enhanced for loop.

The article discusses Python's new "match" statement introduced in version 3.10, which serves as an equivalent to switch statements in other languages. It enhances code readability and offers performance benefits over traditional if-elif-el

Exception Groups in Python 3.11 allow handling multiple exceptions simultaneously, improving error management in concurrent scenarios and complex operations.

Function annotations in Python add metadata to functions for type checking, documentation, and IDE support. They enhance code readability, maintenance, and are crucial in API development, data science, and library creation.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 English version
Recommended: Win version, supports code prompts!

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

Dreamweaver CS6
Visual web development tools

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),
