How to Extract Text from PDF Files using PDFMiner in Python with the Latest API Changes?

Linda Hamilton
2024-10-17 14:23:29

Text Extraction from PDF Files Using PDFMiner in Python

Extracting text from PDF files is a common first step when a document's contents need to be searched, indexed, or parsed into structured data. In Python, the third-party PDFMiner library handles this task; its actively maintained distribution is pdfminer.six. The PDFMiner API has changed over the years, however, so many older examples no longer run.

To address this, let's explore a working example of text extraction using the current version of PDFMiner:

<code class="python">from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path, password="", maxpages=0, pagenos=None):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()  # holds resources shared across pages, such as fonts
    retstr = StringIO()             # in-memory text buffer that collects the output
    # retstr already holds str, so no codec argument is passed.
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    try:
        with open(path, 'rb') as fp:
            # maxpages=0 means no page limit; pagenos=None selects every page.
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=True,
                                          check_extractable=True):
                interpreter.process_page(page)
        return retstr.getvalue()
    finally:
        # Release the converter and the buffer even if a page fails to parse.
        device.close()
        retstr.close()</code>

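This function takes the path to a PDF file and returns the extracted text as a single string. It processes every page of the document, because maxpages is 0 (no limit) and no specific page numbers are requested. The password is an empty string, which is sufficient for unencrypted files; to read a password-protected PDF, supply the document's password instead.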
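Because the text of a multi-page document comes back as one string, it can be useful to split it into pages again. The following is a minimal sketch, assuming the converter terminates each page with a form feed character, which is how TextConverter behaves in pdfminer.six at the time of writing; treat that as an assumption and verify it against your installed version. The helper name split_pages is hypothetical and not part of PDFMiner.

```python
FORM_FEED = "\f"  # assumed page terminator in the extracted text


def split_pages(text):
    """Split extracted text into a list of per-page strings.

    Assumes each page ends with a form feed character. This is an
    assumption about the converter's output, not a documented guarantee.
    """
    pages = text.split(FORM_FEED)
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    # Remove the surrounding whitespace that layout analysis tends to leave.
    return [page.strip() for page in pages]


# A hand-written string stands in for real extractor output here.
sample = "First page text\n\fSecond page text\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, page)
```

With a real document, the sample string would be replaced by the return value of convert_pdf_to_txt.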

With a current release of PDFMiner installed, this function gives your Python applications a straightforward way to extract text from PDF files.
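In an application, the function is often applied to a whole folder of documents. The sketch below shows one possible way to do that, under the assumption that each PDF should get a .txt file with the same name next to it. The helper convert_directory and its convert parameter are hypothetical names introduced for illustration; the converter is passed in as an argument so the helper does not depend on any particular extraction function.

```python
from pathlib import Path


def convert_directory(src_dir, convert):
    """Write a .txt file next to every .pdf file directly inside src_dir.

    `convert` is any callable that maps a PDF path (str) to its text,
    for example the convert_pdf_to_txt function from this article.
    Returns the list of text files that were written.
    """
    written = []
    # sorted() makes the processing order predictable across platforms.
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        text = convert(str(pdf_path))
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

A call such as convert_directory("reports", convert_pdf_to_txt) would then process every PDF in a folder named reports (a placeholder name). Subfolders are not searched in this sketch.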
