如何在 Python 中使用更新的 PDFMiner API 從 PDF 檔案中提取文字？-Python教學-PHP中文網

首頁

後端開發

Python教學

如何在 Python 中使用更新的 PDFMiner API 從 PDF 檔案中提取文字？

Mary-Kate Olsen

Oct 17, 2024 pm 02:25 PM

How to Extract Text from PDF Files Using Updated PDFMiner API in Python?

Extracting Text from PDF Files with PDFMiner in Python

When working with PDF documents, extracting text can be a crucial task. PDFMiner, a Python library, simplifies this process, enabling developers to parse and extract text from PDF files.

Updated PDFMiner API and Outdated Examples

Recent updates to PDFMiner have introduced changes to its API, rendering many existing examples obsolete. The transition to the latest version can leave developers lost, unsure how to perform basic tasks like text extraction.

Example Implementation

To address this issue, let's explore a working example that demonstrates how to extract text from a PDF file using the current PDFMiner library:

<code class="python">from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text</code>

This code provides a comprehensive approach to text extraction, covering all necessary steps. The convert_pdf_to_txt function takes a file path as input and handles the process of opening the file, initializing the document parser, and converting page content into a text string.

This example illustrates the updated PDFMiner syntax, eliminating the need for outdated code. It has been thoroughly tested and validated for use with the latest PDFMiner version.

以上是如何在 Python 中使用更新的 PDFMiner API 從 PDF 檔案中提取文字？的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

Python是否列表動態陣列或引擎蓋下的鏈接列表？May 07, 2025 am 12:16 AM

pythonlistsareimplementedasdynamicarrays，notlinkedlists.1）他們areStoredIncoNtiguulMemoryBlocks，mayrequireRealLealLocationWhenAppendingItems，EmpactingPerformance.2）LinkesedlistSwoldOfferefeRefeRefeRefeRefficeInsertions/DeletionsButslowerIndexeDexedAccess，Lestpypytypypytypypytypy

如何從python列表中刪除元素？May 07, 2025 am 12:15 AM

pythonoffersFourmainMethodStoreMoveElement Fromalist：1）刪除（值）emovesthefirstoccurrenceofavalue，2）pop（index）emovesanderturnsanelementataSpecifiedIndex，3）delstatementremoveselemsbybybyselementbybyindexorslicebybyindexorslice，and 4）

試圖運行腳本時，應該檢查是否會遇到'權限拒絕”錯誤？May 07, 2025 am 12:12 AM

toresolvea“ dermissionded”錯誤Whenrunningascript，跟隨台詞：1）CheckAndAdjustTheScript'Spermissions ofchmod xmyscript.shtomakeitexecutable.2）nesureThEseRethEserethescriptistriptocriptibationalocatiforecationAdirectorywherewhereyOuhaveWritePerMissionsyOuhaveWritePermissionsyYouHaveWritePermissions，susteSyAsyOURHomeRecretectory。

與Python的圖像處理中如何使用陣列？May 07, 2025 am 12:04 AM

ArraysarecrucialinPythonimageprocessingastheyenableefficientmanipulationandanalysisofimagedata.1)ImagesareconvertedtoNumPyarrays,withgrayscaleimagesas2Darraysandcolorimagesas3Darrays.2)Arraysallowforvectorizedoperations,enablingfastadjustmentslikebri

對於哪些類型的操作，陣列比列表要快得多？May 07, 2025 am 12:01 AM

ArraySaresificatificallyfasterthanlistsForoperationsBenefiting fromDirectMemoryAcccccccCesandFixed-Sizestructures.1）conscessingElements：arraysprovideconstant-timeaccessduetocontoconcotigunmorystorage.2）iteration：araysleveragececacelocality.3）

說明列表和數組之間元素操作的性能差異。May 06, 2025 am 12:15 AM

ArraySareBetterForlement-WiseOperationsDuetofasterAccessCessCessCessCessCessCessCessAndOptimizedImplementations.1）ArrayshaveContiguucuulmemoryfordirectAccesscess.2）列出sareflexible butslible butslowerduetynemicizing.3）

如何有效地對整個Numpy陣列進行數學操作？May 06, 2025 am 12:15 AM

在NumPy中进行整个数组的数学运算可以通过向量化操作高效实现。1)使用简单运算符如加法（arr 2）可对数组进行运算。2)NumPy使用C语言底层库，提升了运算速度。3)可以进行乘法、除法、指数等复杂运算。4)需注意广播操作，确保数组形状兼容。5)使用NumPy函数如np.sum()能显著提高性能。

您如何將元素插入python數組中？May 06, 2025 am 12:14 AM

在Python中，向列表插入元素有兩種主要方法：1)使用insert(index,value)方法，可以在指定索引處插入元素，但在大列表開頭插入效率低；2)使用append(value)方法，在列表末尾添加元素，效率高。對於大列表，建議使用append()或考慮使用deque或NumPy數組來優化性能。

See all articles