How to Extract Text from a PDF with PDFMiner in Python


Patricia Arquette

Oct 17, 2024 pm 02:26 PM


Question:

How can I extract text from a PDF file using PDFMiner in Python?

Answer:

PDFMiner's API has changed over time, so code in older documentation may no longer run. The following function extracts the text of a PDF file using the current API:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_pdf_text(path):
    """Return the text of the PDF file at `path` as a single string."""
    rsrcmgr = PDFResourceManager()  # shared store for fonts and other resources
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""  # password for encrypted PDFs; empty if none
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    try:
        # The context manager closes the file even if a page fails to parse.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        text = retstr.getvalue()
    finally:
        device.close()
        retstr.close()
    return text

This code reflects the changes to PDFMiner's API. It has been reported to work on Python 3.x, including Python 3.7 (checked on October 3, 2019), using the pdfminer.six release from November 2018.
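
The function above returns the whole document as one string. If per-page text is needed, the string can be split afterwards. The sketch below assumes that TextConverter writes a form feed character after each page, which is worth checking against the installed pdfminer.six version. The helper names split_pages and normalize_whitespace are hypothetical and not part of PDFMiner, and the sample string only stands in for real output of extract_pdf_text.

```python
import re


def split_pages(text):
    """Split extracted text into a list of per-page strings.

    Assumption: pages are separated by form feed characters.
    """
    pages = text.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def normalize_whitespace(page):
    """Collapse runs of spaces and tabs, and trim blank space around a page."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in page.splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    # Stand-in for the string returned by extract_pdf_text(path).
    sample = "First  page\n\n\fSecond\tpage\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(normalize_whitespace(page)))
```

Keeping the splitting and cleanup outside the extraction function means they work on plain strings, so they can be tested without a PDF file at hand.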
