Extracting Text from PDFs with PDFMiner in Python
Question:
How can I extract text from a PDF file using PDFMiner in Python?
Answer:
Because PDFMiner's API has changed, some existing documentation contains outdated code. To extract text from a PDF file with a current version of PDFMiner, use a function like the following:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path`."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # Open in binary mode; the with-block closes the file even on errors.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()

    device.close()
    retstr.close()
    return text
This updated code reflects the changes in PDFMiner's API. It was verified on Python 3.7 (as of October 3, 2019) using the November 2018 release of pdfminer.six.
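The pagenos and maxpages arguments passed to PDFPage.get_pages decide which pages get processed. The sketch below is a simplified pure-Python model of that selection, not pdfminer's actual code. It assumes that page indices are zero-based, that an empty pagenos means "all pages", and that maxpages=0 means "no limit". The helper name select_pages is hypothetical and exists only for illustration.

```python
def select_pages(total_pages, pagenos=None, maxpages=0):
    """Return the zero-based page indices that would be processed.

    Simplified model (an assumption, not pdfminer's real implementation):
    - an empty or missing `pagenos` applies no filter
    - `maxpages=0` applies no limit
    """
    selected = []
    for pageno in range(total_pages):
        # Skip pages that are not requested when a filter is given.
        if pagenos and pageno not in pagenos:
            continue
        selected.append(pageno)
        # Stop once the page at position `maxpages` has been handled.
        if maxpages and maxpages <= pageno + 1:
            break
    return selected


print(select_pages(5))                  # [0, 1, 2, 3, 4]
print(select_pages(5, maxpages=2))      # [0, 1]
print(select_pages(5, pagenos={1, 3}))  # [1, 3]
```

Under these assumptions, extracting only the first page means passing maxpages=1 or pagenos={0}. How the two arguments behave when combined is best checked against the installed pdfminer.six version.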