Extracting Text from PDF Files with PDFMiner in Python
Question:
How can I extract text from a PDF file using the latest version of PDFMiner in Python?
Answer:
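PDFMiner's API has changed significantly between releases, so many older examples no longer run. With the current version, you set up a resource manager, a text converter, and a page interpreter, then feed each page through the interpreter: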
<code class="python">from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()    # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0           # 0 means no page limit
    caching = True
    pagenos = set()        # an empty set means all pages

    # The context manager closes the file even if parsing raises.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text</code>
Note: This solution follows the updated API, in which pages are iterated with PDFPage.get_pages rather than through the document object. With maxpages = 0 and an empty pagenos set, every page in the file is processed.
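If only some pages are needed, the pagenos argument can be filled instead of left empty. The following is a minimal sketch, not part of the original answer: the names to_zero_based_pages and convert_pages_to_txt and the "1-3,5" spec format are hypothetical, and it assumes pagenos holds zero-based page indices. The pdfminer imports sit inside the function only so that the helper can be tried without the library installed.

```python
from io import StringIO


def to_zero_based_pages(spec):
    """Turn a 1-based spec such as "1-3,5" into a set of 0-based page indices.

    Hypothetical helper; the spec format is an assumption for illustration.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            # range() excludes its end, so int(last) is the 0-based stop.
            pages.update(range(int(first) - 1, int(last)))
        else:
            pages.add(int(part) - 1)
    return pages


def convert_pages_to_txt(path, spec):
    # Same pdfminer names as in the answer above, imported lazily.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(path, 'rb') as fp:
        # Only pages whose index is in the set are yielded.
        for page in PDFPage.get_pages(fp, to_zero_based_pages(spec)):
            interpreter.process_page(page)

    device.close()
    return retstr.getvalue()
```

A call such as convert_pages_to_txt("report.pdf", "1-3,5") (the file name is a placeholder) would then return the text of pages 1, 2, 3 and 5. An empty spec produces an empty set, which falls back to processing all pages.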