How to Convert PDF to Word with Two Lines of Python Code

王林

Apr 28, 2023 pm 06:25 PM

1. Install the dependency

pip install --index-url https://pypi.mirrors.ustc.edu.cn/simple/ python-office
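
After installing, it may help to confirm that the package can be imported before running the conversion. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and `office` is the module name that the import in the next section uses:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'office' is the top-level module imported in the conversion code.
    if is_installed('office'):
        print('python-office is available')
    else:
        print('python-office is missing; run the pip command above')
```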

2. Convert PDF to Word

2.1 Code

import office
office.pdf.pdf2docx(file_path='test.pdf')

Running it produces output like the following:

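[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] 'created' timestamp seems very low; regarding as unix timestamp
[WARNING] 'modified' timestamp seems very low; regarding as unix timestamp
[WARNING] 'created' timestamp seems very low; regarding as unix timestamp
[WARNING] 'modified' timestamp seems very low; regarding as unix timestamp
[INFO] [3/4] Parsing pages...
[INFO] (1/9) Page 1
[INFO] (2/9) Page 2
[INFO] (3/9) Page 3
[INFO] (4/9) Page 4
[INFO] (5/9) Page 5
[INFO] (6/9) Page 6
[INFO] (7/9) Page 7
[INFO] (8/9) Page 8
[INFO] (9/9) Page 9
[INFO] [4/4] Creating pages...
[INFO] (1/9) Page 1
[INFO] (2/9) Page 2
[INFO] (3/9) Page 3
[INFO] (4/9) Page 4
[INFO] (5/9) Page 5
[INFO] (6/9) Page 6
[INFO] (7/9) Page 7
[INFO] (8/9) Page 8
[INFO] (9/9) Page 9
[INFO] Terminated in 1.30s.

Process finished with exit code 0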
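
The log format above matches what the pdf2docx library prints, which python-office appears to wrap. Assuming pdf2docx is installed, here is a sketch of calling it directly, which also allows converting only a page range. The helper names `docx_path_for` and `convert_pages` are made up for this illustration:

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the .docx path that sits next to the given PDF."""
    return str(Path(pdf_path).with_suffix('.docx'))


def convert_pages(pdf_path, start=0, end=None):
    """Convert pages [start, end) of a PDF; end=None means up to the last page."""
    # Imported here so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = docx_path_for(pdf_path)
    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path, start=start, end=end)
    finally:
        converter.close()  # release the PDF file handle even on failure
    return docx_path


if __name__ == '__main__':
    print(docx_path_for('test.pdf'))  # test.docx
```

With this sketch, `convert_pages('test.pdf', start=0, end=3)` would convert only the first three pages, since page indexes are zero-based and `end` is exclusive.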

2.2 PDF content

[Image: the source PDF]

2.3 Converted Word document

[Image: the converted Word document]

As the comparison shows, the result is quite good.
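
To apply the same call to every PDF in a folder, a small wrapper along these lines might be enough. This is a sketch: `convert_folder` is a hypothetical helper, and the converter is passed in as a parameter so that the loop can be tried out without python-office installed. By default it falls back to the call from section 2.1.

```python
from pathlib import Path


def convert_folder(folder, convert=None):
    """Run the converter on every .pdf in a folder and return the file names handled."""
    if convert is None:
        # Default to the call shown in section 2.1; needs python-office installed.
        import office

        def convert(path):
            office.pdf.pdf2docx(file_path=path)

    handled = []
    for pdf in sorted(Path(folder).glob('*.pdf')):
        convert(str(pdf))
        handled.append(pdf.name)
    return handled
```

For example, `convert_folder('pdfs')` would convert each PDF in a folder named `pdfs` (a placeholder name), one after another.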

Supplement

Besides the method above, here are a few more ways to convert PDF to Word in Python, for reference.

Method 1:

import os
from concurrent.futures import ProcessPoolExecutor
from configparser import ConfigParser
from io import StringIO

# Note: process_pdf is provided by the pdfminer3k package; pdfminer.six does not ship it.
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document  # from the python-docx package


def read_from_pdf(file_path):
    """Extract all text from a PDF and return it as one string."""
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()

        device = TextConverter(
            resource_manager, return_str, laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()

        content = return_str.getvalue()
        return_str.close()
        return content


def save_text_to_word(content, file_path):
    """Write the text to a .docx file, one paragraph per line."""
    doc = Document()
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)


def remove_control_characters(content):
    """Drop code points 0-31, some of which are not valid in the XML that python-docx writes."""
    mpa = dict.fromkeys(range(32))
    return content.translate(mpa)


def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    save_text_to_word(content, word_file_path)


def main():
    # config.cfg must have a [default] section with max_worker, pdf_folder and word_folder.
    config_parser = ConfigParser()
    config_parser.read('config.cfg')
    config = config_parser['default']

    tasks = []
    with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
        for file in os.listdir(config['pdf_folder']):
            file_name, extension_name = os.path.splitext(file)
            if extension_name != '.pdf':
                continue
            pdf_file = os.path.join(config['pdf_folder'], file)
            word_file = os.path.join(config['word_folder'], file_name + '.docx')
            print('Processing:', file)
            tasks.append(executor.submit(pdf_to_word, pdf_file, word_file))

    # Leaving the with block waits for every task; result() re-raises any worker error.
    for task in tasks:
        task.result()
    print('Done')


if __name__ == '__main__':
    main()
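
The script reads its settings from a `config.cfg` file that the article does not show. Below is a sketch of what that file might contain, inferred from the keys `main()` looks up (`max_worker`, `pdf_folder` and `word_folder` in a `[default]` section). The values and the helpers `write_sample_config` and `read_config` are placeholders for illustration:

```python
from configparser import ConfigParser


def write_sample_config(path='config.cfg'):
    """Write a config file with the three keys that main() reads."""
    parser = ConfigParser()
    parser['default'] = {
        'max_worker': '4',        # placeholder: number of worker processes
        'pdf_folder': './pdf',    # placeholder: folder holding the input PDFs
        'word_folder': './word',  # placeholder: folder for the .docx output
    }
    with open(path, 'w', encoding='utf-8') as handle:
        parser.write(handle)


def read_config(path='config.cfg'):
    """Load the [default] section the same way main() does."""
    parser = ConfigParser()
    parser.read(path, encoding='utf-8')
    return parser['default']
```

Both folders have to exist before the script runs, since it lists the input folder and saves into the output folder without creating either.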

Method 2:

Converting an encrypted PDF to Word

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Note: this uses the pdfminer3k API (PDFDocument imported from pdfminer.pdfparser, doc.initialize()).
import os

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal

# Set the working directory (change this to your own folder)
os.chdir(r'c:/users/dicey/desktop/codes/pdf-docx')


# Parse a PDF file and append its text to test2.doc
def parse(pdf_path):
    with open(pdf_path, 'rb') as fp:  # open in binary read mode
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document object
        doc = PDFDocument()
        # Connect the parser and the document object
        parser.set_document(doc)
        doc.set_parser(parser)
        # Supply the initial password; if there is none, an empty string is used
        doc.initialize()
        # Stop if the document does not allow text extraction
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
        # Output file name and path; this is plain text saved under a .doc name
        with open(r'test2.doc', 'a', encoding='utf-8') as f:
            # Process one page at a time
            for page in doc.get_pages():  # doc.get_pages() returns the pages
                num_page += 1
                interpreter.process_page(page)
                # Receive the LTPage object for this page
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTImage):  # image object
                        num_image += 1
                    if isinstance(x, LTCurve):  # curve object
                        num_curve += 1
                    if isinstance(x, LTFigure):  # figure object
                        num_figure += 1
                    if isinstance(x, LTTextBoxHorizontal):  # text content
                        num_TextBoxHorizontal += 1
                        # Save the text content
                        f.write(x.get_text())
                        f.write('\n')
        print('Object counts:')
        print('Pages: %s' % num_page)
        print('Images: %s' % num_image)
        print('Curves: %s' % num_curve)
        print('Figures: %s' % num_figure)
        print('Horizontal text boxes: %s' % num_TextBoxHorizontal)


if __name__ == '__main__':
    pdf_path = r'diya.pdf'  # path and file name of the PDF
    parse(pdf_path)
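
The separate per-type counters in the loop above could be collapsed into a single `collections.Counter` keyed by class name. A standalone sketch: the stand-in classes only borrow the pdfminer layout class names so that the snippet runs without pdfminer, and `tally_layout` is a hypothetical helper.

```python
from collections import Counter


def tally_layout(layout_objects):
    """Count layout objects by their exact class name."""
    return Counter(type(obj).__name__ for obj in layout_objects)


# Stand-ins for pdfminer's layout classes, used only for this demo.
class LTImage:
    pass


class LTCurve:
    pass


class LTTextBoxHorizontal:
    pass


if __name__ == '__main__':
    page = [LTTextBoxHorizontal(), LTImage(), LTTextBoxHorizontal(), LTCurve()]
    counts = tally_layout(page)
    # A type that never appeared simply reads as 0.
    print(counts['LTTextBoxHorizontal'], counts['LTImage'], counts['LTFigure'])  # 2 1 0
```

One difference from the `isinstance` checks is that this counts exact classes only, so a subclass such as pdfminer's `LTRect` would be tallied under its own name rather than as `LTCurve`.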

Statement

This article is reproduced from 亿速云. In case of infringement, please contact admin@php.cn to have it removed.
