How to use two lines of Python code to convert pdf to word

    1. Install the dependency package

    pip install --index-url https://pypi.mirrors.ustc.edu.cn/simple/ python-office

    2. Converting PDF to Word

    2.1 Code

    import office
    office.pdf.pdf2docx(file_path='test.pdf')

    Running it produces the following output:

    [1/4] Opening document...
    [INFO] [2/4] Analyzing document...
    [WARNING] 'created' timestamp seems very low; regarding as unix timestamp
    [WARNING] 'modified' timestamp seems very low; regarding as unix timestamp
    [WARNING] 'created' timestamp seems very low; regarding as unix timestamp
    [WARNING] 'modified' timestamp seems very low; regarding as unix timestamp
    [INFO] [3/4] Parsing pages...
    [INFO] (1/9) Page 1
    [INFO] (2/9) Page 2
    [INFO] (3/9) Page 3
    [INFO] (4/9) Page 4
    [INFO] (5/9) Page 5
    [INFO] (6/9) Page 6
    [INFO] (7/9) Page 7
    [INFO] (8/9) Page 8
    [INFO] (9/9) Page 9
    [INFO] [4/4] Creating pages...
    [INFO] (1/9) Page 1
    [INFO] (2/9) Page 2
    [INFO] (3/9) Page 3
    [INFO] (4/9) Page 4
    [INFO] (5/9) Page 5
    [INFO] (6/9) Page 6
    [INFO] (7/9) Page 7
    [INFO] (8/9) Page 8
    [INFO] (9/9) Page 9
    [INFO] Terminated in 1.30s.
     
    Process finished with exit code 0
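
    The log format above matches what the pdf2docx package prints, so python-office presumably delegates the conversion to it (an inference, not something the source states). As a minimal sketch under that assumption, the same conversion can be written against pdf2docx directly. The helper names docx_path_for and convert are hypothetical and only used for illustration.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return a .docx path next to the given PDF (hypothetical helper)."""
    return str(Path(pdf_path).with_suffix('.docx'))


def convert(pdf_path, docx_path=None):
    """Convert a PDF to .docx with pdf2docx and return the output path."""
    # Imported lazily so docx_path_for() works even without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = docx_path or docx_path_for(pdf_path)
    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path)  # converts every page by default
    finally:
        cv.close()  # release the underlying PDF file handle
    return docx_path
```

    With pdf2docx installed, calling convert('test.pdf') would write test.docx next to the input file.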

    2.2 PDF content

    [Image: screenshot of the source PDF]

    2.3 The converted Word document

    [Image: screenshot of the converted Word document]

    As the comparison shows, the result is quite good.

    Supplement

    Besides the approach above, here are a few more ways to convert PDF to Word in Python, for reference.

    Method 1:

    import os
    from configparser import ConfigParser
    from io import StringIO
    from concurrent.futures import ProcessPoolExecutor
    
    # process_pdf comes from the older pdfminer API as packaged in pdfminer3k
    # (pip install pdfminer3k); Document comes from python-docx.
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from docx import Document
    
    
    def read_from_pdf(file_path):
        with open(file_path, 'rb') as file:
            resource_manager = PDFResourceManager()
            return_str = StringIO()
            lap_params = LAParams()
    
            device = TextConverter(
                resource_manager, return_str, laparams=lap_params)
            process_pdf(resource_manager, device, file)
            device.close()
    
            content = return_str.getvalue()
            return_str.close()
            return content
    
    
    def save_text_to_word(content, file_path):
        doc = Document()
        for line in content.split('\n'):
            paragraph = doc.add_paragraph()
            paragraph.add_run(remove_control_characters(line))
        doc.save(file_path)
    
    
    def remove_control_characters(content):
        mpa = dict.fromkeys(range(32))
        return content.translate(mpa)
    
    
    def pdf_to_word(pdf_file_path, word_file_path):
        content = read_from_pdf(pdf_file_path)
        save_text_to_word(content, word_file_path)
    
    
    def main():
        # config.cfg needs a [default] section with the keys
        # max_worker, pdf_folder and word_folder.
        config_parser = ConfigParser()
        config_parser.read('config.cfg')
        config = config_parser['default']
    
        tasks = []
        with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
            for file in os.listdir(config['pdf_folder']):
                file_name, extension_name = os.path.splitext(file)
                if extension_name != '.pdf':
                    continue
                pdf_file = os.path.join(config['pdf_folder'], file)
                word_file = os.path.join(config['word_folder'], file_name + '.docx')
                print('Processing:', file)
                tasks.append(executor.submit(pdf_to_word, pdf_file, word_file))
        # Leaving the with block waits for all workers to finish, so no polling
        # loop is needed; result() re-raises any exception from a worker.
        for task in tasks:
            task.result()
        print('Done')
    
    
    if __name__ == '__main__':
        main()
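
    The script reads its settings from config.cfg, which the article does not show. The section name default and the keys max_worker, pdf_folder and word_folder come from the code above; the values below are made-up examples. A minimal sketch of such a file, parsed the same way main() does:

```python
from configparser import ConfigParser

# Example values only (assumptions); adjust the folders and worker count.
SAMPLE_CONFIG = """\
[default]
max_worker = 4
pdf_folder = ./pdf
word_folder = ./word
"""

config_parser = ConfigParser()
config_parser.read_string(SAMPLE_CONFIG)  # main() uses read('config.cfg') instead
config = config_parser['default']

# ConfigParser returns strings, hence the int() conversion in main().
max_workers = int(config['max_worker'])
print(max_workers, config['pdf_folder'], config['word_folder'])
```

    Saving the text of SAMPLE_CONFIG as config.cfg next to the script should be enough for main() to find its folders.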

    Method 2:

    Converting an encrypted PDF to Word

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # This follows the pdfminer3k API, where PDFDocument is imported from
    # pdfminer.pdfparser and initialized with doc.initialize().
    import os
    
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal
    
    # Set the working directory
    os.chdir(r'c:/users/dicey/desktop/codes/pdf-docx')
    
    
    # Parse a PDF file and append its text to test2.doc
    def parse(pdf_path):
        with open(pdf_path, 'rb') as fp:  # open in binary read mode
            # Create a PDF parser from the file object
            parser = PDFParser(fp)
            # Create a PDF document object
            doc = PDFDocument()
            # Connect the parser and the document object
            parser.set_document(doc)
            doc.set_parser(parser)
            # Supply the initial password; with no argument an empty string is used
            doc.initialize()
            # Check whether the document allows text extraction; raise if it does not
            if not doc.is_extractable:
                raise PDFTextExtractionNotAllowed
            # Create a PDF resource manager to manage shared resources
            rsrcmgr = PDFResourceManager()
            # Create a PDF device object
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create a PDF interpreter object
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Counters for pages, images, curves, figures and horizontal text boxes
            num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
            # Output file name and path. The content is plain text saved with a
            # .doc extension, which Word can open; it is not a native Word file.
            with open(r'test2.doc', 'a', encoding='utf-8') as f:
                # Process one page at a time
                for page in doc.get_pages():  # doc.get_pages() yields the pages
                    num_page += 1
                    interpreter.process_page(page)
                    # Receive the LTPage object for this page
                    layout = device.get_result()
                    for x in layout:
                        if isinstance(x, LTImage):  # image object
                            num_image += 1
                        if isinstance(x, LTCurve):  # curve object
                            num_curve += 1
                        if isinstance(x, LTFigure):  # figure object
                            num_figure += 1
                        if isinstance(x, LTTextBoxHorizontal):  # text content
                            num_TextBoxHorizontal += 1
                            # Save the text content
                            f.write(x.get_text())
                            f.write('\n')
        print('Object counts:\n', 'Pages: %s\n' % num_page, 'Images: %s\n' % num_image,
              'Curves: %s\n' % num_curve, 'Horizontal text boxes: %s\n' % num_TextBoxHorizontal)
    
    
    if __name__ == '__main__':
        pdf_path = r'diya.pdf'  # path and file name of the PDF
        parse(pdf_path)
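
    Method 2 writes plain text into a file that merely carries a .doc extension. As a hedged alternative sketch (not the code the article uses), the text can be pulled out with the extract_text function from pdfminer.six and written to a real .docx with python-docx. This assumes pdfminer.six is installed instead of pdfminer3k, since both install a package named pdfminer. The names clean_line and pdf_to_docx_text are hypothetical.

```python
def clean_line(line):
    """Drop ASCII control characters (code points 0-31), most of which
    are not valid in the XML that .docx files use."""
    return line.translate(dict.fromkeys(range(32)))


def pdf_to_docx_text(pdf_path, docx_path, password=''):
    """Copy the text of a PDF into a .docx file, one paragraph per line."""
    # Lazy imports: pdfminer.six and python-docx are only needed here.
    from pdfminer.high_level import extract_text
    from docx import Document

    # password is the PDF's user password; '' matches Method 2's default.
    text = extract_text(pdf_path, password=password)
    doc = Document()
    for line in text.split('\n'):
        doc.add_paragraph(clean_line(line))
    doc.save(docx_path)
```

    For example, pdf_to_docx_text('diya.pdf', 'test2.docx') would produce a file that Word opens as a native document. Only text is carried over; layout, images and fonts are lost, as in Methods 1 and 2.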


    Statement
    This article is reproduced from 亿速云. If there is any infringement, please contact admin@php.cn to have it removed.