Extract text and images from Word documents using Python-Python Tutorial-php.cn

Extract text and images from Word documents using Python

王林

Aug 28, 2023 pm 06:21 PM

Tags: python, picture, text, extract

Extracting content from Word documents lets us reuse it elsewhere: storing it in a database, importing it into other programs, preparing training data for artificial intelligence, or creating other documents. Spire.Doc for Python makes it possible to extract text and images from Word documents without extensive copy-and-paste or complex coding. This article explains how to extract and save the text and image content of a Word document with a few lines of code.

Install Spire.Doc for Python

Before you can use this library to work with a Word document, you must install it and import it into your project. You can download it from the official Spire.Doc for Python website or install it directly with pip. The commands look like this:

pip install Spire.Doc
pip install plum-dispatch==1.7.4
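
To confirm that the installation worked before running the examples, a quick check along the following lines may help. This is only a sketch: it assumes the top-level import name is `spire` (as the `from spire.doc import *` lines later in this article suggest), and `is_installed` is a helper name made up for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    # find_spec returns None when a top-level module cannot be located
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "spire" is assumed to be the top-level package provided by Spire.Doc
    print("Spire.Doc available:", is_installed("spire"))
```

If the check prints `False`, the pip commands above were most likely run in a different Python environment than the one executing the script.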

Figure: Sample document

Extract text from Word document and write to TXT file

The Document.GetText() method of Spire.Doc for Python retrieves all the text in a Word document and returns it as a string. We can write the returned string to a text file for storage. The steps are as follows:

Create a Document object.
Load a Word document with the Document.LoadFromFile() method.
Retrieve the document's text as a string with the Document.GetText() method.
Write the string to a text file.

Code example

from spire.doc import *
from spire.doc.common import *

def WriteAllText(fname: str, text: str):
    # Write the extracted string to a UTF-8 encoded text file
    with open(fname, "w", encoding="utf-8") as fp:
        fp.write(text)

inputFile = "Sample.docx"
outputFile = "ExtractedText.txt"

# Create a Document object
document = Document()

# Load the Word document
document.LoadFromFile(inputFile)

# Get the text of the document as a string
text = document.GetText()

# Write the text to a text file
WriteAllText(outputFile, text)
document.Close()

Figure: Extracted text

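Extract images from a Word document and save them

Extracting images is somewhat more complex: we need to traverse the document elements and check whether their child objects contain images. The steps are as follows:

Create a Document object.
Load a Word document with the Document.LoadFromFile() method.
Create a queue and add the document to it.
Take elements from the queue one by one and loop through their child objects.
If a child object is a picture (DocumentObjectType.Picture), read its ImageBytes and add them to a list.
If a child object is a composite object, add it to the queue so that its own children are checked as well.
Write each item in the list to an image file.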
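
The traversal behind these steps is a breadth-first walk over a tree of document elements. The following sketch shows the idea on a toy tree, without Spire.Doc. The `Node` class and the `collect_pictures` function are hypothetical stand-ins invented for illustration and are not part of the library.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a document element."""

    def __init__(self, kind, children=None, data=None):
        self.kind = kind
        self.children = children or []
        self.data = data


def collect_pictures(root):
    found = []
    pending = deque([root])  # elements whose children still need checking
    while pending:
        node = pending.popleft()
        for child in node.children:
            if child.kind == "picture":
                # Leaf we are interested in: keep its payload
                found.append(child.data)
            elif child.children:
                # Composite element: visit its children later
                pending.append(child)
    return found


doc = Node("document", [
    Node("section", [
        Node("paragraph", [Node("picture", data=b"A"), Node("text")]),
        Node("paragraph", [Node("picture", data=b"B")]),
    ]),
    Node("picture", data=b"C"),
])
print(collect_pictures(doc))  # [b'C', b'A', b'B']
```

Pictures closer to the root are found first, which is why `b"C"` appears before the pictures nested inside paragraphs.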

Code example

import os
import queue
from spire.doc import *
from spire.doc.common import *

outputPath = "Images/"
inputFile = "Sample.docx"

# Create the output directory if it does not exist
os.makedirs(outputPath, exist_ok=True)

# Create a Document object
document = Document()

# Load the Word document
document.LoadFromFile(inputFile)

# Create a queue and add the document as the first element
nodes = queue.Queue()
nodes.put(document)

# Create a list to hold the image data
images = []

# Traverse the document elements
while not nodes.empty():
    node = nodes.get()
    for i in range(node.ChildObjects.Count):
        # Get the child object of the document element
        child = node.ChildObjects.get_Item(i)
        # Check whether the child is a picture
        if child.DocumentObjectType == DocumentObjectType.Picture:
            if isinstance(child, DocPicture):
                # Add the image bytes to the list
                images.append(child.ImageBytes)
        # Check whether the child is a composite object
        elif isinstance(child, ICompositeObject):
            # Add it to the queue so its children are checked too
            nodes.put(child)

# Save the images
for i, item in enumerate(images):
    fileName = "Image-{}.png".format(i)
    with open(os.path.join(outputPath, fileName), "wb") as imageFile:
        imageFile.write(item)

document.Close()
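
The code above saves every image with a .png extension, whatever format the bytes actually are. If the document may contain other formats, one possible refinement is to pick the extension from the file signature at the start of the data. The sketch below uses only the standard library; `guess_extension` is a hypothetical helper, and it covers just a few common formats.

```python
def guess_extension(data: bytes) -> str:
    # Compare the leading bytes against well-known file signatures
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return ".gif"
    if data.startswith(b"BM"):
        return ".bmp"
    # Fall back to a generic extension for unrecognised data
    return ".bin"


print(guess_extension(b"\xff\xd8\xff\xe0 example bytes"))  # .jpg
```

In the saving loop, the file name could then be built as `"Image-{}{}".format(i, guess_extension(item))`.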

Figure: Extracted images

The extracted text is saved with evaluation information attached. You can simply delete the evaluation information at the beginning of the text.
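
If you would rather strip that notice in code, a sketch like the following could be adapted. It assumes the notice occupies exactly the first line of the extracted text, which you should verify against your own output. `drop_leading_lines` is a hypothetical helper, and the sample string is a placeholder rather than the real notice.

```python
def drop_leading_lines(text: str, count: int = 1) -> str:
    # Split while keeping line endings so the remaining text is unchanged
    lines = text.splitlines(keepends=True)
    return "".join(lines[count:])


sample = "Evaluation notice (placeholder)\nFirst real paragraph.\nSecond paragraph.\n"
print(drop_leading_lines(sample))
```

Applied to the earlier example, this would be called on the string returned by GetText() before it is written to the text file.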

This is an introduction to using Spire.Doc for Python to extract text and images from Word documents. Spire.Doc for Python supports many other document operations. Check out the official website or join the Spire.Doc forum.
