Extracting Text and Images from Word Documents with Python


王林

Aug 28, 2023 pm 06:21 PM


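Once content has been extracted from a Word document, it can be put to other uses: stored in a database, imported into another program, used as training data for AI models, or reused in other documents. With Spire.Doc for Python, text and images can be extracted from Word documents without large-scale copy-and-paste or complex code. This article shows how to extract and save the text and image content of a Word document with a few lines of code.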

Installing Spire.Doc for Python

Before you can work with Word documents using this library, it has to be added to your project. You can download it from the official Spire.Doc for Python website or install it directly with pip:

pip install Spire.Doc
pip install plum-dispatch==1.7.4
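
To confirm that the installation succeeded before running the examples, a quick check along the following lines may help. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and `spire.doc` is the module name taken from the import statements used later in this article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate the named module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example "spire") is itself missing.
        return False


if __name__ == "__main__":
    print("spire.doc available:", is_installed("spire.doc"))
```

If the check prints `False`, rerunning the pip commands above in the same Python environment is the first thing to try.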

Sample document

[Figure: the sample Word document]

Extract text from a Word document and write it to a TXT file

The Document.GetText() method of Spire.Doc for Python retrieves all the text in a Word document and returns it as a string. The returned string can then be written to a text file. The steps are as follows:

Create a Document object.
Load the Word document with the Document.LoadFromFile() method.
Get the text from the document with the Document.GetText() method.
Write the returned string to a text file.

Code example (Python)
from spire.doc import *
from spire.doc.common import *

def WriteAllText(fname: str, text: str):
    # Write the extracted text to a UTF-8 encoded text file
    with open(fname, "w", encoding="utf-8") as fp:
        fp.write(text)

inputFile = "Beispiel.docx"
outputFile = "Extrahierter Text.txt"

# Create a Document object
document = Document()

# Load the Word document
document.LoadFromFile(inputFile)

# Get the text from the document as a single string
text = document.GetText()

# Write the text to a text file
WriteAllText(outputFile, text)
document.Close()
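
The steps above can also be wrapped in a reusable function. The following is a sketch, not the library's prescribed approach. The function names `write_text` and `extract_text_to_file` are hypothetical, UTF-8 is assumed as the output encoding, and the only Spire.Doc calls used are the ones already shown in this article (`Document()`, `LoadFromFile()`, `GetText()`, `Close()`).

```python
from pathlib import Path


def write_text(path: str, text: str) -> int:
    """Write text to path as UTF-8 and return the number of characters written."""
    target = Path(path)
    # Create the parent folder if needed so the write does not fail on a missing directory.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_text(text, encoding="utf-8")


def extract_text_to_file(docx_path: str, txt_path: str) -> int:
    """Load a Word document with Spire.Doc and save its text to txt_path."""
    # Imported here so that write_text stays usable without Spire.Doc installed.
    from spire.doc import Document

    document = Document()
    try:
        document.LoadFromFile(docx_path)
        return write_text(txt_path, document.GetText())
    finally:
        # Release the document even if loading or writing fails.
        document.Close()
```

Closing the document in a `finally` block means it is released even when the input file cannot be loaded.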

Extracted text

[Figure: the extracted text]

Extract images from a Word document

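Extracting images is slightly more involved: you have to walk through the document's elements and check whether their child objects contain images. The steps are as follows:

Create a Document object.
Load the Word document with the Document.LoadFromFile() method.
Put the document into a queue and traverse its child objects.
For each child whose DocumentObjectType is Picture, collect its ImageBytes; add other composite objects to the queue.
Write each collected image to a file.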

Code example (Python)
import os
import queue
from spire.doc import *
from spire.doc.common import *

outputPath = "Bilder/"
inputFile = "Beispiel.docx"

# Create the output folder if it does not exist yet
os.makedirs(outputPath, exist_ok=True)

# Create a Document object
document = Document()

# Load the Word document
document.LoadFromFile(inputFile)

# Create a queue and add the document as the first element to visit
nodes = queue.Queue()
nodes.put(document)

# List that collects the raw bytes of every image found
images = []

# Traverse the document elements breadth-first
while not nodes.empty():
    node = nodes.get()
    for i in range(node.ChildObjects.Count):
        # Get the child object of the current document element
        child = node.ChildObjects.get_Item(i)
        # Check whether the child is a picture
        if child.DocumentObjectType == DocumentObjectType.Picture:
            # Only read ImageBytes from genuine DocPicture instances
            if isinstance(child, DocPicture):
                images.append(child.ImageBytes)
        # Check whether the child is a composite object that may contain pictures
        elif isinstance(child, ICompositeObject):
            # Add it to the queue so its children are visited later
            nodes.put(child)

# Save the images
for i, item in enumerate(images):
    fileName = "Bild-{}.png".format(i)
    with open(os.path.join(outputPath, fileName), "wb") as imageFile:
        imageFile.write(item)

document.Close()
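
The example above saves every image with a .png extension. Assuming that `ImageBytes` holds the image data in its original embedded format (an assumption this article does not confirm), the extension could instead be chosen from the file signature. The helper name `guess_image_extension` is hypothetical, and only a few common formats are covered in this sketch.

```python
def guess_image_extension(data: bytes, default: str = ".bin") -> str:
    """Guess a file extension from the leading signature bytes of an image."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, extension in signatures:
        if data.startswith(magic):
            return extension
    # Unknown signature: fall back to the caller-supplied default.
    return default
```

In the save loop, the file name could then be built as `"Bild-{}{}".format(i, guess_image_extension(item))` so that, for example, JPEG data is not written under a .png name.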

Extracted images

[Figure: the extracted images]

This article showed how to extract text and images from Word documents with Spire.Doc for Python. Spire.Doc for Python supports many other document operations as well; see the official website or the Spire.Doc forum for more information.
