
How to use Python to read the content of doc and docx documents on Ubuntu

不言 (Original)
2018-05-08 14:18:12

This article shows how to read the content of doc and docx documents with Python on Ubuntu. The python-docx package handles docx files, and the antiword command-line tool handles doc files.

Reading docx documents

The package used is python-docx

1. Install the python-docx package

sudo pip install python-docx

2. Use the python-docx package to read data

# encoding: utf8
import docx

# Open the document and join the text of every paragraph with newlines
doc = docx.Document('test.docx')
docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
print(docText)
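
Note that doc.paragraphs only covers body paragraphs, so text inside tables is not included. The following is a minimal sketch of one way to collect table text as well. The helper names document_text and read_docx are illustrative and not part of python-docx; the sketch only relies on the paragraphs and tables attributes of a Document and on the rows, cells and text attributes of its tables.

```python
def document_text(document):
    """Return paragraph text followed by table text, one item per line.

    `document` is expected to behave like a python-docx Document: it needs
    a `paragraphs` list and a `tables` list. The helper name is illustrative.
    """
    lines = [paragraph.text for paragraph in document.paragraphs]
    for table in document.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay readable
            lines.append('\t'.join(cell.text for cell in row.cells))
    return '\n'.join(lines)


def read_docx(path):
    """Open a .docx file with python-docx and return its text."""
    import docx  # imported here so document_text() works without the package
    return document_text(docx.Document(path))
```

Calling read_docx('test.docx') would return the paragraph text first and the table rows after it, so the result does not follow the original document order. If the order matters, a different traversal is needed.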

The python-docx package cannot process doc documents, because it only understands the XML-based docx format. To read the contents of a doc document, you can use the antiword tool instead.
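
When a script has to handle both kinds of files, the choice between the two tools can be made from the file extension. This is only a sketch: the function name choose_backend and the returned labels are made up for illustration, and a file extension is not a guarantee of the real file format.

```python
import os


def choose_backend(path):
    """Return the name of the tool this article uses for `path`."""
    extension = os.path.splitext(path)[1].lower()
    if extension == '.docx':
        return 'python-docx'
    if extension == '.doc':
        return 'antiword'
    raise ValueError('unsupported file type: %r' % extension)
```

For example, choose_backend('report.DOCX') selects python-docx, while choose_backend('old/letter.doc') selects antiword.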

Reading doc documents

1. Download the antiword source archive from its website.

2. After downloading, extract the archive, then run make followed by make install in the extracted folder.

3. Use antiword to read the content of the doc document

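# encoding: utf8
import subprocess

word = 'test.doc'
# antiword writes the extracted text to standard output
output = subprocess.check_output(['antiword', word])
# On Python 3, check_output returns bytes; UTF-8 output is assumed here
print(output.decode('utf-8', errors='replace'))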
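
The call above fails with a fairly cryptic error if antiword is not installed or cannot read the file. A slightly more defensive version might look like the sketch below, written for Python 3.5 or later. The function name read_doc and its tool parameter are illustrative, and decoding the output as UTF-8 is an assumption, since the actual encoding depends on how antiword is configured on your system.

```python
import shutil
import subprocess


def read_doc(path, tool='antiword'):
    """Return the text of a .doc file by running an external converter.

    `read_doc` and the `tool` parameter are illustrative names.
    """
    # Fail early with a clear message if the program is not on PATH
    if shutil.which(tool) is None:
        raise RuntimeError(tool + ' was not found on PATH; install it first')

    result = subprocess.run(
        [tool, path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if result.returncode != 0:
        message = result.stderr.decode('utf-8', errors='replace').strip()
        raise RuntimeError('%s failed on %s: %s' % (tool, path, message))

    # Assumption: the tool writes UTF-8 text to standard output
    return result.stdout.decode('utf-8', errors='replace')
```

With antiword installed, print(read_doc('test.doc')) prints the extracted text. Passing the file name as a list element rather than building a shell command string avoids problems with spaces in file names.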

