How to use Python to read the content of doc and docx documents under Ubuntu
This article describes how to use Python to read the content of doc and docx documents under Ubuntu. It is shared here as a reference for anyone who needs to do the same.
Read the docx document
The package used is python-docx.
1. Install the python-docx package
sudo pip install python-docx
2. Use the python-docx package to read data
# encoding: utf8
import docx
doc = docx.Document('test.docx')
docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
# print(docText)
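The snippet above collects only the body paragraphs; text stored inside tables is not part of doc.paragraphs. A minimal sketch of one way to include table text as well, assuming python-docx is installed. The function names read_docx_text and rows_to_lines are hypothetical helpers introduced here for illustration, and the output lists all paragraphs first and all tables afterwards rather than following the original document order.

```python
def rows_to_lines(rows):
    """Join each row's cell texts with tabs, skipping completely empty rows."""
    lines = []
    for cells in rows:
        cleaned = [cell.strip() for cell in cells]
        if any(cleaned):
            lines.append('\t'.join(cleaned))
    return lines


def read_docx_text(path):
    """Return paragraph text followed by table text from a .docx file."""
    # Imported inside the function so that the helper above can be used
    # even on a machine where python-docx is not installed.
    import docx

    doc = docx.Document(path)
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        # Each table row becomes one tab-separated line of cell text.
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        lines.extend(rows_to_lines(rows))
    return '\n'.join(lines)
```

Calling read_docx_text('test.docx') would then return a single string that can be printed or processed further.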
The python-docx package cannot process doc documents. To read the contents of a doc document, you need to use the antiword tool.
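When a folder contains both kinds of file, the choice of tool can be made from the file extension. The following is only a sketch: reader_for is a hypothetical helper name, and it trusts the extension instead of inspecting the file content.

```python
import os


def reader_for(path):
    """Return the name of the tool that can read this file, by extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.docx':
        return 'python-docx'
    if ext == '.doc':
        return 'antiword'
    # Anything else is outside the scope of the two tools discussed here.
    raise ValueError('unsupported file type: ' + repr(ext))
```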
Read the doc document
1. Download the antiword source package from the antiword website.
2. After downloading, extract the archive, then run the make and make install commands, in that order, in the extracted folder.
3. Use antiword to read the content of the doc document
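# encoding: utf8
import subprocess
word = 'test.doc'
# antiword writes the extracted text to standard output; check_output returns it as bytes
output = subprocess.check_output(['antiword', word])
# Decode the bytes before printing (assumes a UTF-8 locale)
print(output.decode('utf-8'))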
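If antiword is missing from the PATH, the subprocess call fails with an error that is easy to misread. A minimal sketch of a wrapper that checks for the tool first and returns decoded text. The name read_doc_text and its tool parameter are hypothetical, and the decoding assumes a UTF-8 locale.

```python
import shutil
import subprocess


def read_doc_text(path, tool='antiword'):
    """Return the text that antiword extracts from a .doc file."""
    # Fail early with a clear message if the executable cannot be found.
    if shutil.which(tool) is None:
        raise FileNotFoundError(tool + ' was not found on PATH')
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run([tool, path], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, check=True)
    # Assumption: the locale is UTF-8, so the output is UTF-8 text.
    return result.stdout.decode('utf-8', errors='replace')
```

Calling read_doc_text('test.doc') would then return the document text as a string instead of raw bytes.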