Extracting Word content with node.js on Linux: a detailed example

零下一度
2017-06-19 09:11:01

This article shows how to use node.js to extract text from Word (doc/docx) and PDF files on Linux, with sample commands for reference and study.

Preface

If you want to build a full-text search engine, you need to extract the content of Word, PDF, and similar documents. For PDF there are open source solutions such as xpdf.

But the situation with Word documents is more complicated.

Extract PDF text content

Xpdf is free, open source software for displaying PDF files and converting them to text, images, and other formats; a Windows version is also available. Installation on Debian Linux is very simple:


apt-get install xpdf

We only use the pdftotext tool here (on this system it is the Poppler build, as its version banner shows). Running it without arguments prints the help:


root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
 -f <int>   : first page to convert
 -l <int>   : last page to convert
 -r <fp>   : resolution, in DPI (default is 72)
 -x <int>   : x-coordinate of the crop area top left corner
 -y <int>   : y-coordinate of the crop area top left corner
 -W <int>   : width of crop area in pixels (default is 0)
 -H <int>   : height of crop area in pixels (default is 0)
 -layout   : maintain original physical layout
 -fixed <fp>  : assume fixed-pitch (or tabular) text
 -raw    : keep strings in content stream order
 -htmlmeta   : generate a simple HTML file, including the meta information
 -enc <string>  : output text encoding name
 -listenc   : list available encodings
 -eol <string>  : output end-of-line convention (unix, dos, or mac)
 -nopgbrk   : don't insert page breaks between pages
 -bbox    : output bounding box for each word and page size to html. Sets -htmlmeta
 -opw <string>  : owner password (for encrypted files)
 -upw <string>  : user password (for encrypted files)
 -q    : don't print any messages or errors
 -v    : print copyright and version info
 -h    : print usage information
 -help    : print usage information
 --help   : print usage information
 -?    : print usage information

Test it:


root@raspberrypi:/var/www# pdftotext onceai.pdf onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
产品介绍 顽石智能科技(上海)有限公司
....

Then call this command directly from node.js with child_process. pdftotext writes the content to a text file, so some further handling may be needed (for example, reading that file back). The specific code is omitted.
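As a minimal sketch of such a child_process call (not the author's omitted code), the following assumes pdftotext is on the PATH and passes `-` as the text-file argument so that the text goes to standard output instead of a file. The names pdftotextArgs and extractPdfText are hypothetical helpers introduced here for illustration, and the 64 MB buffer limit is an arbitrary choice.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: build the pdftotext argument list.
// Only flags listed in the help output above are used.
function pdftotextArgs(pdfPath, options = {}) {
  const args = [];
  if (options.firstPage) args.push('-f', String(options.firstPage));
  if (options.lastPage) args.push('-l', String(options.lastPage));
  if (options.layout) args.push('-layout');
  // '-' in place of the text file sends the extracted text to stdout.
  args.push(pdfPath, '-');
  return args;
}

// Hypothetical helper: run pdftotext and resolve with the extracted text.
function extractPdfText(pdfPath, options = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'pdftotext',
      pdftotextArgs(pdfPath, options),
      { maxBuffer: 64 * 1024 * 1024 }, // large PDFs can exceed the default buffer
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Usage (assumes onceai.pdf exists in the current directory):
// extractPdfText('onceai.pdf', { firstPage: 1, lastPage: 3 })
//   .then((text) => console.log(text))
//   .catch((err) => console.error('pdftotext failed:', err.message));
```

Using execFile rather than exec avoids passing the file name through a shell, which matters when file names come from users.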

Use antiword to extract the content of .doc

Here we use the open source tool antiword to extract the content of documents from Word 2003 and earlier (.doc). Installation is just as simple:


apt-get install antiword

View help:


root@raspberrypi:/var/www# antiword
 Name: antiword
 Purpose: Display MS-Word files
 Author: (C) 1998-2005 Adri van Os
 Version: 0.37 (21 Oct 2005)
 Status: GNU General Public License
 Usage: antiword [switches] wordfile1 [wordfile2 ...]
 Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
  -f formatted text output
  -t text output (default)
  -a <paper size name> Adobe PDF output
  -p <paper size name> PostScript output
   paper size like: a4, letter or legal
  -x <dtd> XML output
   like: db (DocBook)
  -m <mapping> character mapping file
  -w <width> in characters of text output
  -i <level> image level (PostScript only)
  -L use landscape mode (PostScript only)
  -r Show removed text
  -s Show hidden (by Word) text

antiword writes the document text directly to the console (standard output):


root@raspberrypi:/var/www# antiword spec.doc

SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification

You can also call this command using child_process in node.js.
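A minimal sketch of that call, assuming antiword is on the PATH. The helper names extractDocText and tidyText are hypothetical, and the default width of 80 characters is chosen here only for illustration. Because antiword pads its output with blank lines, the sketch also tidies the text before it is handed to an indexer.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: tidy antiword's text output before indexing.
// Normalizes line endings, strips trailing spaces and tabs,
// collapses runs of blank lines, and trims both ends.
function tidyText(raw) {
  return raw
    .replace(/\r\n/g, '\n')
    .replace(/[ \t]+$/gm, '')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Hypothetical helper: run antiword and resolve with the tidied text.
// -t selects plain text output; -w sets the line width in characters.
function extractDocText(docPath, width = 80) {
  return new Promise((resolve, reject) => {
    execFile(
      'antiword',
      ['-t', '-w', String(width), docPath],
      { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(tidyText(stdout)))
    );
  });
}

// Usage (assumes spec.doc exists in the current directory):
// extractDocText('spec.doc')
//   .then((text) => console.log(text))
//   .catch((err) => console.error('antiword failed:', err.message));
```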

Parse and extract the content of .docx

A .docx document is itself a zip archive, so node.js only needs to decompress it and then parse the text out of the word/document.xml file inside.
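The unzip-then-parse idea can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete docx parser: it assumes the unzip command-line tool is installed (its -p option writes an archive member to standard output), and it pulls text out with a regular expression over the w:t elements instead of a real XML parser, so tables, footnotes, and other structures are flattened or ignored. All function names are hypothetical.

```javascript
const { execFile } = require('child_process');

// Decode the XML character entities that can appear inside <w:t> text.
// '&amp;' is handled last so that '&amp;lt;' correctly becomes '&lt;'.
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&#(\d+);/g, (_, n) => String.fromCodePoint(Number(n)))
    .replace(/&#x([0-9a-fA-F]+);/g, (_, h) => String.fromCodePoint(parseInt(h, 16)))
    .replace(/&amp;/g, '&');
}

// Simplified conversion of word/document.xml to plain text:
// text runs (<w:t>) are concatenated, <w:tab/> becomes a tab,
// and every closing </w:p> ends a line.
function documentXmlToText(xml) {
  const tokenRe = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>|<w:tab\/>|<\/w:p>/g;
  let text = '';
  for (const match of xml.matchAll(tokenRe)) {
    if (match[1] !== undefined) text += decodeXmlEntities(match[1]);
    else if (match[0] === '<w:tab/>') text += '\t';
    else text += '\n';
  }
  return text;
}

// Assumption: the `unzip` CLI is available on the PATH.
function extractDocxText(docxPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'unzip',
      ['-p', docxPath, 'word/document.xml'],
      { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(documentXmlToText(stdout)))
    );
  });
}

// Usage (assumes a file named text.docx in the current directory):
// extractDocxText('text.docx').then((text) => console.log(text));
```

For anything beyond plain-text indexing, a proper XML parser or a dedicated docx library is the safer choice.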

There are also libraries on GitHub that parse docx into HTML, among others:

github.com/mwilliamson/mammoth.js

github.com/lalalic/docx2html
