
Extracting Word and PDF text content with node.js on Linux: a worked example

Author: 黄舟 (original post), 2017-06-18 09:11:01

This article shows how to extract text from Word (doc/docx) and PDF files on a Linux system using node.js, with sample commands you can use for reference.

Foreword

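To build a full-text search engine, you need to extract the content of documents such as Word and PDF files. For PDF there are ready-made open source solutions such as xpdf.

Word documents are somewhat more complicated.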

Extracting PDF text content

Installation on Debian Linux is straightforward:

apt-get install xpdf

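Only the pdftotext command is needed here. Run it without arguments to see its help (the banner shows that on this system pdftotext is the Poppler build):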

root@raspberrypi:/var/www# pdftotext
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
 -f <int>   : first page to convert
 -l <int>   : last page to convert
 -r <fp>   : resolution, in DPI (default is 72)
 -x <int>   : x-coordinate of the crop area top left corner
 -y <int>   : y-coordinate of the crop area top left corner
 -W <int>   : width of crop area in pixels (default is 0)
 -H <int>   : height of crop area in pixels (default is 0)
 -layout   : maintain original physical layout
 -fixed <fp>  : assume fixed-pitch (or tabular) text
 -raw    : keep strings in content stream order
 -htmlmeta   : generate a simple HTML file, including the meta information
 -enc <string>  : output text encoding name
 -listenc   : list available encodings
 -eol <string>  : output end-of-line convention (unix, dos, or mac)
 -nopgbrk   : don't insert page breaks between pages
 -bbox    : output bounding box for each word and page size to html. Sets -htmlmeta
 -opw <string>  : owner password (for encrypted files)
 -upw <string>  : user password (for encrypted files)
 -q    : don't print any messages or errors
 -v    : print copyright and version info
 -h    : print usage information
 -help    : print usage information
 --help   : print usage information
 -?    : print usage information

Try it out:

root@raspberrypi:/var/www# pdftotext onceai.pdf onceai.txt
root@raspberrypi:/var/www# cat onceai.txt
产品介绍 顽石智能科技（上海）有限公司
....

You can then call this command directly from node.js using child_process. Note that pdftotext writes its result to a text file, so some extra handling may be needed to read it back. The specific code is omitted here.
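Since the code is omitted, here is a minimal sketch of the child_process call described above. It is not the author's code: the helper names buildPdftotextArgs and extractPdfText are made up for illustration, and it assumes pdftotext is on the PATH. Rather than writing onceai.txt and reading it back, the sketch passes "-" as the text file, which pdftotext treats as standard output.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: build the pdftotext argument list.
// Only flags listed in the help output above are used.
function buildPdftotextArgs(pdfPath, opts = {}) {
  const args = [];
  if (opts.layout) args.push('-layout');                      // keep physical layout
  if (opts.firstPage) args.push('-f', String(opts.firstPage)); // first page to convert
  if (opts.lastPage) args.push('-l', String(opts.lastPage));   // last page to convert
  if (opts.encoding) args.push('-enc', opts.encoding);         // output text encoding
  // "-" as the text file sends the extracted text to stdout.
  args.push(pdfPath, '-');
  return args;
}

// Hypothetical helper: run pdftotext and resolve with the extracted text.
function extractPdfText(pdfPath, opts = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'pdftotext',
      buildPdftotextArgs(pdfPath, opts),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

A call such as extractPdfText('onceai.pdf').then(console.log) would print the text. execFile is used instead of exec so that the file name is passed as an argument and never interpreted by a shell.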

Extracting .doc content with antiword

For the older Word 2003 (.doc) format, the open source tool antiword can extract the content. Installation is just as simple:

apt-get install antiword

View the help:

root@raspberrypi:/var/www# antiword
 Name: antiword
 Purpose: Display MS-Word files
 Author: (C) 1998-2005 Adri van Os
 Version: 0.37 (21 Oct 2005)
 Status: GNU General Public License
 Usage: antiword [switches] wordfile1 [wordfile2 ...]
 Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
  -f formatted text output
  -t text output (default)
  -a <paper size name> Adobe PDF output
  -p <paper size name> PostScript output
   paper size like: a4, letter or legal
  -x <dtd> XML output
   like: db (DocBook)
  -m <mapping> character mapping file
  -w <width> in characters of text output
  -i <level> image level (PostScript only)
  -L use landscape mode (PostScript only)
  -r Show removed text
  -s Show hidden (by Word) text

antiword prints the content of the Word document directly to the console:

root@raspberrypi:/var/www# antiword spec.doc

SYNC Mobile – Ford APA
Project Number: DFYST
Requirements Specification

As with pdftotext, call this command from node.js using child_process.
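A minimal sketch of that call, under the assumption that antiword is on the PATH. The names buildAntiwordArgs and extractDocText are hypothetical, and only the -w and -s switches from the help output above are used. Because antiword already writes to the console, its stdout can be captured directly.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: build the antiword argument list.
function buildAntiwordArgs(docPath, opts = {}) {
  const args = [];
  if (opts.width !== undefined) args.push('-w', String(opts.width)); // text width in characters
  if (opts.showHidden) args.push('-s');                              // include hidden text
  args.push(docPath);
  return args;
}

// Hypothetical helper: run antiword and resolve with the plain text.
function extractDocText(docPath, opts = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'antiword',
      buildAntiwordArgs(docPath, opts),
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

For example, extractDocText('spec.doc').then(console.log) would print the same text as the console session above.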

.docx

A docx document is itself a zip file, so you can unzip it from node.js first and then parse out the text. Only the word/document.xml file inside the archive is needed.
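The article gives no code for this step, so the following is only a sketch under stated assumptions. Node's standard library has no zip archive reader, so instead of unzipping inside node.js this version shells out to the unzip command (assumed to be installed; its -p option writes an archive member to stdout). The function names are hypothetical, and the regex-based parsing is a simplification that collects the text of w:t elements and treats each w:p paragraph as one line; a real XML parser would be more robust.

```javascript
const { execFile } = require('child_process');

// Decode the five predefined XML entities (&amp; last to avoid double decoding).
function decodeEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&');
}

// Hypothetical helper: turn the contents of word/document.xml into plain text.
function documentXmlToText(xml) {
  return xml
    .split(/<\/w:p>/) // one chunk per paragraph
    .map((para) => {
      // Match <w:t> or <w:t attr="...">, but not <w:tab/>, <w:tbl>, etc.
      const runs = para.match(/<w:t(?:\s[^>]*)?>[^<]*<\/w:t>/g) || [];
      return runs.map((r) => decodeEntities(r.replace(/<[^>]+>/g, ''))).join('');
    })
    .join('\n')
    .trim();
}

// Hypothetical helper: read word/document.xml out of the archive and convert it.
function extractDocxText(docxPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'unzip',
      ['-p', docxPath, 'word/document.xml'],
      { maxBuffer: 64 * 1024 * 1024 }, // allow large documents
      (err, stdout) => (err ? reject(err) : resolve(documentXmlToText(stdout)))
    );
  });
}
```

Keeping the XML-to-text step in its own function means it can be reused unchanged if the archive is read with a zip library instead of the unzip command.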

