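I have used pypdf. It handles English PDF documents easily, but its support for Chinese does not seem to be very good.
I have also used textract. According to its documentation it supports many formats and is simple to use, but it keeps raising errors.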
import textract
import pyPdf
import pdf2text
import pdfminer
import chardet
text = textract.process("F:ll.pdf",method = 'pdfminer')
print text
import textract
import pyPdf
import pdfminer
import chardet
text = textract.process("F:ll.pdf",method = 'pdfminer')
print text
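The second version drops the pdf2text import, but the error it produces seems to be different.
I have not looked into the pdfminer library yet; it looks a bit more complicated. How can I parse and extract text from Chinese PDFs? Thanks.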
PHPz 2017-04-18 10:26:58
I have used pdfminer before. Install it with pip install pdfminer
# -*- coding: utf-8 -*-
# Python 2 code. On Python 3 (pdfminer.six), use io.StringIO for retstr
# and io.BytesIO for fp instead of cStringIO.StringIO.
import requests
from cStringIO import StringIO
from io import open
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def pdf_txt(url):
    """Download the PDF at url and return its text as UTF-8 bytes."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    f = requests.get(url).content
    fp = StringIO(f)  # wrap the downloaded bytes in a file-like object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
print txt
# If the PDF contains Chinese, write the output to a file instead
# open('pdf.txt', 'wb').write(txt)
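For the Chinese case mentioned in the last comment, here is a minimal sketch of writing the text out as UTF-8 instead of printing it to a console that may use a different encoding. save_text is a hypothetical helper name, not part of pdfminer. It assumes the extracted text arrives either as UTF-8 bytes (as on Python 2 with codec='utf-8') or as a text string (as on Python 3).

```python
# -*- coding: utf-8 -*-
import io


def save_text(text, path, encoding="utf-8"):
    """Write extracted text to path in the given encoding.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    # Assumption: byte input is already encoded with `encoding`.
    # On Python 2, bytes is an alias of str, so this also covers py2 strings.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # io.open takes an encoding argument on both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as out:
        out.write(text)
    return path
```

Calling save_text(txt, 'pdf.txt') would take the place of the commented-out open(...).write(txt) line, and the resulting file can then be opened in any editor that reads UTF-8.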
python readpdf.py
'''
CHAPTER I
"Well, Prince, so Genoa and Lucca are now just family estates of
theBuonapartes. But I warn you, if you don't tell me that this
means war,if you still try to defend the infamies and horrors
perpetrated bythat Antichrist- I really believe he is Antichrist- I will
havenothing more to do with you and you are no longer my friend,
no longermy 'faithful slave,' as you call yourself! But how do you
do? I seeI have frightened you- sit down and tell me all the news."
It was in July, 1805, and the speaker was the well-known
AnnaPavlovna Scherer, maid of honor and favorite of the
Empress MaryaFedorovna. With these words she greeted Prince
Vasili Kuragin, a manof high rank and importance, who was the
first to arrive at herreception. Anna Pavlovna had had a cough for
some days. She was, asshe said, suffering from la grippe; grippe
being then a new word inSt. Petersburg, used only by the elite.
All her invitations without exception, written in French,
anddelivered by a scarlet-liveried footman that morning, ran as
'''