Rumah > Soal Jawab > teks badan
客户要求做全站的关键字搜索,包括PDF文档内容也要能搜到,目前的解决办法是将PDF转换成文本,写入数据库,然后搜索数据库字段。如果PDF不是文本内容,无法转换肯定无法搜索,是否有更好的解决方案?
迷茫2017-04-11 09:44:09
#python convert pdf to text
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
#from io import StringIO for python3
from io import open
from pdfminer.pdfpage import PDFPage
def pdf_txt(url):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
f = requests.get(url).content
fp = StringIO(f)
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
txt=pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
print txt
#如果pdf含有中文,命令行输出乱码,可以输出到文件
#open('pdf.txt','wb').write(txt)
'''
CHAPTER I
"Well, Prince, so Genoa and Lucca are now just family estates of
theBuonapartes. But I warn you, if you don't tell me that this
means war,if you still try to defend the infamies and horrors
perpetrated bythat Antichrist- I really believe he is Antichrist- I will
havenothing more to do with you and you are no longer my friend,
no longermy 'faithful slave,' as you call yourself! But how do you
do? I seeI have frightened you- sit down and tell me all the news."
It was in July, 1805, and the speaker was the well-known
AnnaPavlovna Scherer, maid of honor and favorite of the
Empress MaryaFedorovna. With these words she greeted Prince
Vasili Kuragin, a manof high rank and importance, who was the
first to arrive at herreception. Anna Pavlovna had had a cough for
some days. She was, asshe said, suffering from la grippe; grippe
being then a new word inSt. Petersburg, used only by the elite.
All her invitations without exception, written in French,
anddelivered by a scarlet-liveried footman that morning, ran as
'''