
php - How to search PDF content?

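A client wants site-wide keyword search, and the contents of PDF documents must be searchable too. My current approach is to convert each PDF to text, write the text to the database, and then search that database field. If a PDF does not contain actual text, it cannot be converted and therefore cannot be searched. Is there a better solution?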

大家讲道理 · 2771 days ago · 363 views

All replies (2)

  • 怪我咯, 2017-04-11 09:44:09

    Um, how about using tags? Why is there even a site-wide PDF search feature? Following this thread.

  • 迷茫, 2017-04-11 09:44:09

    # Python 3: download a PDF and convert it to text
    # Requires: requests, pdfminer.six
    from io import BytesIO, StringIO

    import requests
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def pdf_txt(url):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # collects the extracted text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # PDF data is binary, so wrap the downloaded bytes in BytesIO
        fp = BytesIO(requests.get(url).content)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages
        for page in PDFPage.get_pages(fp,
                                      pagenos,
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        text = retstr.getvalue()
        retstr.close()
        return text


    txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
    print(txt)
    # If the PDF contains Chinese and the console output is garbled,
    # write the text to a file instead:
    # open('pdf.txt', 'w', encoding='utf-8').write(txt)
    '''
    CHAPTER I
    "Well, Prince, so Genoa and Lucca are now just family estates of
    theBuonapartes. But I warn you, if you don't tell me that this
    means war,if you still try to defend the infamies and horrors
    perpetrated bythat Antichrist- I really believe he is Antichrist- I will
    havenothing more to do with you and you are no longer my friend,
    no longermy 'faithful slave,' as you call yourself! But how do you
    do? I seeI have frightened you- sit down and tell me all the news."
    It was in July, 1805, and the speaker was the well-known
    AnnaPavlovna Scherer, maid of honor and favorite of the
    Empress MaryaFedorovna. With these words she greeted Prince
    Vasili Kuragin, a manof high rank and importance, who was the
    first to arrive at herreception. Anna Pavlovna had had a cough for
    some days. She was, asshe said, suffering from la grippe; grippe
    being then a new word inSt. Petersburg, used only by the elite.
    All her invitations without exception, written in French,
    anddelivered by a scarlet-liveried footman that morning, ran as
    ''' 
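
To tie this back to the question (convert to text, store it in a database, then search it), here is a minimal sketch of the storage and search step. It assumes SQLite through Python's standard `sqlite3` module instead of whatever database the site really uses. The table name `pdf_text` and the helpers `create_index`, `index_document` and `search` are hypothetical names made up for this illustration. A PDF whose extracted text is empty is skipped, since it is probably image-only and would need OCR before it could be indexed.

```python
import sqlite3


def create_index(conn):
    # Hypothetical schema: one row per PDF, holding its extracted text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_text ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def index_document(conn, url, text):
    """Store extracted text; return False if there was nothing to index."""
    if not text.strip():
        # Probably a scanned/image-only PDF: nothing searchable without OCR.
        return False
    conn.execute(
        "INSERT OR REPLACE INTO pdf_text (url, body) VALUES (?, ?)",
        (url, text),
    )
    conn.commit()
    return True


def search(conn, keyword):
    """Return the URLs of documents whose text contains the keyword."""
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (keyword.replace("\\", "\\\\")
                      .replace("%", "\\%")
                      .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT url FROM pdf_text WHERE body LIKE ? ESCAPE '\\' ORDER BY url",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_index(conn)
index_document(conn, "chapter1.pdf", "It was in July, 1805 ...")
index_document(conn, "scan.pdf", "")  # skipped: no extractable text
print(search(conn, "July"))
```

In practice the text passed to `index_document` would be the return value of `pdf_txt(url)`. A plain `LIKE` scan is fine for a small number of documents; for a larger site a full-text index would probably be a better fit.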
