
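Text processing - How can I extract text from a PDF with a Python library?

I have used pypdf. It handles English PDF documents fairly easily, but its support for Chinese does not seem very good.

I have also used textract. According to its documentation it supports many formats and is simple to use, but it keeps raising errors.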

# -*- coding: utf-8 -*-

import textract
import pyPdf
import pdf2text
import pdfminer
import chardet

text = textract.process("F:ll.pdf",method = 'pdfminer')
print text

This one fails with an encoding error.

# -*- coding: utf-8 -*-

import textract
import pyPdf
import pdfminer
import chardet

text = textract.process("F:ll.pdf",method = 'pdfminer')
print text

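I cannot tell what kind of error this one raises.

The only difference is that the second version does not import the pdf2text library, yet the error seems to be different.

I have not looked into the pdfminer library yet; it looks somewhat more complicated. Could someone explain how to parse and extract text from a Chinese PDF? Thanks.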

Asked by 怪我咯, 2740 days ago, 549 views

All replies (1)

  • PHPz, 2017-04-18 10:26:58

    I have used pdfminer for this before. Install it with: pip install pdfminer

    # -*- coding: utf-8 -*-
    # Python 2 code (cStringIO, print statement)
    import requests
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    # Python 3: use io.StringIO for the output buffer and io.BytesIO for the PDF bytes
    from io import open
    from pdfminer.pdfpage import PDFPage
    def pdf_txt(url):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        f = requests.get(url).content
        fp = StringIO(f)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(fp,
                                      pagenos,
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        text = retstr.getvalue()  # avoid shadowing the built-in str
        retstr.close()
        return text
    txt = pdf_txt('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
    print txt
    # If the PDF contains Chinese, write the output to a file instead
    # open('pdf.txt', 'wb').write(txt)
    
    python readpdf.py
    '''
    CHAPTER I
    "Well, Prince, so Genoa and Lucca are now just family estates of
    theBuonapartes. But I warn you, if you don't tell me that this
    means war,if you still try to defend the infamies and horrors
    perpetrated bythat Antichrist- I really believe he is Antichrist- I will
    havenothing more to do with you and you are no longer my friend,
    no longermy 'faithful slave,' as you call yourself! But how do you
    do? I seeI have frightened you- sit down and tell me all the news."
    It was in July, 1805, and the speaker was the well-known
    AnnaPavlovna Scherer, maid of honor and favorite of the
    Empress MaryaFedorovna. With these words she greeted Prince
    Vasili Kuragin, a manof high rank and importance, who was the
    first to arrive at herreception. Anna Pavlovna had had a cough for
    some days. She was, asshe said, suffering from la grippe; grippe
    being then a new word inSt. Petersburg, used only by the elite.
    All her invitations without exception, written in French,
    anddelivered by a scarlet-liveried footman that morning, ran as
    '''   
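
On Python 3, roughly the same extraction can be sketched with the pdfminer.six fork (pip install pdfminer.six) and its high-level extract_text helper. This swaps in a different package from the Python 2 pdfminer used in the reply above, so treat it as a minimal sketch rather than a drop-in replacement. The function names extract_pdf_text, save_text and pdf_to_txt are made up for illustration. The text is written to a UTF-8 file instead of being printed, so Chinese output does not depend on the console encoding.

```python
import io


def extract_pdf_text(pdf_bytes):
    """Return the text of a PDF given its raw bytes (requires pdfminer.six)."""
    # Imported lazily so save_text below stays usable without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(io.BytesIO(pdf_bytes))


def save_text(text, path):
    """Write text as UTF-8, independent of the console or locale encoding."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


def pdf_to_txt(pdf_path, txt_path):
    """Hypothetical convenience wrapper: read a PDF file, save its text as UTF-8."""
    with open(pdf_path, "rb") as fp:
        save_text(extract_pdf_text(fp.read()), txt_path)
```

A call such as pdf_to_txt("sample.pdf", "pdf.txt") would then produce a text file that can be opened in any UTF-8 aware editor; both file names are placeholders. Whether the Chinese text comes out correctly still depends on how the fonts are embedded in the particular PDF.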
