Garbled text when Python fetches and saves HTML pages

高洛峰
2017-03-01 13:25:22

When you fetch HTML pages with Python and save them, the saved content often comes out garbled. There are two common causes: either the encoding settings in your own code are wrong, or the settings are correct but the page's actual encoding does not match the encoding it declares. The declared encoding is the charset value given in the page's meta tag and in the Content-Type response header.
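As a rough illustration of where a declared encoding can be read from, the sketch below pulls the charset out of a Content-Type header value and out of a meta tag. It is written for Python 3 with the standard library only (the article's own listing is Python 2), and the function names charset_from_header and charset_from_meta are hypothetical, not part of any library.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                           re.IGNORECASE)


def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    # Assumption: the declaration sits within the first 4096 bytes.
    match = _META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode('ascii').lower() if match else None
```

For example, charset_from_header('text/html; charset=GB2312') gives 'gb2312', while a page with no meta declaration makes charset_from_meta return None. A regular expression is only a quick approximation here; a real HTML parser handles more edge cases.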

Here is a simple solution: use chardet to detect the page's real encoding, and read the declared encoding from the info (the response headers) returned by the URL request. If the two encodings differ, use BeautifulSoup (the bs4 module) to decode the content as GB18030, a superset of GB2312 and GBK; if they are the same, write the content to the file directly (the system default encoding is set to utf-8 here).
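The decision rule can be sketched on its own, without any network access. The following is a minimal Python 3 sketch using only the standard library; to_utf8 is a hypothetical helper, the detected and declared names are assumed to be supplied by the caller (for instance from chardet and from the response headers), and it decodes with the built-in gb18030 codec instead of going through BeautifulSoup.

```python
def to_utf8(content, detected, declared):
    """Return page bytes ready to be saved, following the rule described above.

    content:  raw bytes of the page
    detected: encoding name reported by a detector such as chardet, or None
    declared: encoding name taken from the response headers, or None
    """
    if detected and declared and detected.lower() != declared.lower():
        # The two names disagree. Like the article, assume a Chinese legacy
        # encoding and decode as GB18030, a superset of GB2312 and GBK.
        text = content.decode('gb18030', errors='replace')
        return text.encode('utf-8')
    # Either name is unknown, or both agree: keep the bytes unchanged.
    return content
```

A GBK-encoded page labelled as utf-8 comes back re-encoded as UTF-8, while a page whose detected and declared encodings agree is returned untouched. Falling back to GB18030 on every mismatch is the article's assumption and only makes sense for pages that are likely to be Chinese.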

# Python 2 script: fetch a page and save it to test.html.
import sys
import urllib2

import bs4
import chardet

# Python 2 only: make utf-8 the default for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')


def download(url):
  try:
    result = urllib2.urlopen(url)
    content = result.read()
    info = result.info()
    result.close()
  except Exception as e:
    print 'download error!!!'
    print e
    return
  charset1 = chardet.detect(content)['encoding']  # real (detected) encoding
  charset2 = info.getparam('charset')  # declared encoding from the headers
  print charset1, ' ', charset2
  with open('test.html', 'w') as htmlfile:
    # case 1: both encodings are known and they differ.
    if charset1 is not None and charset2 is not None and charset1.lower() != charset2.lower():
      newcont = bs4.BeautifulSoup(content, from_encoding='GB18030')  # decode as GB18030
      for cont in newcont:
        htmlfile.write('%s\n' % cont)  # written using the utf-8 default encoding
    # case 2: either encoding is unknown, or the two are the same.
    else:
      htmlfile.write(content)  # bytes are written as received


if __name__ == "__main__":
  url = 'http://www.php.cn'
  download(url)

Opening the resulting test.html in an editor shows that it is stored as UTF-8 without a BOM, which is the default encoding we set.
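If you would rather check the saved file programmatically than in an editor, something like the following Python 3 sketch may help. The helper name describe_utf8 is hypothetical; it only reports whether the bytes decode as UTF-8 and whether they start with the UTF-8 byte order mark.

```python
import codecs


def describe_utf8(data):
    """Classify bytes as 'utf-8 with BOM', 'utf-8 without BOM', or 'not utf-8'."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return 'not utf-8'
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8 with BOM'
    return 'utf-8 without BOM'
```

To apply it to the saved page, read the file in binary mode, for example describe_utf8(open('test.html', 'rb').read()). Note that pure ASCII content also counts as valid UTF-8, so this check cannot tell the two apart.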
