
python - Encoding problem when parsing Chinese web pages with BeautifulSoup

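For the same page and almost identical code, parsing works correctly under Python 3 on Windows 8. After porting the code to Ubuntu with Python 2.7, however, the fetched page can no longer be parsed by BeautifulSoup, and find_all('table') returns an empty result.
Part of the problematic code (it runs):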

#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib2
from bs4 import BeautifulSoup
postdata = "T1=&T2=1&T3=&T4=&T5=&APPDate=&T7=&T8=&T9=&PRDate=&T11=&SQDate=&JDDate=&T14=&T15=&T16=&T17=&SDDate=&T19=&T20=&T21=&D1=%B8%B4%C9%F3&D2=jdr&D3=%C9%FD%D0%F2&C1=fm&C2=&C3=&page=70"
postdata = postdata.encode('utf-8')
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6','Referer':'http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp'}
req = urllib2.Request(
      url = "http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp",
      headers = headers,
      data = postdata)
fp  = urllib2.urlopen(req)
mybytes = fp.read().decode('gbk').encode('utf-8')
soup = BeautifulSoup(mybytes,from_coding="uft-8")
print soup.original_encoding
print soup.prettify()

Any pointers would be appreciated.

大家讲道理 · 2717 days ago · 344

Replies (2)

  • 阿神

    阿神 2017-04-17 14:28:52

    Have you tried changing the parser?
    The HTML parser built into Python 2.7 is not very tolerant of malformed markup.
    lxml is recommended.
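
A minimal sketch of switching parsers, assuming bs4 is installed and that lxml or html5lib may or may not be available. The helper name `parse_with_fallback` and the order of the parser list are illustrative choices, not something from the original post:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(html, parsers=("lxml", "html5lib", "html.parser")):
    """Try each parser name in turn and return (soup, parser_name).

    bs4 raises FeatureNotFound when the requested parser library
    is not installed, so we simply move on to the next candidate.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = parse_with_fallback(u"<table><tr><td>x</td></tr></table>")
tables = soup.find_all("table")
```

If `find_all('table')` is empty with one parser but not with another, the markup is probably malformed and the stricter parser is discarding part of it.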

  • 大家讲道理

    大家讲道理 2017-04-17 14:28:52

    This is mainly an encoding issue. If you don't understand how Python handles encodings, it is a big pitfall.
    Looking at these lines, several of them seem problematic:

    1. mybytes = fp.read().decode('gbk').encode('utf-8')
    2. soup = BeautifulSoup(mybytes,from_coding="uft-8")
    3. print soup.original_encoding
    4. print soup.prettify()

    Taking them in order:

    1. No re-encoding is needed. bs accepts input in any encoding, and unicode is best. So if you convert at all, stop after decode('gbk') and do not encode again.

    2. The usual way to construct a bs instance is BeautifulSoup(html, 'html5lib'). The second parameter is the parser, not the encoding. If you do want to state the source encoding, the keyword is from_encoding (not from_coding), and the value should be spelled "utf-8", not "uft-8".

    3. Just print soup and you will get the result. Whether Chinese displays correctly depends mainly on the encoding. The encoding conversion in bs is actually not that strong, so handing it raw text and relying on detection can also cause problems.

    4. soup.prettify('utf-8') ensures that the output is encoded as UTF-8.
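
The four points above can be sketched as follows. This is a minimal example, not the original script. It replaces the network call with a small in-memory GBK page (a hypothetical stand-in for `fp.read()`) and uses `'html.parser'` only because it needs no extra install:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical stand-in for fp.read(): raw bytes of a GBK-encoded page.
raw = u"<html><body><table><tr><td>复审</td></tr></table></body></html>".encode("gbk")

# Point 1: decode once to unicode and stop there (no .encode('utf-8')).
html = raw.decode("gbk")

# Point 2: the second positional argument is the parser name.
soup = BeautifulSoup(html, "html.parser")

# Alternative: pass the raw bytes and name the source encoding explicitly.
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

tables = soup.find_all("table")
cell_text = tables[0].td.get_text()

# Point 4: prettify('utf-8') returns a UTF-8 encoded byte string.
pretty_bytes = soup.prettify("utf-8")
```

Note that `soup.original_encoding` is only meaningful when bs had to decode bytes itself. For input that is already unicode it is `None`.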
