python - Encoding problems when parsing Chinese web pages with BeautifulSoup

Question

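For the same page and nearly identical code, parsing works correctly under Python 3 on Windows 8. After porting the code to Ubuntu with Python 2.7, however, the fetched page can no longer be parsed by BeautifulSoup, and find_all('table') returns an empty result.
Part of the problematic code (it runs):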

阿神 · Answer

Have you tried changing the parser?
The HTML parser built into Python 2.7 tolerates malformed markup very poorly.
lxml is recommended.
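
A minimal sketch of what switching parsers could look like. The helper name make_soup, the fallback order, and the sample markup are illustrative assumptions, not part of the original question; lxml and html5lib are separate installs, so the sketch falls back to the built-in html.parser when they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper for illustration only.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable HTML parser is installed")


# Hypothetical sample markup; \u4e2d\u6587 is the text "中文".
html = u"<table><tr><td>\u4e2d\u6587</td></tr></table>"
soup, used = make_soup(html)
tables = soup.find_all("table")
print(used)
print(len(tables))
```

If find_all('table') comes back empty with one parser but not with another, the markup is probably malformed in a way that the first parser cannot recover from.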

大家讲道理 · Answer

Well, this is mainly an encoding issue. If you don't understand how Python handles encodings, it is a big pitfall.
Looking at the code, these lines seem to have problems:

1. mybytes = fp.read().decode('gbk').encode('utf-8')
2. soup = BeautifulSoup(mybytes,from_coding="uft-8")
3. print soup.original_encoding
4. print soup.prettify()

Taking them in order:

1. No encoding conversion is required. bs accepts input in any encoding, though unicode is best. So if you convert at all, stop at decode() and skip the encode('utf-8') step.
2. The constructor is used as BeautifulSoup(html, 'html5lib'): the second argument is the parser, not the encoding.
3. Printing soup directly gives you the result. Whether Chinese text displays correctly depends mainly on the encoding. The encoding conversion in bs is not that robust, so calls that return plain text can run into problems too.
4. soup.prettify('utf-8') ensures that the output encoding is correct.
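
The points above can be sketched as follows. This is a hedged illustration, not the asker's actual code: the raw bytes are a made-up stand-in for fp.read(), and html.parser is used only because it ships with Python. Passing from_encoding as a keyword is the bs4 way to hint the encoding when handing over raw bytes instead of decoded text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for fp.read(): a small GBK-encoded page.
# \u4e2d\u6587 is the text "中文".
raw = (u"<html><body><table><tr><td>\u4e2d\u6587</td></tr>"
       u"</table></body></html>").encode("gbk")

# Point 1: decode to unicode and stop there; no encode('utf-8') afterwards.
text = raw.decode("gbk")

# Point 2: the second argument names the parser, not the encoding.
soup = BeautifulSoup(text, "html.parser")
tables = soup.find_all("table")
cell = soup.find("td").get_text()

# Alternative: pass the raw bytes and give the encoding as a keyword hint.
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
cell_from_bytes = soup_from_bytes.find("td").get_text()

# Point 4: prettify with an explicit encoding returns UTF-8 encoded bytes.
out = soup.prettify("utf-8")
```

Here out is a byte string, so it can be written to a file opened in binary mode without depending on the terminal's or the platform's default encoding.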
