sublime-text - Python crawler encoding problem

Question

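I followed a tutorial and wrote a crawler, but all the Chinese text it scrapes comes out garbled. How should I fix this? Python code: {code...} Crawl result: (u'\u539f\u56fd\u52a1\u9662\u5b98\u5458\uff1a\u804c\u5de5\u65e9\u9000\u4f11\u53bb\u8df3\u5e7f\u573a\u821e\u662f\u6d6a\u8d39', ...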

巴扎黑 · Answer

Try printing without the tuple:

print h2, a

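This is most likely a leftover quirk of how Python 2 handles encodings.

When you print a tuple, print actually calls the tuple's __str__(), which shows each element through repr(), so unicode strings appear as \uXXXX escapes: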

>>> h = u'你好'
>>> (h, 8).__str__()
"(u'\u4f60\u597d', 8)"
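As a rough illustration of the point above, the sketch below formats each element on its own instead of relying on the tuple's string form. The helper name join_items is hypothetical and not from the thread; the code is written for Python 3 and is intended to behave the same way on 2.7.

```python
# -*- coding: utf-8 -*-

def join_items(items, sep=u', '):
    """Format every element with %s individually, then join the results.

    Hypothetical helper: it avoids the tuple's own string form, which is
    built from repr() of each element.
    """
    return sep.join(u'%s' % item for item in items)


line = join_items((u'你好', 8))

if __name__ == "__main__":
    print(line)
```

Because each element is formatted separately, the text is shown as characters rather than as escape sequences.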

巴扎黑 · Answer

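This is caused by mismatched encodings. On Windows the default encoding is usually GBK or ISO-xxx. Check which encoding the web page uses (Chrome can show it), then convert the text to the same encoding as your system and it will be fine.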
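A minimal sketch of that conversion, assuming the page is served as GBK and the target is UTF-8. Both encodings, the sample text, and the helper name transcode are assumptions made for illustration; the real page may use a different encoding.

```python
# -*- coding: utf-8 -*-

def transcode(raw_bytes, source_encoding, target_encoding):
    """Decode bytes using the page's encoding, then re-encode for the target."""
    text = raw_bytes.decode(source_encoding)  # bytes -> unicode text
    return text.encode(target_encoding)       # unicode text -> bytes


# Stand-in for bytes downloaded from a GBK-encoded page (assumed sample data).
page_bytes = u'国务院'.encode('gbk')
utf8_bytes = transcode(page_bytes, 'gbk', 'utf-8')
```

The key step is decoding with the encoding the page actually uses; decoding with the wrong one is what produces garbled text.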

高洛峰 · Answer

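Actually, printing h2 on its own does display the Chinese correctly. If you really must print a tuple the way you do, see the code below: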

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'  # decode the response body as UTF-8
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text
        a = news.select('a')[0]['href']
        # Python 2: str() of the tuple yields text with \uXXXX escapes ...
        test = str((h2, a))
        # ... and decoding with unicode-escape turns them back into characters
        print(test.decode("unicode-escape"))
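The code above relies on Python 2's str.decode. For comparison, here is a hedged Python 3 sketch of the same idea, where ascii() stands in for the escaped output Python 2 produced. The helper name decode_escapes and the sample tuple are hypothetical.

```python
# -*- coding: utf-8 -*-

def decode_escapes(escaped_text):
    # Interpret literal backslash-u sequences in an ASCII-only string.
    return escaped_text.encode('ascii').decode('unicode_escape')


pair = ('你好', 'http://example.com/')  # assumed sample data
escaped = ascii(pair)                   # escaped form, similar to Python 2 output
readable = decode_escapes(escaped)      # escapes turned back into characters
```

Note that this also interprets any other backslash sequences in the text, so it is only a display trick and not a general way to handle tuples.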

巴扎黑 · Answer

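When you run into encoding problems, it still pays to understand the history behind character encodings. Have a look at this article: http://foofish.net/python-cha... After that, you will know how to analyze encoding issues when they come up.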

大家讲道理 · Answer

Use Python 3.
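A short sketch of why Python 3 sidesteps this problem: there, repr() leaves printable non-ASCII characters as they are, so a tuple's string form stays readable. The sample values are assumptions; ascii() is used only to reproduce the escaped form for comparison.

```python
# -*- coding: utf-8 -*-
# Python 3 only.

pair = ('你好', 8)       # assumed sample data

shown = str(pair)        # readable: characters are kept as-is
escaped = ascii(pair)    # escaped form, like the output in the question

if __name__ == "__main__":
    print(shown)
    print(escaped)
```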

PHPz · Answer

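The u'' prefix shows that the value is already unicode, so the encoding itself is fine; the only problem is the way you print it. On 2.7, changing it to this should work: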

print '%s,%s'%(h2, a)

高洛峰 · Answer

After reading the value out, just convert it directly to a string.

PHP中文网 · Answer

print(h2 + a)
