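While writing the downloader for a Python crawler, I found that some pages (others work fine) cannot be decoded with decode('utf-8').
When I look at the page source, it does contain <meta charset=UTF-8>,
which says the page is UTF-8 encoded, so why does decoding fail?
The error for the pages that fail to decode: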
Traceback (most recent call last):
File "E:/python爬虫/test.py", line 13, in <module>
print(data.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Here is the code I use to fetch the HTML and decode it:
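import urllib.request

url = 'http://wiki.52poke.com/wiki/%E8%B7%AF%E5%8D%A1%E5%88%A9%E6%AC%A7'
req = urllib.request.Request(url)
res = urllib.request.urlopen(req)
data = res.read()  # raw response bytes, not yet decoded
print(data)
print(data.decode('utf-8'))  # raises UnicodeDecodeError for some pages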
Output (this is data for a page that fails to decode; only part of it is pasted here because it is very long):
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xbdys\x1bG\x96/\xfa\xf7\xe8S\xa0\xe1\xcbi{\xc6\xd8wJB\x07 Q\xe3~\xaf\xdd\xa3\xb1=3vx\xfa9@\xa2D\xa2\x05\x02\xb8\x00\xa8\xc5=\xfd\x02\x94Lq\'\xb5P\xd4Bj\xa1,J\xd4FR\x12-q\x15#\xde\xfd&\x16\xaa\x00\xc4\xbd\x11\xfe\n\xef\x9c\xcc\xaaBU\xa1\xb0\x14\tR\x90\x94\x9e\x1e\xb1\x90U\x95u2\xf3\xe4\xd9\xf2\xe4/\x0f\xfd\xee\xe8\xbf\x1e\xf9\xe6\xbb\xe3\x1d\xa6\x9elo<x\xe0\x10\xfe1\xc5#\x89\xee\xc3?\xf6\x98\xa2\xb1\xf4\xe1x6m\xea\x8aG2\x99\xc3]\xf1\x18\x97\xc8Z\x12\xc9\xbff\xf0A.\x12\x85?\xbd\\6b\xea\xea\x89\xa43\\\xf6\xf0\xbf\x7fs\xcc\xe2\x87\xc2l,
Output (data for a page that decodes successfully):
b'<!DOCTYPE html>\n<html lang=zh dir=ltr class=client-nojs>\n<head>\n<meta charset=UTF-8>\n<title>\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac - \xe7\xa5\x9e\xe5\xa5\x87\xe5\xae\x9d\xe8\xb4\x9d\xe7\x99\xbe\xe7\xa7\x91</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>window.RLQ = window.RLQ || []; window.RLQ.push( function () {\nmw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac","wgTitle":"\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac","wgCurRevisionId":651454,"wgRevisionId":651454,"wgArticleId":602,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["\xe6\x8b\xa5\xe6\x9c\x89\xe6\xb2\x99\xe6\xb5\x81\xe7\x89\xb9\xe6\x80\xa7\xe7\x9a\x84\xe7\xa5\x9e\xe5\xa5\x87\xe5\xae\x9d\xe8\xb4\x9d","\xe7\xa5\x9e\xe5\xa5\xa5\xe5\x9c\xb0\xe6\x96\xb9\xe5\xae\x9d\xe5\x8f\xaf\xe6\xa2\xa6","\xe5\x8d\xa1\xe6\xb4\x9b\xe6\x96\xaf\xe5\x9c\xb0\xe6\x96\xb9\xe5\xae\x9d\xe5\x8f\xaf\xe6\xa2\xa6",
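I don't understand why some responses look like <!DOCTYPE html>\n<html lang=zh dir=ltr class=client-nojs>\n<head>\n<meta charset=UTF-8>
while others come back as \xkk escape sequences.
I've spent the whole morning on this and still don't get it. My guess is that the number of bytes somehow makes some pages impossible to decode? I hope someone can clear this up.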
迷茫2017-04-17 17:29:16
The returned data is gzip-compressed, so you need to decompress it before decoding. Also, since you already know how to use BeautifulSoup when writing crawlers, why not try requests instead of urllib?
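To make the decompression step concrete, here is a minimal sketch. The failing output above starts with b'\x1f\x8b', which is the gzip magic number, and that matches the error about byte 0x8b at position 1. The helper name decode_body and the sample HTML string are made up for illustration; the sketch assumes the body is either plain or gzip-compressed and does not handle other compression schemes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress raw if it looks gzip-compressed, then decode it to text.

    decode_body is a hypothetical helper, not part of urllib.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline demonstration with a made-up page, so no network access is needed.
html = "<meta charset=UTF-8><title>怪河马</title>"
compressed = gzip.compress(html.encode("utf-8"))

print(compressed[:2])             # the same two leading bytes as the failing output
print(decode_body(compressed))    # gzip body: decompressed, then decoded
print(decode_body(html.encode("utf-8")))  # plain body: decoded directly
```

In the code from the question this would be used as print(decode_body(data)). Instead of sniffing the first two bytes you could also check res.headers.get('Content-Encoding') for 'gzip'. requests handles gzip-encoded responses transparently, which is probably why it is suggested above.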