Today I tried to crawl Qiushibaike. After pressing F12, I saw Accept-Encoding: gzip, deflate, sdch in the Request Headers, so I assumed the response was compressed. Later...
PHP中文网2017-04-17 15:39:41
You need to set the Accept-Encoding header in the crawler's request before the response will be compressed.
In the browser, Accept-Encoding: gzip, deflate, sdch tells the website that the browser supports these three compression methods: gzip, deflate, and sdch. In other words, this header does not indicate which compression methods the website supports, but which ones the browser supports.
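Because Accept-Encoding describes the client rather than the server, a crawler should probably list only the encodings it can actually decompress instead of copying the browser's value (Python's standard library has no sdch decoder, for instance). A minimal sketch, assuming a hypothetical helper named build_request and a made-up User-Agent string:

```python
from urllib import request

# Encodings this client can decode with the standard library (gzip and zlib).
# sdch is deliberately left out because nothing here could decompress it.
SUPPORTED_ENCODINGS = ("gzip", "deflate")


def build_request(url, user_agent="example-crawler/0.1"):
    """Build a Request that advertises only the encodings we can decode.

    build_request and the default user_agent are illustrative names,
    not part of urllib.
    """
    headers = {
        "User-Agent": user_agent,
        "Accept-Encoding": ", ".join(SUPPORTED_ENCODINGS),
    }
    return request.Request(url, headers=headers)


req = build_request("http://www.qiushibaike.com/")
# urllib stores header names capitalized as "Accept-encoding".
print(req.get_header("Accept-encoding"))
```

No network call is made here; the Request object is only constructed and inspected.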
The website chooses one of the compression methods the client listed and reports its choice in the Content-Encoding response header. The browser then selects the matching decompression method based on this value.
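A crawler has to do by hand what the browser does automatically here: read Content-Encoding and pick the matching decompressor. The following is a rough sketch of that step, assuming a hypothetical helper named decode_body; it covers only gzip and deflate and is exercised on locally compressed bytes rather than a real response.

```python
import gzip
import zlib
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress a response body according to its Content-Encoding value.

    decode_body is an illustrative helper, not a urllib function.
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        # No Content-Encoding header means the body was sent uncompressed.
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # "deflate" is normally zlib-wrapped, but some servers send a raw
        # deflate stream, so fall back to raw mode if the header check fails.
        try:
            return zlib.decompress(body)
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)
    raise ValueError("unsupported Content-Encoding: " + encoding)


# Local round trip, no network needed.
original = "糗事百科".encode("utf-8")
print(decode_body(gzip.compress(original), "gzip") == original)
print(decode_body(original, None) == original)
```

With a real response, the call would look like decode_body(res.read(), res.info().get('Content-Encoding')).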
Qiushibaike supports gzip, but if Accept-Encoding is not set in the request, the response is not compressed.
#!/usr/bin/env python3
from urllib import request

USER_AGENT = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36'

# Advertise gzip support so the server is allowed to compress the response.
req = request.Request(r'http://www.qiushibaike.com/', headers={'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip'})
res = request.urlopen(req)

# Print the compression method the server actually chose.
print(res.info().get('Content-Encoding'))
The output of the above script is