
Web crawler - How do I determine whether a page crawled with Python is compressed?

Today I tried to crawl Qiushibaike. After pressing F12 I saw Accept-Encoding: gzip, deflate, sdch in the request headers, so I assumed the page was compressed. Then I ran:

import urllib.request

response = urllib.request.urlopen(req)  # req: the Request object built for the page
print(response.info().get('Content-Encoding'))

It returned None. How exactly can I determine whether the response is compressed?

黄舟 · 2800 days ago · 677 views

All replies (1)

  • PHP中文网 · 2017-04-17 15:39:41

    You need to set the Accept-Encoding header in your request before the server will compress the response.

    In the browser, Accept-Encoding: gzip, deflate, sdch tells the website that the browser supports these three compression methods: gzip, deflate, and sdch. In other words, this header does not indicate the compression methods the website supports, but the ones the browser supports.

    The website will pick one of the compression methods it supports to encode the response, and that method becomes the value of Content-Encoding. The browser then selects the corresponding decompression method based on this value.
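
    As a minimal sketch of that last step, the browser-side selection might look like the function below (decode_body is just an illustrative name, and sdch is left out since it needs a shared dictionary):

    #!/usr/bin/env python3
    import gzip
    import zlib

    def decode_body(body, content_encoding):
        """Mimic the browser: pick a decompression method
        based on the Content-Encoding response header."""
        if content_encoding == 'gzip':
            return gzip.decompress(body)
        if content_encoding == 'deflate':
            # Some servers send raw deflate (no zlib header);
            # -zlib.MAX_WBITS tells zlib to expect that.
            return zlib.decompress(body, -zlib.MAX_WBITS)
        return body  # None or 'identity': the body is not compressed

    # Example: a gzip-compressed payload round-trips correctly.
    print(decode_body(gzip.compress(b'hello'), 'gzip'))  # b'hello'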

    Qiushibaike supports gzip, but if Accept-Encoding is not set, the response will not be compressed.

    #!/usr/bin/env python3
    from urllib import request
    
    USER_AGENT = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36'
    
    req = request.Request(r'http://www.qiushibaike.com/', headers={'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip'})
    res = request.urlopen(req)
    
    print(res.info().get('Content-Encoding'))
    

    The output of the above script is

    gzip
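
    As a cross-check against what the question observed, continuing the script above with a request that omits Accept-Encoding should print None, since the server then sends the page uncompressed:

    # Same request without Accept-Encoding: the server sends the
    # body uncompressed, so Content-Encoding is absent (None).
    req_plain = request.Request(r'http://www.qiushibaike.com/',
                                headers={'User-Agent': USER_AGENT})
    print(request.urlopen(req_plain).info().get('Content-Encoding'))  # None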
    
