
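Encoding - 'utf-8' codec can't decode byte: invalid start byte in Python 3.6?

In Python 3.6, parsing the data from a web page fails. I have tried many encodings, and the page's own encoding is also utf-8.
Error message: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Also, the first print shows the content in the terminal in the form [b'\x1f\x8b\x08\x00\x...]. I have seen this before: when a single print outputs two variables I get b'\x..., but if I print them on two separate lines it turns back into Chinese characters. Why is that?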

# -*- coding: utf-8 -*-
import json , sqlite3
import urllib.request

url = ('http://wthrcdn.etouch.cn/weather_mini?city=%E4%B8%8A%E6%B5%B7')
resp = urllib.request.urlopen(url)
content = resp.read()

print(content)
print(type(content))
print(content.decode('utf-8'))
PHP中文网 2741 days ago

All replies (4)

  • 阿神

    阿神 2017-04-18 10:27:17

    After looking at the site, the data it returns is gzip-compressed, so it has to be decompressed before it can be decoded as UTF-8.

    # coding=utf-8
    from io import BytesIO
    import gzip
    import urllib.request
    
    url = ('http://wthrcdn.etouch.cn/weather_mini?city=%E4%B8%8A%E6%B5%B7')
    resp = urllib.request.urlopen(url)
    content = resp.read() # content is the compressed data
    
    buff = BytesIO(content) # wrap content in a file-like object
    f = gzip.GzipFile(fileobj=buff)
    res = f.read().decode('utf-8')
    print(res)
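
The error in the question lines up with this diagnosis: a gzip stream starts with the bytes 0x1f 0x8b, and 0x8b at position 1 is not a valid UTF-8 start byte. Below is a minimal sketch that reproduces this offline. The sample payload and the helper name `looks_like_gzip` are illustrative assumptions, not part of the original post.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


# Hypothetical payload standing in for the server's response body.
compressed = gzip.compress('{"city": "上海"}'.encode("utf-8"))

error = None
try:
    # Decoding compressed bytes directly fails, as in the question.
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

print(looks_like_gzip(compressed))
```

If `looks_like_gzip` returns True for a response body, decompress it first and decode the result afterwards.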
    

  • 伊谢尔伦

    伊谢尔伦 2017-04-18 10:27:17

    Isn’t requests easy to use?

  • 伊谢尔伦

    伊谢尔伦 2017-04-18 10:27:17

    I recommend using requests, which decompresses gzip responses automatically. The code is as follows:

    import requests
    
    r = requests.get('http://wthrcdn.etouch.cn/weather_mini?city=%E4%B8%8A%E6%B5%B7')
    print(r.text)

  • 阿神

    阿神 2017-04-18 10:27:17

    This is not a character-encoding problem. Look at the response headers for your request:

    
    
        Status Code: 200 OK
        Access-Control-Allow-Headers: *
        Access-Control-Allow-Methods: *
        Access-Control-Allow-Origin: *
        Cache-Control: must-revalidate, max-age=300
        Connection: Keep-Alive
        Content-Encoding: gzip
        Content-Length: 443
        Date: Fri, 10 Mar 2017 03:20:46 GMT
        Fw-Cache-Status: hit
        Fw-Via: HTTP MISS from 58.59.19.99, DISK HIT from 183.131.161.27
        Server: Tengine/2.1.2
    
    

    The body is gzip-compressed. If you use the standard library, you have to decompress it yourself.
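
One way to act on the `Content-Encoding` header with only the standard library is sketched below. The helper names `decode_body` and `fetch_text` are hypothetical, and the sketch assumes the body is UTF-8 text once decompressed. It only handles gzip, not other encodings such as deflate.

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str],
                charset: str = "utf-8") -> str:
    """Decompress the body if the header says gzip, then decode it to text."""
    if content_encoding and content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url: str) -> str:
    """Fetch a URL and return its body as text, honoring Content-Encoding."""
    with urllib.request.urlopen(url) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Calling `fetch_text(url)` with the URL from the question should then return the decoded text, provided the server still responds as shown in the headers above.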
