
Python methods for determining web page encoding

高洛峰
2017-02-25 13:35:13

In web development we often need to crawl and analyze web pages, and many languages can do this. I like to use Python, because it provides many mature modules that make web crawling easy.

However, you will run into encoding problems while crawling. Today we will look at how to determine the encoding of a web page.
Web pages on the Internet do not all use the same encoding; common ones include GBK, GB2312, and UTF-8.
After fetching a page, we must first determine its encoding. Only then can we convert the captured content to a single encoding that we can handle, which avoids garbled text.
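
The conversion step described above can be sketched as follows, assuming the page's encoding is already known. The helper name `to_utf8` and the sample bytes are illustrative and do not come from the original article; the code is written for Python 3.

```python
def to_utf8(raw, source_encoding):
    """Decode raw page bytes with the given encoding and re-encode as UTF-8.

    errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
    """
    text = raw.decode(source_encoding, errors='replace')
    return text.encode('utf-8')


# Sample bytes standing in for a downloaded GBK page (illustrative data).
gbk_bytes = '网页编码'.encode('gbk')
utf8_bytes = to_utf8(gbk_bytes, 'gbk')
print(utf8_bytes.decode('utf-8'))
```

Decoding with the wrong `source_encoding` is what produces garbled text, which is why the encoding has to be determined first.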

The following introduces two methods for determining a web page's encoding.

Summary: the second method is the more accurate of the two. When determining a page's encoding, it is best to have a Python module analyze the content itself. Reading the charset declared in the meta or header information is less accurate, since that declaration can be missing or wrong.

Method 1: Use the getparam method of the urllib module

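import urllib.request
# author: pythontab.com
# Python 3 form of the original Python 2 call urllib.urlopen(...).info().getparam('charset')
# info() returns the response headers; get_param reads the charset of Content-Type
fopen1 = urllib.request.urlopen('http://www.baidu.com').info()
print(fopen1.get_param('charset'))  # baidu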
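
Method 1 only reads the charset parameter of the `Content-Type` response header. The offline sketch below shows that parsing step with the standard library's `email.message.Message`, the class that Python 3's HTTP header object is built on. The function name `charset_from_content_type` and the sample header values are illustrative assumptions.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns None if it is absent.
    return msg.get_content_charset()


# A header that declares a charset, and one that does not.
print(charset_from_content_type('text/html; charset=GBK'))
print(charset_from_content_type('text/html'))
```

The second call yields `None`: when a server omits the charset, this method has nothing to report, which is one reason it is less reliable than analyzing the content.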

Method 2: Use the chardet module

# If chardet is not installed in your Python, install it first (for example: pip install chardet)
# author: pythontab.com
import chardet
import urllib.request
# First fetch the raw bytes of the web page
data1 = urllib.request.urlopen('http://www.baidu.com').read()
# Let chardet analyze the content
chardit1 = chardet.detect(data1)

print(chardit1['encoding'])  # baidu
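
One possible way to combine the two methods is sketched below: trust content detection when it is confident, otherwise fall back to the declared charset, and finally to UTF-8. The function name `choose_encoding`, the `detect` parameter, and the 0.5 confidence threshold are assumptions made for illustration. The sketch relies only on `chardet.detect` returning a dict with `encoding` and `confidence` keys.

```python
def choose_encoding(raw, declared=None, detect=None, min_confidence=0.5):
    """Pick an encoding: content detection first, then declared charset, then UTF-8.

    detect is any callable with chardet.detect's return shape; if omitted,
    chardet is used when it is installed. min_confidence is an arbitrary cutoff.
    """
    if detect is None:
        try:
            import chardet  # third-party module, may not be installed
            detect = chardet.detect
        except ImportError:
            detect = None
    if detect is not None:
        result = detect(raw)
        confidence = result.get('confidence') or 0
        if result.get('encoding') and confidence >= min_confidence:
            return result['encoding']
    # Detection unavailable or unsure: use the header charset, else UTF-8.
    return declared or 'utf-8'


# Stand-in detector so the example runs without chardet or network access.
def fake_detect(raw):
    return {'encoding': 'GB2312', 'confidence': 0.99}


print(choose_encoding(b'sample bytes', declared='gbk', detect=fake_detect))
```

In real use, `raw` would be the bytes returned by `read()` and `declared` the value obtained with Method 1.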
