
Use python2 and python3 to disguise browsers and crawl web content

高洛峰 (Original)
2016-10-18 13:55

Python's web-scraping capabilities are powerful: with urllib (Python 3) or urllib2 (Python 2) you can easily fetch web page content. But keep in mind that many websites have anti-scraping measures, so getting the content you want is not always straightforward.

Today I will share how, in both Python 2 and Python 3, to disguise the request as a browser so it gets past such blocking.

The most basic fetch (this example uses Python 3):

#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab
import urllib.request
url = "http://www.pythontab.com"
html = urllib.request.urlopen(url).read()
print(html)

But some websites cannot be fetched this way: they have anti-scraping measures in place, so we need a different approach.
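To see why the basic approach gets blocked, look at what urllib sends by default. A minimal check (Python 3, standard library only):

```python
import urllib.request

# By default, urllib identifies itself with a User-Agent of
# "Python-urllib/<version>", which anti-scraping filters block on sight.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```

Replacing that default User-Agent with a browser-like one is the core of the technique below.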

Python 2 (latest stable release at the time of writing: 2.7):

#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab.com
import urllib2
url="http://pythontab.com"
req_header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
             'Accept':'text/html;q=0.9,*/*;q=0.8',
             'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
             'Accept-Encoding':'gzip',
             'Connection':'close',
             'Referer':None # note: if the page still cannot be fetched, try setting this to the target site's host
             }
req_timeout = 5
req = urllib2.Request(url,None,req_header)
resp = urllib2.urlopen(req,None,req_timeout)
html = resp.read()
print(html)

Python 3 (latest stable release at the time of writing: 3.3):

#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab
import urllib.request
  
url = "http://www.pythontab.com"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
             'Accept':'text/html;q=0.9,*/*;q=0.8',
             'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
             'Accept-Encoding':'gzip',
             'Connection':'close',
             'Referer':None # note: if the page still cannot be fetched, try setting this to the target site's host
             }
  
opener = urllib.request.build_opener()
# addheaders expects a list of (name, value) tuples, not a dict;
# also skip the None-valued Referer, which would raise a TypeError when sent
opener.addheaders = [(k, v) for k, v in headers.items() if v is not None]
data = opener.open(url).read()
print(data)
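A note on the headers above: because we advertise `Accept-Encoding: gzip`, a server that honors it will send compressed bytes, and `read()` does not decode them for us. As an alternative sketch (Python 3, standard library only; the `fetch` helper name is my own), we can pass the headers via a `Request` object and gunzip the body ourselves:

```python
import gzip
import urllib.request

# Browser-like headers, as in the listing above
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 '
                  '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip',
}

def fetch(url):
    """Fetch url with browser-like headers; gunzip the body if needed."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = resp.read()
        # The server may honor Accept-Encoding: gzip and compress the body;
        # read() returns the raw bytes, so decompress explicitly.
        if resp.headers.get('Content-Encoding') == 'gzip':
            body = gzip.decompress(body)
    return body
```

Then `fetch("http://www.pythontab.com")` returns the decompressed HTML bytes. If you do not want to handle decompression at all, simply omit the `Accept-Encoding` header.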

