[Python] Web Crawler (5): Usage details of urllib2 and website crawling techniques
I gave a simple introduction to urllib2 earlier; here are some details of how to use it.
1. Proxy settings
urllib2 will use the environment variable http_proxy to set HTTP Proxy by default.
If you want to control the Proxy explicitly in the program, without being affected by environment variables, you can use a ProxyHandler.
Create a new test14 to implement a simple proxy Demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
One detail to note here is that using urllib2.install_opener() will set the global opener of urllib2.
This is convenient for later use, but it does not allow finer-grained control, for example when you want to use two different Proxy settings in the same program.
A better approach is not to change the global settings with install_opener, but to call the opener's open method directly instead of the global urlopen method.
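As a hedged sketch of that approach (the two proxy addresses below are made-up placeholders), two openers with different Proxy settings can be used side by side, each through its own open method, while the global urlopen stays untouched:

import urllib2

proxy_handler_a = urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'})
proxy_handler_b = urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'})

opener_a = urllib2.build_opener(proxy_handler_a)
opener_b = urllib2.build_opener(proxy_handler_b)

# Each request goes through its own Proxy; no global opener is installed.
response_a = opener_a.open('http://www.baidu.com')
response_b = opener_b.open('http://www.baidu.com')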
2. Timeout setting
In old versions of Python (before Python 2.6), the urllib2 API does not expose a Timeout setting. To set a Timeout value, you can only change the global Timeout of the Socket.
import urllib2
import socket

socket.setdefaulttimeout(10)          # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do it
After Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Add a specific Header to the HTTP Request
To add a header, you need to use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to some of the headers, because the server will check them:
User-Agent: Some servers or Proxies will use this value to determine whether the request was made by a browser
Content-Type: When using a REST interface, the server checks this value to determine how the content in the HTTP Body should be parsed. Common values are:
application/xml: used when calling XML RPC, such as RESTful/SOAP
application/json: used when calling JSON RPC
application/x-www-form-urlencoded: used when the browser submits a web form
When using a RESTful or SOAP service provided by the server, a wrong Content-Type setting will cause the server to deny service.
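For example, here is a hedged sketch (the URL is a placeholder, not a real endpoint) of posting a JSON body with the Content-Type set explicitly, so the server knows how to parse the HTTP Body:

import json
import urllib2

# placeholder URL; replace with the actual JSON RPC / RESTful endpoint
postdata = json.dumps({'key': 'value'})
request = urllib2.Request('http://example.com/api', data=postdata)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()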
4. Redirect
By default, urllib2 automatically follows redirects for HTTP 3xx return codes, without manual configuration. To detect whether a redirect has occurred, just check whether the URL of the Response and the URL of the Request are consistent.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect occurred
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected
If you don’t want to redirect automatically, in addition to using the lower-level httplib library, you can also customize the HTTPRedirectHandler class.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 also handles cookies automatically. If you need to get the value of a certain Cookie item, you can do this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running, the names and values of the cookies set when visiting Baidu will be printed.
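If you only care about one Cookie item, a small sketch like the following picks it out by name (the cookie name 'BAIDUID' is just an assumption for illustration; inspect the CookieJar to see what the site actually sets):

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
for item in cookie:
    if item.name == 'BAIDUID':  # assumed cookie name, purely for illustration
        print 'Value = ' + item.value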
6. Use HTTP PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods; if you want to use HTTP PUT or DELETE, you would have to use the lower-level httplib library. Even so, we can still make urllib2 issue a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)  # uri and data are assumed to be defined already
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
7. Get the HTTP return code
For 200 OK, the HTTP return code can be obtained with the getcode() method of the response object returned by urlopen. But for other return codes, urlopen throws an exception; in that case, you need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
8. Debug Log
When using urllib2, you can turn on the debug log in the following way, so that the contents of the packets sent and received are printed on the screen. This makes debugging easier and can sometimes save you the work of capturing packets.
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
In this way, you can see the contents of the transmitted data packets.
9. Form processing
Logging in requires filling in a form. How do you fill in a form?
First, use a tool to capture the content of the form to be submitted.
For example, I usually use the Firefox + HttpFox plug-in to see what packets I have sent.
Taking verycd as an example, first find the POST request you sent and the POST form items.
You can see that for verycd you need to fill in username, password, continueURI, fk and login_submit. Among them, fk is generated randomly (actually not that random; it looks like it is produced by simply encoding the epoch time) and has to be obtained from the web page, which means you must first fetch the page and extract the fk item from the returned data with a tool such as regular expressions (a sketch of this follows the code below). As the name suggests, continueURI can be anything, while login_submit is fixed, as can be seen from the page source. Then there are username and password, which are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': '汪小光',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': '登录'
})
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin',
    data = postdata
)
result = urllib2.urlopen(req)
print result.read()
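Here is a hedged sketch of fetching the sign-in page first and extracting the fk value with a regular expression before filling it into the form. The pattern below assumes fk appears as a hidden input field with double-quoted attributes; the actual page markup may differ:

import re
import urllib2

page = urllib2.urlopen('http://secure.verycd.com/signin').read()
# assumed markup: <input ... name="fk" ... value="...">
match = re.search(r'name="fk"[^>]*value="([^"]*)"', page)
fk = match.group(1) if match else ''
print fk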
10. Disguise as a browser
Some websites dislike visits from crawlers and reject their requests.
At this time we need to pretend to be a browser, which can be done by modifying the headers in the HTTP packet:
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...
11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking settings. In fact, it is very simple to put it bluntly. ,
It is to check whether the referer site in the header you send the request is its own,
So we only need to change the referer of the headers to the Just use the website, take cnbeta as an example:
#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure; you can put in any header you want as a disguise.
For example, some websites like to read the X-Forwarded-For header to find out the client's real IP; you can change X-Forwarded-For directly.
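As a small sketch (the IP address is made up purely for illustration), X-Forwarded-For can simply be added to the same headers dict alongside the Referer:

#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '1.2.3.4'  # made-up IP, purely for illustration
}
#...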