Detailed explanation and examples of python urllib2

高洛峰
2016-10-18 09:10:05

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies and proxies. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many URL schemes (identified by the string before the ":" in the URL; for example, "ftp" is the scheme of "ftp://python.org/") using their associated network protocols (such as FTP and HTTP). This tutorial focuses on the most common case, HTTP.

For simple applications, urlopen is very easy to use. But as soon as you encounter errors or exceptions when opening HTTP URLs, you will need some understanding of the Hypertext Transfer Protocol (HTTP).

The most authoritative reference for HTTP is RFC 2616 (http://rfc.net/rfc2616.html). It is a technical document and not easy to read. The purpose of this HOWTO is to show how to use urllib2, with enough detail about HTTP to help you through. It does not replace the urllib2 documentation, but supplements it.

Getting URLs

The simplest way to use urllib2 is as follows:

import urllib2 
response = urllib2.urlopen('http://python.org/') 
html = response.read()

Many uses of urllib2 are that simple (note that instead of an "http:" URL we could have used a URL starting with "ftp:", "file:" and so on). However, this article concentrates on the more complex cases, using HTTP.

HTTP is based on requests and responses: the client makes a request and the server sends a response. urllib2 mirrors this with a Request object that represents the HTTP request you are making. In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. The response is a file-like object, so you can, for example, call .read() on it.

import urllib2 
req = urllib2.Request('http://www.pythontab.com') 
response = urllib2.urlopen(req) 
the_page = response.read()

Note that urllib2 uses the same Request interface to handle all URL schemes. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')

In the case of HTTP, Request objects allow you to do two additional things. First, you can pass data (such as form data) to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server. This information is sent as HTTP "headers".

Let’s see how these are sent.

Data

Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or another web application). With HTTP, this is often done using a POST request. This is what your browser usually does when you submit an HTML form.

Not all POSTs have to come from forms: you can use a POST to submit arbitrary data to your own application. For ordinary HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from the urllib module, not from urllib2.

import urllib 
import urllib2 
url = 'http://www.pythontab.com' 
values = {'name' : 'Michael Foord', 
          'location' : 'pythontab', 
          'language' : 'Python' } 
data = urllib.urlencode(values) 
req = urllib2.Request(url, data) 
response = urllib2.urlopen(req) 
the_page = response.read()
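As a minimal sketch of the point above, the following builds the same request without sending it and checks which HTTP method urllib2 would use. The try/except import is an assumption added so that the snippet also runs on Python 3, where urllib2 was split into urllib.request and urllib.parse; encoding the body to ASCII bytes is likewise only needed there.

```python
# Python 2 names first; the fallback covers Python 3, where the same
# functions live in urllib.parse and urllib.request.
try:
    from urllib import urlencode
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode
    from urllib.request import Request

url = 'http://www.pythontab.com'
values = {'name': 'Michael Foord',
          'location': 'pythontab',
          'language': 'Python'}

# urlencode produces an application/x-www-form-urlencoded string;
# encoding it to bytes keeps the body usable on Python 3 as well.
data = urlencode(values).encode('ascii')

post_request = Request(url, data)  # data supplied -> POST
get_request = Request(url)         # no data       -> GET

print(post_request.get_method())
print(get_request.get_method())
```

Nothing is sent over the network here; passing either object to urlopen would perform the actual request.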

Note that other encodings are sometimes required (for example, for file uploads from HTML forms; see the Form Submission section of the HTML Specification, http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13, for details).

If you do not pass the data argument, urllib2 uses a GET request. One difference between GET and POST requests is that POST requests usually have "side effects": they change the state of the system in some way (for example, by placing an order that ends with a pile of junk being delivered to your door).

Although the HTTP standard makes it clear that POSTs usually have side effects and GET requests do not, nothing prevents a GET request from having side effects, nor a POST request from having none. Data can also be passed in a GET request by encoding it in the URL itself.

See the example below:

>>> import urllib2 
>>> import urllib 
>>> data = {} 
>>> data['name'] = 'Somebody Here' 
>>> data['location'] = 'pythontab' 
>>> data['language'] = 'Python' 
>>> url_values = urllib.urlencode(data) 
>>> print url_values  # the order of the parameters may differ
name=Somebody+Here&language=Python&location=pythontab
>>> url = 'http://www.pythontab.com'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
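To check what a GET URL built this way actually carries, the query string can be decoded again. This is a small sketch, assuming the standard parse_qs and urlsplit helpers (urlparse module on Python 2, urllib.parse on Python 3); no request is sent.

```python
try:
    from urllib import urlencode
    from urlparse import urlsplit, parse_qs
except ImportError:
    from urllib.parse import urlencode, urlsplit, parse_qs

params = {'name': 'Somebody Here',
          'location': 'pythontab',
          'language': 'Python'}

url = 'http://www.pythontab.com'
full_url = url + '?' + urlencode(params)

# Split off the part after "?" and decode it back into a dictionary.
# parse_qs maps every key to a list, because a key may appear more than once.
query = urlsplit(full_url).query
decoded = parse_qs(query)

print(full_url)
print(decoded['name'])
```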

Headers

We will discuss specific HTTP headers here to illustrate how to add headers to your HTTP request.

Some websites do not like being accessed by programs, or send different versions of their content to different browsers. By default urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.5), which may confuse the site or simply not work. A browser identifies itself through the User-Agent header. When you create a Request object, you can pass it a dictionary of headers. The example below makes the same request as above, but identifies itself as a version of Internet Explorer.

import urllib 
import urllib2 
url = 'http://www.pythontab.com' 
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' 
  
values = {'name' : 'Michael Foord', 
          'location' : 'pythontab', 
          'language' : 'Python' } 
headers = { 'User-Agent' : user_agent } 
data = urllib.urlencode(values) 
req = urllib2.Request(url, data, headers) 
response = urllib2.urlopen(req) 
the_page = response.read()
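The headers attached to a Request can be inspected before anything is sent. The sketch below assumes nothing beyond the Request class itself; the extra Accept-Language header is only an illustration and is not part of the example above. Note that Request stores header names in capitalised form ("User-agent"), which is why the lookup uses that spelling.

```python
try:
    from urllib2 import Request
except ImportError:
    from urllib.request import Request

url = 'http://www.pythontab.com'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

req = Request(url, headers={'User-Agent': user_agent})

# Headers can also be added after the Request has been created.
req.add_header('Accept-Language', 'en')

# Header names are normalised with str.capitalize(), e.g. "User-agent".
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```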

The response object also has two useful methods, info and geturl. They are covered in the section "info and geturl", after we have looked at what happens when an error occurs.

Handling Exceptions

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).

HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.

URLError

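URLError is usually raised when there is no network connection (no route to the specified server) or when the server does not exist. In this case, the exception has a "reason" attribute, which is a tuple containing an error code and an error message.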

For example:

>>> req = urllib2.Request('http://www.pythontab.com')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
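To see a URLError without depending on an outside network, one option is to point the request at a local port where nothing is listening. The following is only a sketch under that assumption: the helper names unused_local_port and failure_reason are made up for this illustration, the opener is built with an empty ProxyHandler so that system proxy settings are ignored, and the exact form of the reason (a tuple or an exception object) depends on the Python version and platform.

```python
import socket

try:
    from urllib2 import build_opener, ProxyHandler, URLError
except ImportError:
    from urllib.request import build_opener, ProxyHandler
    from urllib.error import URLError


def unused_local_port():
    """Ask the OS for a free port, then release it so nothing listens there."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))
    port = sock.getsockname()[1]
    sock.close()
    return port


def failure_reason(url):
    """Return the URLError reason for url, or None if the fetch succeeded."""
    opener = build_opener(ProxyHandler({}))  # ignore any system proxy settings
    try:
        opener.open(url, timeout=5).close()
    except URLError as e:
        return e.reason
    return None


reason = failure_reason('http://127.0.0.1:%d/' % unused_local_port())
print(reason)
```

The "except URLError as e" spelling is used here instead of "except URLError, e" because it works on Python 2.6+ and Python 3 alike.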

HTTPError

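Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 handles that for you). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden) and "401" (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised has an integer "code" attribute, which corresponds to the error code sent by the server.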

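Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The original tutorial reproduces the dictionary here for convenience; the translator has omitted it.

When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that, as well as the code attribute, it also has the read, geturl and info methods.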

>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.HTTPError, e:
...     print e.code
...     print e.read()
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
    type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
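The idea that an HTTPError doubles as a response object can be tried without a server by constructing one by hand. This is a sketch, not how urlopen is normally used: the HTTPError below is a hypothetical stand-in for what urlopen would raise, and the error page and header are invented for the illustration.

```python
import io

try:
    from urllib2 import HTTPError, URLError
except ImportError:
    from urllib.error import HTTPError, URLError

# Hypothetical stand-in for the exception urlopen raises on a missing page:
# HTTPError(url, code, msg, headers, file-like body).
error_page = b'<html><head><title>Error 404: File Not Found</title></head></html>'
err = HTTPError('http://www.python.org/fish.html', 404, 'Not Found',
                {'Content-type': 'text/html'}, io.BytesIO(error_page))

try:
    raise err
except URLError as e:      # HTTPError is a subclass, so this clause catches it
    status = e.code        # integer status code
    body = e.read()        # the error page, read like a normal response
    page_url = e.geturl()  # URL the error belongs to

print(status)
print(page_url)
```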

   

Wrapping it Up

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second.

The first:

from urllib2 import Request, urlopen, URLError, HTTPError
# someurl is a placeholder for the URL you want to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass

   

Note: the except HTTPError clause must come first, otherwise except URLError will also catch an HTTPError.

The second:

from urllib2 import Request, urlopen, URLError
# someurl is a placeholder for the URL you want to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass
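Why the order of the except clauses matters in the first approach can be shown with hand-made exceptions instead of real network failures. In this sketch the describe helper and both exception instances are assumptions made up for the illustration; because HTTPError is a subclass of URLError, the more specific clause has to be listed first to be reached at all.

```python
import io

try:
    from urllib2 import HTTPError, URLError
except ImportError:
    from urllib.error import HTTPError, URLError


def describe(error):
    """Raise error and report which except clause handled it."""
    try:
        raise error
    except HTTPError as e:  # more specific class first
        return 'The server could not fulfil the request. Error code: %d' % e.code
    except URLError as e:   # would also match HTTPError if it came first
        return 'We failed to reach a server. Reason: %s' % (e.reason,)


# Hypothetical failures standing in for what urlopen would raise.
http_error = HTTPError('http://example.com/missing', 404, 'Not Found',
                       {}, io.BytesIO(b''))
url_error = URLError('getaddrinfo failed')

print(describe(http_error))
print(describe(url_error))
```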

info and geturl

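The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl().

geturl -- returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may differ from the URL requested.

info -- returns a dictionary-like object that describes the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance.

Typical headers include "Content-length", "Content-type" and so on. See the Quick Reference to HTTP Headers (http://www.cs.tut.fi/~jkorpela/http.html) for a useful list of HTTP headers with explanations of their meaning.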
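Both methods can be tried without a web server by fetching a "file:" URL, which urlopen also supports. This is only a sketch under that assumption: the temporary file and its contents are invented for the example, no redirect is involved, and for local files the headers are synthesised by the library rather than sent by a server.

```python
import os
import tempfile

try:
    from urllib import pathname2url
    from urllib2 import urlopen
except ImportError:
    from urllib.request import pathname2url, urlopen

payload = b'hello urllib2'

# Write a small temporary file and address it with a file: URL.
handle, path = tempfile.mkstemp(suffix='.txt')
os.write(handle, payload)
os.close(handle)
file_url = 'file:' + pathname2url(path)

response = urlopen(file_url)
try:
    final_url = response.geturl()                       # URL actually fetched
    content_length = response.info()['Content-length']  # header lookup by name
    body = response.read()
finally:
    response.close()
    os.remove(path)

print(final_url)
print(content_length)
```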

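Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers. Openers use handlers, and all the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular protocol, or how to handle one aspect of opening a URL, such as HTTP redirects or HTTP cookies.

You will want to create an opener if you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not follow redirects.

To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, a convenience function that creates an opener object with a single function call.

build_opener adds several handlers by default, but provides a quick way to add more handlers or override the default ones.

Other handlers you might want can deal with proxies, authentication and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function, so there is no need to call install_opener except as a convenience.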
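As a minimal sketch of a custom opener, the following uses build_opener to add a cookie handler on top of the defaults and lists the handlers the opener ends up with. The choice of HTTPCookieProcessor is just one example; the import fallback is an assumption so that the snippet also runs on Python 3.

```python
try:
    from urllib2 import build_opener, HTTPCookieProcessor
    from cookielib import CookieJar
except ImportError:
    from urllib.request import build_opener, HTTPCookieProcessor
    from http.cookiejar import CookieJar

cookie_jar = CookieJar()

# build_opener keeps its default handlers and adds the one passed in.
opener = build_opener(HTTPCookieProcessor(cookie_jar))

handler_names = sorted(h.__class__.__name__ for h in opener.handlers)
for name in handler_names:
    print(name)

# opener.open(some_url) would now fetch a URL and store any cookies in
# cookie_jar; install_opener is only needed if urlopen should use it too.
```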

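Basic Authentication

To illustrate creating and installing a handler we will use HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how basic authentication works, see the Basic Authentication Tutorial (http://www.voidspace.org.uk/python/articles/authentication.shtml).

When authentication is required, the server sends a header (together with the 401 error code) requesting authentication. This specifies the authentication scheme and a "realm". The header looks like this: Www-authenticate: SCHEME realm="REALM".

For example:

Www-authenticate: Basic realm="cPanel Users"

The client should then retry the request with the correct name and password included as a header in the request. This is "basic authentication". To simplify this process we can create an instance of HTTPBasicAuthHandler and an opener that uses this handler.

HTTPBasicAuthHandler uses an object called a password manager to map URLs and realms to usernames and passwords. If you know what the realm is (from the authentication header sent by the server), you can use an HTTPPasswordMgr.

Often one does not care what the realm is. In that case it is convenient to use HTTPPasswordMgrWithDefaultRealm. This lets you specify a default username and password for a URL, which is supplied unless you provide a different combination for a specific realm. We indicate this by passing None as the realm argument to add_password.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.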

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# add the username and password
# if we knew the realm, we could use it instead of ``None``
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# create the "opener" (an OpenerDirector instance)
opener = urllib2.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# install the opener
# now all calls to urllib2.urlopen use our opener
urllib2.install_opener(opener)
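How the password manager matches "deeper" URLs can be checked directly, without contacting a server, through its find_user_password method. This is a sketch only: the username "joe", the password "secret" and the realm string are made-up values for illustration.

```python
try:
    from urllib2 import (HTTPPasswordMgrWithDefaultRealm,
                         HTTPBasicAuthHandler, build_opener)
except ImportError:
    from urllib.request import (HTTPPasswordMgrWithDefaultRealm,
                                HTTPBasicAuthHandler, build_opener)

password_mgr = HTTPPasswordMgrWithDefaultRealm()

# 'joe' / 'secret' are made-up credentials; None means "any realm".
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL below the registered one matches, whatever realm the server names.
deeper = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/foo/bar/page.html')

# A URL outside the registered path does not match.
outside = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/other/')

print(deeper)
print(outside)

# The manager is then handed to the handler, and the handler to the opener.
opener = build_opener(HTTPBasicAuthHandler(password_mgr))
```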

   

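Note: in the example above we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler and HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the "http:" scheme component, the hostname and optionally the port number), for example "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), for example "example.com" or "example.com:8080" (the latter includes a port number). The authority, if present, must not contain the "userinfo" component; for example, "joe:password@example.com" is not correct.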

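Proxies

urllib2 will auto-detect your proxy settings and use them. This is done through the ProxyHandler, which is part of the normal handler chain. Normally that works well, but there are occasions when it does not help. One way to deal with this is to install our own ProxyHandler with no proxies defined. This is done in much the same way as setting up the Basic Authentication handler.

>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)

Note:

Currently urllib2 does not support fetching https locations through a proxy. However, this can be enabled by extending urllib2.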

Sockets and Layers

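The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications that have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set a default timeout globally for all sockets.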
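A minimal sketch of setting such a global timeout, using the standard socket.setdefaulttimeout function. The value of 10 seconds is an arbitrary choice for the example, and no request is made here.

```python
import socket

# Remember the old setting (None means "no timeout") in case it has to be
# restored later with socket.setdefaulttimeout(previous_timeout).
previous_timeout = socket.getdefaulttimeout()

# Timeout in seconds, applied to every socket created from now on,
# including the ones opened on behalf of urlopen.
timeout = 10
socket.setdefaulttimeout(timeout)

print(socket.getdefaulttimeout())
```

From Python 2.6 onwards urlopen also accepts a timeout argument, which affects only that one call.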

