


Before diving into the main content, let's first look at two methods in urllib2: info() and geturl().
The response object returned by urlopen (or an HTTPError instance) has two very useful methods: info() and geturl().
1.geturl():
This returns the real URL that was actually fetched. It is useful because urlopen (or the opener object it uses) may have followed redirects, so the URL actually retrieved can differ from the one requested.
Taking a short link from Renren as an example,
we create urllib2_test10.py to compare the original URL with the redirected one:
from urllib2 import Request, urlopen, URLError, HTTPError

old_url = 'http://rrurl.cn/b1UZuP'
req = Request(old_url)
response = urlopen(req)
print 'Old url :' + old_url
print 'Real url :' + response.geturl()
After running it, you can see the URL that the link really points to:
2.info():
This returns a dictionary-like object that describes the fetched page, typically the specific headers sent by the server. It is currently an instance of httplib.HTTPMessage.
Typical headers include "Content-Length", "Content-Type", and so on.
We create urllib2_test11.py to test the use of info():
from urllib2 import Request, urlopen, URLError, HTTPError

old_url = 'http://www.baidu.com'
req = Request(old_url)
response = urlopen(req)
print 'Info():'
print response.info()
The output is as follows; you can see the relevant information about the page:
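As a side note, the HTTPMessage object returned by info() also supports dictionary-style access and a getheader() method, so individual headers can be read directly. A small sketch (the header names are just illustrative):

headers = response.info()
# Dictionary-style access; raises KeyError if the header is missing.
print headers['Content-Type']
# getheader() returns None (or a supplied default) if the header is absent.
print headers.getheader('Content-Length')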
Let’s talk about two important concepts in urllib2: Openers and Handlers.
1.Openers:
When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector).
Normally we use the default opener via urlopen,
but you can also create custom openers.
2.Handlers:
Openers use handlers; all the "heavy" work is handed off to the handlers.
Each handler knows how to open URLs over a specific protocol, or how to handle various aspects of opening a URL.
For example, HTTP redirection or HTTP cookies.
You will want to create a custom opener if you want to fetch URLs with specific handlers, for example an opener that handles cookies, or one that does not follow redirects.
To create an opener, instantiate an OpenerDirector,
and then call .add_handler(some_handler_instance).
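For instance, here is a minimal sketch of assembling an opener by hand; the build_opener function described next is usually more convenient:

import urllib2

opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())          # knows how to open http: URLs
opener.add_handler(urllib2.HTTPRedirectHandler())  # follows HTTP redirects
response = opener.open('http://www.baidu.com/')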
Alternatively, you can use build_opener, a more convenient function for creating opener objects that requires only one function call.
build_opener adds several handlers by default, and provides a quick way to add more handlers or override the default ones.
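A minimal sketch of that convenience, assuming we want cookie support on top of the defaults:

import urllib2
import cookielib

cookie_jar = cookielib.CookieJar()
# build_opener keeps the default handlers and adds our cookie processor.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://www.baidu.com/')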
Other handlers you may want cover proxies, authentication, and other common but somewhat specialized situations.
install_opener is used to set a (global) default opener. This means that calls to urlopen will use the opener you installed.
The Opener object has an open method,
which can be used directly to fetch URLs just like the urlopen function; it is usually not necessary to call install_opener, except as a convenience.
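A sketch of the two usage styles, with a hypothetical local proxy address standing in for a real one:

import urllib2

# Hypothetical proxy address, for illustration only.
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087'})
opener = urllib2.build_opener(proxy_handler)

# Style 1: call the opener's open method directly.
response = opener.open('http://www.baidu.com/')

# Style 2: install it globally so plain urlopen uses it too.
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com/')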
Having covered those two concepts, let's look at basic authentication, which makes use of the Openers and Handlers described above.
Basic Authentication
To demonstrate creating and installing a handler, we will use HTTPBasicAuthHandler.
When basic authentication is required, the server responds with a 401 status code and a header requesting authentication. The header specifies the authentication scheme and a 'realm', and looks like this: WWW-Authenticate: SCHEME realm="REALM".
For example:
WWW-Authenticate: Basic realm="cPanel Users"
The client must then send a new request with the correct username and password included in the request header.
This is "basic authentication". In order to simplify this process, we can create an instance of HTTPBasicAuthHandler and let opener use this handler.
HTTPBasicAuthHandler uses a password manager object that maps URLs and realms to usernames and passwords.
If you know what the realm is (from the header sent by the server), you can use an HTTPPasswordMgr.
Usually people don't care what the realm is; in that case, the convenient HTTPPasswordMgrWithDefaultRealm can be used.
It lets you specify a default username and password for a URL, which will be supplied unless you provide a different combination for a specific realm.
We indicate this situation by specifying None for the realm parameter provided to add_password.
The URL you pass to .add_password() should be the topmost URL that requires authentication; URLs "deeper" than it will also match.
Enough theory; let's demonstrate everything above with an example.
We create urllib2_test12.py to test basic authentication:
# -*- coding: utf-8 -*-
import urllib2

# Create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password
top_level_url = "http://example.com/foo/"
# If we know the realm, we can use it instead of ``None``.
# password_mgr.add_password(None, top_level_url, username, password)
password_mgr.add_password(None, top_level_url, 'why', '1223')

# Create a new handler
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# Create the "opener" (an OpenerDirector instance)
opener = urllib2.build_opener(handler)

a_url = 'http://www.baidu.com/'

# Use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen will use our opener.
urllib2.install_opener(opener)
Note: in the example above, we only supplied our HTTPBasicAuthHandler to build_opener.
The default openers have normal handlers: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
The top_level_url in the code can actually be either a full URL (including the "http:" scheme component plus the hostname and an optional port number), for example "http://example.com/", or an "authority" (i.e. a hostname, optionally followed by a port number), for example "example.com" or "example.com:8080" (the latter includes a port number).
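A short sketch of the two equivalent forms, reusing the illustrative credentials from the example above:

# Full URL form: scheme, hostname, and path.
password_mgr.add_password(None, "http://example.com/foo/", 'why', '1223')

# "Authority" form: hostname plus optional port number.
password_mgr.add_password(None, "example.com:8080", 'why', '1223')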
The above is the content of [Python] Web Crawler (4): Introduction and example applications of Opener and Handler. For more related content, please follow the PHP Chinese website (www.php.cn)!
