[Python] Web Crawler (4): Introduction and practical applications of Opener and Handler

Before moving on, let's first explain two methods in urllib2: info() and geturl().

The response object returned by urlopen (or the HTTPError instance raised on failure) has two very useful methods: info() and geturl().

1. geturl():

This returns the URL that was actually retrieved. It is useful because urlopen (or the opener object you use) may follow redirects, so the URL you end up with can differ from the one you requested.

Take a short link from Renren as an example.


We write urllib2_test10.py to compare the original URL with the one it redirects to:

from urllib2 import Request, urlopen

# A short link that redirects to the real page
old_url = 'http://rrurl.cn/b1UZuP'
req = Request(old_url)
response = urlopen(req)

print 'Old url :' + old_url
# geturl() returns the final URL, after any redirects
print 'Real url :' + response.geturl()

Run it and you can see the real URL that the short link points to.


2. info():

This returns a dictionary-like object describing the page that was fetched, typically the headers sent by the server. It is currently an instance of httplib.HTTPMessage.

Typical headers include "Content-Length", "Content-Type", and so on.


We write urllib2_test11.py to test info():

from urllib2 import Request, urlopen

old_url = 'http://www.baidu.com'
req = Request(old_url)
response = urlopen(req)

print 'Info():'
# info() returns the server's headers as an httplib.HTTPMessage
print response.info()

Run it and you can see the page's header information printed.


Let’s talk about two important concepts in urllib2: Openers and Handlers.

1. Openers:

When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector).

Normally we use the default opener, via urlopen.

But you can also create custom openers.

2. Handlers:

Openers use handlers; all the "heavy lifting" is done by the handlers.

Each handler knows how to open URLs over a specific protocol, or how to handle various aspects of opening a URL.

For example, HTTP redirection or HTTP cookies.


You will want to create an opener when you need to fetch URLs with specific handlers installed, for example an opener that handles cookies, or one that does not follow redirects.


To create an opener, you can instantiate an OpenerDirector and then call .add_handler(some_handler_instance) for each handler you need.

Alternatively, you can use build_opener, a more convenient function for creating opener objects that needs only a single call.
build_opener adds several handlers by default, and provides a quick way to add more handlers or override the default ones.
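
For instance, here is a minimal sketch of both approaches, building an opener that handles cookies (the cookie jar and the target URL are illustrative, not from the original article):

import urllib2
import cookielib

# A cookie jar that stores cookies across requests
cj = cookielib.CookieJar()

# build_opener keeps the default handlers and adds ours on top
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Roughly equivalent, built by hand:
# director = urllib2.OpenerDirector()
# director.add_handler(urllib2.HTTPHandler())
# director.add_handler(urllib2.HTTPCookieProcessor(cj))

response = opener.open('http://www.baidu.com/')
print response.geturl()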

Other handlers you may want deal with proxies, authentication, and other common but somewhat specialized situations.


install_opener is used to set a (global) default opener. This means that calls to urlopen will use the opener you installed.

Opener objects have an open method.

This method can be used directly to fetch URLs just like the urlopen function; calling install_opener is not usually necessary, only convenient.
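
For example (a quick sketch; the URL is just a placeholder), opener.open behaves like urlopen, and install_opener makes the opener global:

import urllib2

# An opener with just the default handlers
opener = urllib2.build_opener()

# open() can be used directly, like urlopen
print opener.open('http://www.baidu.com/').geturl()

# After install_opener, urllib2.urlopen goes through this opener
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.baidu.com/').geturl()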


With those two concepts covered, let's look at basic authentication, which makes use of the Opener and Handler described above.

Basic Authentication

To demonstrate creating and installing a handler, we will use HTTPBasicAuthHandler.

When basic authentication is required, the server responds with a 401 status code and a header asking the client to authenticate. The header specifies the authentication scheme and a 'realm', and looks like this: WWW-Authenticate: SCHEME realm="REALM".

For example:

WWW-Authenticate: Basic realm="cPanel Users"
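
If you want to see this header for yourself, you can catch the HTTPError that urlopen raises on a 401 response. A small sketch, assuming some URL that actually requires basic authentication:

import urllib2

try:
    urllib2.urlopen('http://example.com/protected/')  # placeholder URL
except urllib2.HTTPError, e:
    if e.code == 401:
        # The WWW-Authenticate header names the scheme and the realm
        print e.info().getheader('www-authenticate')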

The client must then repeat the request with the correct username and password in the request headers.

This is "basic authentication". In order to simplify this process, we can create an instance of HTTPBasicAuthHandler and let opener use this handler.


HTTPBasicAuthHandler uses a password-manager object that maps URLs and realms to usernames and passwords.

If you know what the realm is (from the header the server sends), you can use HTTPPasswordMgr.


Usually people don't care what the realm is; in that case, the convenient HTTPPasswordMgrWithDefaultRealm can be used.

It lets you specify a default username and password for a URL, which will be used unless you supply a different combination for a specific realm.

We indicate this situation by passing None as the realm argument to add_password.


The top-level URL is the first URL that requires authentication; URLs 'deeper' than the one you pass to .add_password() will also match.

That's enough theory; let's demonstrate all of the above with an example.


We write urllib2_test12.py to test basic authentication:

import urllib2

# Create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password
top_level_url = "http://example.com/foo/"

# If we knew the realm, we could use it in place of ``None``:
# password_mgr.add_password(realm, top_level_url, username, password)
password_mgr.add_password(None, top_level_url, 'why', '1223')

# Create a new handler
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (an OpenerDirector instance)
opener = urllib2.build_opener(handler)

a_url = 'http://www.baidu.com/'

# Use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen will use our opener.
urllib2.install_opener(opener)

Note: in the above example we only gave our HTTPBasicAuthHandler to build_opener.

By default, openers have these handlers: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
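
You can verify this yourself: an OpenerDirector keeps its handlers in a handlers attribute (not formally documented, but visible in the urllib2 source), so a short sketch like the following prints what a default opener carries:

import urllib2

opener = urllib2.build_opener()
for handler in opener.handlers:
    print handler.__class__.__name__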

The top_level_url in the code can be a full URL (including the "http:" scheme component, the host name, and optionally the port number), for example "http://example.com/".

It can also be an "authority", i.e. a host name and an optional port number, for example "example.com" or "example.com:8080" (the latter includes a port number).

That's all for [Python] Web Crawler (4): Introduction and practical applications of Opener and Handler. For more related content, please follow the PHP Chinese website (www.php.cn)!

