
[Python] Web crawler (2): Use urllib2 to crawl web content through the specified URL

黄舟 | Original | 2017-01-21

Python version: 2.7.5. urllib2 was reorganized in Python 3 (into urllib.request and urllib.error), so Python 3 users should look for a different tutorial.

Web page crawling means reading the network resource specified by a URL from the network stream and saving it locally.
It is similar to using a program to simulate what a browser does: the URL is sent to the server as the content of an HTTP request, and the server's response is then read back.


In Python, we use the urllib2 module to crawl web pages.
urllib2 is a Python module for fetching URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 application code only requires four lines.

Let's create a new file urllib2_test01.py to get a feel for what urllib2 does:

import urllib2
# Open the URL and get a response object back
response = urllib2.urlopen('http://www.baidu.com/')
# Read the body of the response as a string
html = response.read()
print html

Press F5 (run the script in IDLE) to see the result:

(Screenshot: the HTML source of the Baidu homepage printed to the console.)


If you open the Baidu homepage in a browser, right-click, and choose View Page Source (in either Firefox or Google Chrome), you will find the content is exactly the same.

In other words, these four lines of code print out everything the browser receives when we visit Baidu.

This is the simplest example of urllib2.
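
Since the goal stated at the beginning is to save the resource locally, here is a minimal sketch that writes the fetched page to disk (the filename baidu.html is just an illustration):

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
# Write the raw bytes to a local file; the filename is arbitrary.
with open('baidu.html', 'wb') as f:
    f.write(html)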


In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
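
The same urlopen interface works for these schemes, too. For example, a local file can be read through a "file:" URL (the path below is hypothetical; substitute a file that exists on your machine):

import urllib2

# 'file:' URLs are handled by the same urlopen interface as 'http:' URLs.
response = urllib2.urlopen('file:///tmp/example.txt')  # hypothetical path
print response.read()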

HTTP is based on a request/response mechanism:

the client makes a request, and the server returns a response.


urllib2 uses a Request object to represent the HTTP request you are making.

In its simplest form, you create a Request object with the address you want to fetch.

Calling urlopen with this Request object returns a response object for the requested URL.

This response object behaves like a file object, so you can call .read() on it.

Let’s create a new file urllib2_test02.py to get a feel for it:

import urllib2
# Build a Request object for the target URL
req = urllib2.Request('http://www.baidu.com')
# urlopen accepts a Request object as well as a plain URL string
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

You can see that the output content is the same as test01.
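
Beyond .read(), the response object offers a few other useful methods, all standard in urllib2 (Python 2.6+). A quick sketch:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
print response.geturl()    # the URL actually fetched (after any redirects)
print response.getcode()   # the HTTP status code, e.g. 200
print response.info()      # the response headers
print response.read(100)   # like a file, you can read a limited number of bytes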

urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')

In the case of HTTP, the Request object allows you to do two extra things when making the request.


1. Sending form data

Anyone who has done Web development will be familiar with this.

Sometimes you want to send data to a URL (usually a URL that points to a CGI [Common Gateway Interface] script or some other web application).

In HTTP, this is usually done with the well-known POST request.

This is what your browser does when you submit an HTML form.

Not all POSTs come from forms; you can use POST to submit arbitrary data to your own application.

For ordinary HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument.

The encoding is done with functions from urllib, not urllib2.

Let’s create a new file urllib2_test03.py to get a feel for it:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'

values = {'name' : 'WHY',
          'location' : 'SDU',
          'language' : 'Python'}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # passing data makes this a POST request
response = urllib2.urlopen(req)   # send the request and receive the response
the_page = response.read()        # read the body of the response


If the data argument is not passed, urllib2 uses the GET request method instead.

One difference between GET and POST requests is that POST requests often have "side effects":

they change the state of the system in some way (for example, placing an order for goods to be delivered to your door).

In a GET request, data can also be transmitted by encoding it into the URL itself.

import urllib2
import urllib

data = {}

data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values
# prints something like: name=WHY&language=Python&location=SDU
# (the exact order may vary, since dict ordering is arbitrary in Python 2)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values

response = urllib2.urlopen(full_url)  # note: urlopen, not urllib2.open

This is how data is transmitted with a GET request.
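
urlencode takes care of escaping characters that are unsafe in a URL. A quick sketch of the escaping it performs (urllib.quote_plus is what urlencode applies to each key and value):

import urllib

# Spaces become '+', other unsafe characters become %XX escapes.
print urllib.quote_plus('hello world/&?')        # hello+world%2F%26%3F
print urllib.urlencode({'q': 'python urllib2'})  # q=python+urllib2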



2. Setting headers on the HTTP request

Some sites do not like being visited by programs (non-human visits), or serve different content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.7).

This identity may confuse the site, or simply cause the request to be refused.

Browsers identify themselves through the User-Agent header. When you create a Request object, you can give it a dictionary of headers.

The following example sends the same data as above, but identifies itself as Internet Explorer.

(Thanks to readers for pointing out that this demo URL is no longer available; the principle still holds.)

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'WHY',
          'location' : 'SDU',
          'language' : 'Python'}

headers = { 'User-Agent' : user_agent }    # headers are passed as a dict
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)  # url, POST data, and headers
response = urllib2.urlopen(req)
the_page = response.read()
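
As an alternative to passing a headers dictionary, a header can also be set on an existing Request object with add_header, a standard urllib2.Request method:

import urllib2

req = urllib2.Request('http://www.baidu.com/')
# add_header sets (or replaces) a single header on the Request.
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib2.urlopen(req)
print response.getcode()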

The above is [Python] Web crawler (2): Use urllib2 to crawl web content through the specified URL. For more related content, please follow the PHP Chinese website (www.php.cn)!

