Basics of using the urllib.request library
Web page crawling means reading the network resource identified by a URL from the network stream and saving it locally. Python offers many libraries for crawling web pages; let's start with urllib.request (known as urllib2 in Python 2.x).
First, look at the following code:
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

# Import the urllib.request library
import urllib.request

# Send a request to the specified URL and get back a file-like object
# wrapping the server's response
response = urllib.request.urlopen("http://www.baidu.com/")

# The file-like object supports the usual file methods; read() returns
# the entire response body (as bytes in Python 3)
html = response.read()

# Print the response body
print(html)
In fact, if you open the Baidu homepage in a browser, right-click, and select "View Source", you will find the output is exactly the same as what the program above prints. In other words, those few lines of code have already crawled the entire source of Baidu's homepage.
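Since read() returns bytes in Python 3, you may want to decode it to text before comparing it with the browser's view-source output. A minimal sketch, assuming the page is served as UTF-8 (in practice, check the charset the page declares):

#!/usr/bin/python3
# -*- coding:utf-8 -*-

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com/")
html = response.read()

# read() gives bytes; decode to a str before doing text processing
# (UTF-8 is an assumption about the page's encoding)
text = html.decode("utf-8")

# Print only the first 200 characters to keep the output short
print(text[:200])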
In the example above, the argument to urlopen() is a URL.
But if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(), with the URL to be accessed passed as a parameter of the Request instance.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

# Import the urllib.request library
import urllib.request

# The URL is passed to Request() to construct and return a Request object
request = urllib.request.Request("http://www.baidu.com/")

# Send this request to the server
response = urllib.request.urlopen(request)

html = response.read()
print(html)
The result is exactly the same as before.
Besides the url parameter, a newly created Request instance accepts two other parameters:
1. data (empty by default): the data to submit along with the URL (such as POST data); when set, the HTTP request method changes from GET to POST, as shown in the sketch after this list.
2. headers (empty by default): a dictionary of HTTP header key-value pairs to send.
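Here is a minimal sketch of the data parameter (the target URL httpbin.org/post and the form field are illustrative assumptions, not from the original example):

#!/usr/bin/python3
# -*- coding:utf-8 -*-

import urllib.request
import urllib.parse

# Encode the form fields as bytes; supplying data switches the
# request method from GET to POST
data = urllib.parse.urlencode({"key": "value"}).encode("utf-8")

request = urllib.request.Request("http://httpbin.org/post", data=data)
response = urllib.request.urlopen(request)

# httpbin echoes the posted form back in its JSON response
print(response.read())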
If we want our crawler to behave more like a real user, the first step is to disguise it as a recognized browser: different browsers send different User-Agent headers with their requests. The default User-Agent header of urllib.request is Python-urllib/x.y, where x and y are the Python major and minor version numbers, e.g. Python-urllib/3.5.
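If you are curious what default your installation would send, one way is to inspect the default headers on an opener (a small sketch, not part of the original article):

#!/usr/bin/python3
# -*- coding:utf-8 -*-

import urllib.request

# The headers urllib.request attaches by default live on the opener;
# this prints something like [('User-agent', 'Python-urllib/3.5')]
opener = urllib.request.build_opener()
print(opener.addheaders)

To masquerade as Chrome instead, pass a browser User-Agent through the headers parameter: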
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

# Import the urllib.request library
import urllib.request

# Chrome's User-Agent, carried in the header dictionary
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Build the Request from the URL together with headers, so the request
# carries the Chrome browser's User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# Send this request to the server
response = urllib.request.urlopen(request)

html = response.read()
print(html)
Adding specific headers to an HTTP request lets you construct a complete HTTP request message. You can add or modify a specific header by calling Request.add_header(), and you can look up an existing header by calling Request.get_header().
Add a specific header
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

# Import the urllib.request library
import urllib.request

# Chrome's User-Agent, carried in the header dictionary
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Build the Request from the URL together with headers, so the request
# carries the Chrome browser's User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# You can also add/modify a specific header by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# Send this request to the server
response = urllib.request.urlopen(request)

html = response.read()
print(html)
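To read a header back with Request.get_header(), note that urllib.request stores header names with only the first letter capitalized. A short sketch of this quirk (the queried names below are illustrative):

#!/usr/bin/python3
# -*- coding:utf-8 -*-

import urllib.request

request = urllib.request.Request("http://www.baidu.com/")
request.add_header("Connection", "keep-alive")

# Header names are normalized internally, so query "Connection",
# not "CONNECTION" or "connection"
print(request.get_header("Connection"))        # keep-alive

# A header that was never set returns the given default (None if omitted)
print(request.get_header("Accept", "not set")) # not set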