How does a Python crawler crawl page data requested with GET? (with code)

Author: 不言 (original article)
Published: 2018-09-15 14:40:24

This article shows how a Python crawler crawls page data requested with GET, with code. It may serve as a reference for readers who need it.

1. urllib library

urllib is a library that ships with Python and is commonly used for crawlers. Its main job is to simulate a browser sending requests from code. The submodules used most often are urllib.request and urllib.parse in Python 3; in Python 2 the equivalents are the urllib and urllib2 modules.

2. Crawler programs, from easy to difficult:

1. Crawl all the data on the Baidu home page

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Import the modules
import urllib.request
import urllib.parse
if __name__ == "__main__":
    # URL of the page to crawl
    url = 'http://www.baidu.com/'
    # urlopen sends a request to the given URL and returns a response object
    response = urllib.request.urlopen(url=url)
    # read() returns the data the server sent back to the client (the crawled data)
    data = response.read()  # the returned data is bytes, not a string
    print(data)  # print the crawled data

Supplementary notes
urlopen function prototype:

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

(The cafile, capath, and cadefault parameters have been deprecated since Python 3.6 and were removed in Python 3.13.)

In the example above we passed only the first parameter, url. In everyday development, the two parameters used most often are url and data.

url parameter: the URL to send the request to.

data parameter: the parameters carried by a POST request. They can be collected in a dictionary, URL-encoded, converted to bytes, and passed to this parameter (no need to understand this yet; it is covered later).
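The effect of the data parameter can be checked without sending anything. The sketch below is only an illustration: the endpoint http://example.com/search and the form fields are made up, and urllib.request.Request (introduced in item 5) is used purely to inspect which HTTP method would be chosen.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
url = "http://example.com/search"
form = {"q": "python", "page": "1"}

# data must be bytes: URL-encode the dictionary, then encode the string.
payload = urllib.parse.urlencode(form).encode("ascii")

get_request = urllib.request.Request(url)                 # no data -> GET
post_request = urllib.request.Request(url, data=payload)  # data given -> POST

print(get_request.get_method())
print(post_request.get_method())
print(payload)
```

Nothing is sent here; the request objects are only built and inspected.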

Commonly used members of the response object returned by urlopen:

response.headers: Get the response header information (an attribute, so it is written without parentheses)
response.getcode(): Get the response status code
response.geturl(): Get the requested url
response.read(): Get the data in the response (bytes type)
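These members can be tried without touching a real site. The following is a minimal sketch, assuming a throwaway server on the local machine: HelloHandler and inspect_response are hypothetical names, and the request goes through an opener with proxies disabled (it returns the same kind of response object as urlopen) so the loopback request is not routed through a system proxy.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def inspect_response():
    """Start a local server, request its page once, and collect the response details."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An opener without proxies; opener.open() works like urlopen().
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return {
                "status": response.getcode(),
                "url": response.geturl(),
                "content_type": response.headers.get("Content-Type"),
                "body": response.read(),
                "requested": url,
            }
    finally:
        server.shutdown()
        server.server_close()


info = inspect_response()
print(info["status"], info["url"], info["content_type"])
print(info["body"])
```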

2. Crawl the Baidu News home page and write the data to a file for storage

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
if __name__ == "__main__":
    url = 'http://news.baidu.com/'
    response = urllib.request.urlopen(url=url)
    # decode() converts the bytes in the response to a string (UTF-8 by default)
    data = response.read().decode()
    # Open news.html in 'w' (write) mode and write the decoded text to it
    with open('./news.html', 'w', encoding='utf-8') as fp:
        fp.write(data)
    print('Finished writing the file')
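The difference between bytes and str is the step that most often goes wrong here, so it may help to see it in isolation. This is a small offline sketch: the HTML snippet is made up and stands in for response.read(), and the file is written to a temporary directory.

```python
import os
import tempfile

# Stand-in for response.read(): UTF-8 bytes of a tiny HTML page (made-up content).
raw = "<title>百度新闻</title>".encode("utf-8")

text = raw.decode("utf-8")  # bytes -> str

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "news.html")
    # Text mode needs an explicit encoding; otherwise the platform default is used.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)
    with open(path, "rb") as fp:
        saved = fp.read()

print(type(raw).__name__, type(text).__name__)
print(saved == raw)
```

Passing encoding='utf-8' to open keeps the written file identical to the bytes that were downloaded, regardless of the operating system's default encoding.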

3. Crawl an image from the network and store it locally

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# The next two lines skip HTTPS certificate verification, because the URL requested below uses HTTPS.
# They are not needed when the request is not HTTPS.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
if __name__ == "__main__":
    # This URL uses the HTTPS protocol
    url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1536918978042&di=172c5a4583ca1d17a1a49dba2914cfb9&imgtype=0&src=http%3A%2F%2Fimgsrc.baidu.com%2Fimgad%2Fpic%2Fitem%2F0dd7912397dda144f04b5d9cb9b7d0a20cf48659.jpg'
    response = urllib.request.urlopen(url=url)
    data = response.read()  # image data is binary, so no decode() conversion is needed
    with open('./money.jpg', 'wb') as fp:
        fp.write(data)
    print('Finished writing the file')
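The two ssl lines in the script above change the default for every HTTPS request in the process through a private attribute. A narrower alternative, sketched here, is to pass an explicit context through the context parameter shown in the urlopen prototype. make_unverified_context and fetch_bytes are hypothetical helper names, and skipping verification is insecure, so this is only reasonable for experiments.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate checks (insecure; demo only)."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_bytes(url, context):
    """Hypothetical helper: download url using the given SSL context."""
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.read()


context = make_unverified_context()
print(context.check_hostname)
print(context.verify_mode == ssl.CERT_NONE)
```

fetch_bytes is defined but not called, so the sketch runs offline; calling fetch_bytes(url, context) with the image URL would replace the urlopen call in the script.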

4. A property of URLs: a URL may contain only ASCII characters. If a URL written in the crawler code contains non-ASCII characters, they must be encoded into ASCII (percent-encoded) before the URL can be used.
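A quick offline look at what this encoding produces, using urllib.parse.quote for a single value and urllib.parse.urlencode for a dictionary of parameters:

```python
import urllib.parse

term = "周杰伦"

# quote() percent-encodes a single value, using its UTF-8 bytes.
encoded_term = urllib.parse.quote(term)

# urlencode() does the same for every key and value in a dictionary
# and joins the pairs with '&'.
query = urllib.parse.urlencode({"ie": "utf-8", "wd": term})

print(encoded_term)  # %E5%91%A8%E6%9D%B0%E4%BC%A6
print(query)         # ie=utf-8&wd=%E5%91%A8%E6%9D%B0%E4%BC%A6

# The encoding is reversible, so no information is lost.
print(urllib.parse.unquote(encoded_term) == term)
```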

Case: crawl the Baidu search results page for a given term (for example, the results for the term '周杰伦', Jay Chou)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
if __name__ == "__main__":
    # The original URL contains non-ASCII characters, so it cannot be used as it is.
    #url = 'http://www.baidu.com/s?ie=utf-8&wd=周杰伦'
    # Handle the non-ASCII data in the URL
    url = 'http://www.baidu.com/s?'
    # Put the non-ASCII data in a dictionary; it usually appears in the key-value request parameters after '?'
    param = {
        'ie':'utf-8',
        'wd':'周杰伦'
    }
    # urlencode in the parse submodule encodes the non-ASCII values in the dictionary as ASCII
    param = urllib.parse.urlencode(param)
    # Join the encoded parameters to the base URL to form a complete, usable URL
    url = url + param
    print(url)
    response = urllib.request.urlopen(url=url)
    data = response.read()
    with open('./周杰伦.html','wb') as fp:
        fp.write(data)
    print('Finished writing the file')

5. Customize the request object to disguise the identity of the crawler's requests.

When we covered the common HTTP request headers earlier, we introduced the User-Agent header, UA for short. It identifies the carrier of the request. When a request is sent from a browser, the browser is the carrier, and the UA value is a string that identifies that browser. When a request is sent from a crawler program, the crawler is the carrier, and the UA is a string that identifies the crawler. Some websites inspect the UA to decide whether a request comes from a crawler. If it does, they return no response, and the crawler cannot get the site's data. This is a basic anti-crawling technique. To avoid the problem, we can disguise the crawler's UA as that of a browser.
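To see what a site observes when the UA is left unchanged, the default value can be read from urllib without sending a request. This is a small inspection sketch; the version suffix in the output depends on the Python in use.

```python
import urllib.request

# build_opener() creates the same kind of opener that urlopen uses by default.
opener = urllib.request.build_opener()

# addheaders lists the headers the opener adds to every request it sends.
default_headers = dict(opener.addheaders)
default_ua = default_headers["User-agent"]

print(default_ua)
```

The value begins with Python-urllib/, which is easy for a server to recognize as a script rather than a browser.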

In the examples above, we sent requests by passing a URL to urlopen in the request module. The request object used there is the default one built by urllib, and its UA cannot be changed that way. urllib also lets us build a custom request object, and through it we can disguise (change) the UA.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse

# Skip HTTPS certificate verification (only matters when the URL uses HTTPS)
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

if __name__ == "__main__":
    # The original URL contains non-ASCII characters, so it cannot be used as it is.
    #url = 'http://www.baidu.com/s?ie=utf-8&wd=周杰伦'
    # Handle the non-ASCII data in the URL
    url = 'http://www.baidu.com/s?'
    # Put the non-ASCII data in a dictionary; it usually appears in the key-value request parameters after '?'
    param = {
        'ie':'utf-8',
        'wd':'周杰伦'
    }
    # urlencode in the parse submodule encodes the non-ASCII values in the dictionary as ASCII
    param = urllib.parse.urlencode(param)
    # Join the encoded parameters to the base URL to form a complete, usable URL
    url = url + param
    # Put a browser's UA value in a dictionary. The value can be copied from any request shown in a packet capture tool or the browser's developer tools.
    headers={
        'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    # Build a custom request object
    # Parameters: url is the request URL; headers carries the UA value; data holds POST request parameters (covered later)
    request = urllib.request.Request(url=url,headers=headers)

    # Send the custom request (its UA is now disguised)
    response = urllib.request.urlopen(request)

    data=response.read()

    with open('./周杰伦.html','wb') as fp:
        fp.write(data)
    print('Finished writing the data')
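Before sending such a request, it can be useful to confirm what the request object actually holds. The sketch below builds the same request and inspects it offline; note that Request stores header names in capitalized form, so the header is looked up as "User-agent".

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com/s?" + urllib.parse.urlencode({"ie": "utf-8", "wd": "周杰伦"})
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
}
request = urllib.request.Request(url=url, headers=headers)

# Request.add_header() capitalizes the name, so 'User-Agent' is stored as 'User-agent'.
print(request.get_header("User-agent"))
print(request.get_method())   # GET, because no data was supplied
print(request.full_url)
```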
