
Detailed example of python3 using the requests module to crawl page content

巴扎黑 (Original) · 2017-09-26

This article walks through a practical example of using Python 3 and the requests module to crawl page content. It should be a useful reference for anyone interested in learning more.

1. Install pip

My desktop runs Linux Mint, which does not ship with pip installed. Since pip will be needed later to install the requests module, installing pip is the first step.


$ sudo apt install python-pip

Once the installation succeeds, check the pip version:


$ pip -V

2. Install the requests module

Here I installed it through pip:


$ pip install requests

Run import requests; if no error is raised, the installation was successful.

(Screenshot: verifying that the installation succeeded.)
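If you prefer something more explicit than a bare import, the following sketch prints the installed version from the Python 3 interpreter; the version string shown is only a placeholder and will depend on your environment:


>>> import requests

>>> requests.__version__

'2.x.x'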

3. Install beautifulsoup4

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree through your favorite parser, and it can save you hours or even days of work.


$ sudo apt-get install python3-bs4

Note: the command above is the Python 3 installation method. If you are using Python 2, you can install it with the following command instead.


$ sudo pip install beautifulsoup4
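As a quick sanity check that beautifulsoup4 is usable, here is a minimal sketch (the HTML snippet is made up purely for illustration) that parses a string with the built-in html.parser and pulls out a tag:


>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup('<h1 id="title">Hello, Beautiful Soup</h1>', 'html.parser')

>>> soup.find(id='title').get_text()

'Hello, Beautiful Soup'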

4. A brief analysis of the requests module

1) Send a request

First of all, of course, import the requests module:


>>> import requests

Then fetch the target web page. Here I use the following URL as an example:


>>> r = requests.get('http://www.jb51.net/article/124421.htm')

This returns a response object named r, from which we can get all the information we want. The get here corresponds to the HTTP GET request method, and by analogy you can replace it with put, delete, post, or head.
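For completeness, the other verbs mentioned above map to top-level functions of the same name. A hedged sketch follows; httpbin.org is used here only as a convenient echo service and is not part of the original example:


>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})

>>> r = requests.put('https://httpbin.org/put', data={'key': 'value'})

>>> r = requests.delete('https://httpbin.org/delete')

>>> r = requests.head('https://httpbin.org/get')   # returns headers only, no body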

2) Pass URL parameters

Sometimes we want to pass data in the URL's query string. If you were building the URL by hand, the data would be placed after a question mark as key/value pairs, for example cnblogs.com/get?key=val. Requests lets you provide these parameters as a dictionary of strings via the params keyword argument.

For example, when we search Google for the keyword "python crawler", parameters such as newwindow (open in a new window), q and oq (the search keywords) can be assembled into the URL by hand, or you can use the following code:


>>> payload = {'newwindow': '1', 'q': 'python爬虫', 'oq': 'python爬虫'}

>>> r = requests.get("https://www.google.com/search", params=payload)

3) Response content

Get the page response content through r.text or r.content.


>>> import requests

>>> r = requests.get('https://github.com/timeline.json')

>>> r.text

Requests automatically decodes content from the server, and most Unicode character sets can be decoded seamlessly. Here is a small note on the difference between r.text and r.content. To put it simply:

r.text returns Unicode (str) data;

r.content returns data of bytes type, i.e. raw binary data;

So if you want text, use r.text; if you want an image or other file, use r.content.
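As a hedged illustration of the difference, the sketch below writes an image to disk using r.content; the image URL is a placeholder you would replace with a real one:


>>> r = requests.get('https://example.com/logo.png')   # placeholder URL

>>> with open('logo.png', 'wb') as f:

...     f.write(r.content)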

4) Get the web page encoding


>>> r = requests.get('http://www.cnblogs.com/')

>>> r.encoding

'utf-8'
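If requests guesses the encoding wrong (a common cause of garbled Chinese text), you can override r.encoding before reading r.text. A minimal sketch that re-detects the encoding from the response body itself:


>>> r = requests.get('http://www.cnblogs.com/')

>>> r.encoding                         # guessed from the HTTP headers

'utf-8'

>>> r.encoding = r.apparent_encoding   # re-detect from the response body

>>> text = r.text                      # decoded with the corrected encoding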

5) Get the response status code

We can detect the response status code:


>>> r = requests.get('http://www.cnblogs.com/')

>>> r.status_code

200
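Beyond comparing the code to 200 by hand, requests ships with a couple of conveniences; a small sketch:


>>> r = requests.get('http://www.cnblogs.com/')

>>> r.status_code == requests.codes.ok   # built-in alias for 200

True

>>> r.raise_for_status()                 # raises requests.HTTPError for 4xx/5xx responses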

5. Case Demonstration

The company recently introduced an OA system, so here I take its official documentation page as an example and capture only the useful information on the page, such as the article title and content.

Demo environment

Operating system: linuxmint

Python version: python 3.5.2

Using modules: requests, beautifulsoup4

The code is as follows:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'GavinHsueh'

import requests
import bs4

# Target page URL to crawl
url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# Fetch the page content; returns a response object
response = requests.get(url)

# Check the response status code
status_code = response.status_code

# Parse the page with BeautifulSoup and locate the content of the target tag
# (the "lxml" parser requires the lxml package to be installed)
content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")
element = content.find_all(id='book')

print(status_code)
print(element)

Running the program returns the crawl result:

(Screenshot: the crawl succeeded.)
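If you want to go a step beyond printing the raw tag list, the sketch below is my own extension of the code above (not part of the original article) and pulls out only the visible text of each matched tag:


# Continuing from the code above: element is the list returned by find_all(id='book')
for tag in element:
    # get_text() strips the markup and returns only the visible text
    print(tag.get_text(strip=True))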

About the problem of garbled crawling results

In fact, I initially used the Python 2 that ships with the system, but I struggled for a long time with garbled encoding in the crawled content, and none of the solutions I googled worked. After being driven half mad by Python 2, I gave in and switched to Python 3. As for the garbled-content problem when crawling pages with Python 2, I welcome more experienced readers to share their solutions so that people like me can avoid the same detour.
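One commonly suggested workaround, offered here only as a hedged sketch rather than the article's own fix, is to decode the raw bytes explicitly instead of relying on the defaults:


import requests

r = requests.get('http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html')
# Decode the raw bytes explicitly; r.apparent_encoding guesses the charset
# from the response body itself rather than from the HTTP headers.
html = r.content.decode(r.apparent_encoding)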

Postscript

Python has many crawler-related modules besides requests, such as urllib, pycurl, and tornado. By comparison, I personally find the requests module the simplest and easiest to use. Through this article you can quickly learn to crawl page content with Python's requests module. My ability is limited, so if there are any mistakes in the article, please let me know. Also, if you have questions about crawling page content with Python, you are welcome to discuss them with everyone.

