How to obtain network data using Python web crawler
Obtaining data from the Internet with Python is a very common task. Python's requests library is an HTTP client library used to make HTTP requests to web servers.
We can use the requests library to send an HTTP request to a specified URL with the following code:
import requests

response = requests.get('http://www.example.com')
The response object contains the response returned by the server; use response.text to get its text content.
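For example, here is a minimal sketch (example.com is only a placeholder address) that checks the status code before printing the page text:

import requests

response = requests.get('http://www.example.com')
# 200 means the request succeeded; other codes indicate redirects or errors
if response.status_code == 200:
    print(response.text)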
In addition, we can also use the following code to obtain binary resources:
import requests

response = requests.get('http://www.example.com/image.png')
with open('image.png', 'wb') as f:
    f.write(response.content)
Use response.content to obtain the binary data returned by the server.
A crawler is an automated program that fetches web page data over the network and stores it in a database or file. Crawlers are widely used in data collection, information monitoring, content analysis, and other fields. Python is a popular language for writing crawlers because it is easy to learn, requires little code, and has a rich ecosystem of libraries.
We take "Douban Movie" as an example to introduce how to use Python to write crawler code. First, we use the requests library to get the HTML code of the web page, then treat the entire code as a long string, and use the capture group of the regular expression to extract the required content from the string.
The address of the Douban Movie Top250 page is https://movie.douban.com/top250?start=0, where the start parameter indicates which movie the page starts from. Each page displays 25 movies, so fetching the full Top250 requires visiting 10 pages, whose addresses have the form https://movie.douban.com/top250?start=xxx. When xxx is 0 we get the first page; when xxx is 100 we get the fifth page.
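A quick way to see the mapping between pages and the start parameter is to generate all 10 addresses (a small illustrative snippet):

# Each page shows 25 movies, so page n corresponds to start=(n-1)*25
for page in range(1, 11):
    start = (page - 1) * 25  # 0, 25, 50, ..., 225
    print(f'https://movie.douban.com/top250?start={start}')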
As an example, we will extract each movie's title and rating. The code is as follows:
import re
import requests
import time
import random

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    )
    # Match span tags whose class is "title" and whose body does not start with "&",
    # then extract the tag content with a capturing group
    pattern1 = re.compile(r'<span class="title">([^&]*?)</span>')
    titles = pattern1.findall(resp.text)
    # Match span tags whose class is "rating_num" and extract the tag content with a capturing group
    pattern2 = re.compile(r'<span class="rating_num".*?>(.*?)</span>')
    ranks = pattern2.findall(resp.text)
    # Zip the two lists and iterate over all movie titles and ratings
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # Sleep for a random 1-5 seconds to avoid requesting pages too frequently
    time.sleep(random.random() * 4 + 1)
In the above code, we use regular expressions to match the span tags that contain the movie title and the rating, and use capturing groups to extract the tag content. We then use zip to pair up the two lists and loop through all movie titles and ratings.
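As mentioned earlier, crawled data is usually stored in a file or database. Here is a minimal sketch of writing the results to a CSV file (the file name douban_top250.csv is arbitrary, and the sample lists stand in for the titles and ranks extracted above):

import csv

titles = ['Movie A', 'Movie B']  # placeholder for the titles list extracted above
ranks = ['9.7', '9.6']           # placeholder for the ranks list extracted above

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])      # header row
    for title, rank in zip(titles, ranks):
        writer.writerow([title, rank])        # one movie per row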
Many websites dislike crawlers because they consume a lot of network bandwidth and generate a lot of invalid traffic. To hide the crawler's identity, you usually need to access the target website through an IP proxy. Commercial IP proxies (such as Mushroom Proxy, Sesame Proxy, and Fast Proxy) are a good choice: they prevent the crawled website from learning the real IP address of the crawler program, so the crawler cannot be blocked simply by its IP address.
Taking Mushroom Proxy as an example, we can register an account on its website and purchase a package to obtain commercial IP proxies. Mushroom Proxy provides two ways to access the proxy: an API private proxy and an HTTP tunnel proxy. The former obtains a proxy server address by calling Mushroom Proxy's API, while the latter directly uses a fixed proxy server IP and port.
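For the API private proxy mode, the general flow is to request the provider's API, read a proxy address from the response, and pass it to requests. The sketch below uses a made-up endpoint and response format for illustration only; it is not Mushroom Proxy's actual API:

import requests

# Hypothetical API endpoint that returns a proxy address as plain text, e.g. "1.2.3.4:8080"
api_url = 'https://proxy-provider.example.com/api/get_proxy'
proxy_addr = requests.get(api_url).text.strip()

proxies = {
    'http': f'http://{proxy_addr}',
    'https': f'http://{proxy_addr}',
}
response = requests.get('http://www.example.com', proxies=proxies)
print(response.status_code)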
The code for using an IP proxy is as follows:
import requests

proxies = {
    'http': 'http://username:password@ip:port',
    'https': 'https://username:password@ip:port'
}
response = requests.get('http://www.example.com', proxies=proxies)
Here, username and password are the username and password of the Mushroom Proxy account, and ip and port are the IP address and port number of the proxy server. Note that different proxy providers may use different access methods, so the code needs to be adjusted to the actual situation.
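As a small convenience (purely illustrative), the proxy URL can be assembled from variables so that only these four values need to be edited when switching providers:

import requests

username = 'your_username'  # account credentials issued by the proxy provider
password = 'your_password'
ip = '1.2.3.4'              # proxy server IP address
port = 8080                 # proxy server port

proxy_url = f'http://{username}:{password}@{ip}:{port}'
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get('http://www.example.com', proxies=proxies)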