Using Python to obtain network data
Obtaining data from the Internet is a very common task in Python. The requests library is an HTTP client for Python that makes it easy to send HTTP requests to web servers.
We can use requests to send an HTTP request to a specified URL with the following code:
import requests

response = requests.get('http://www.example.com')
Here, the response object holds the response returned by the server, and response.text gives the text content of that response.
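As a minimal sketch (reusing the hypothetical URL from the example above), it is also worth checking the status code before reading the body:

import requests

response = requests.get('http://www.example.com')
print(response.status_code)   # 200 if the request succeeded
print(response.text[:200])    # first 200 characters of the response body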
In addition, we can obtain binary resources with the following code:
import requests

response = requests.get('http://www.example.com/image.png')
with open('image.png', 'wb') as f:
    f.write(response.content)
Here, response.content holds the binary data returned by the server.
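For large files, a minimal sketch (the URL here is a hypothetical placeholder) that streams the download in chunks instead of loading the whole body into memory:

import requests

# stream=True defers downloading the body until it is iterated over
response = requests.get('http://www.example.com/video.mp4', stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)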
Writing crawler code
A crawler is an automated program that fetches web page data over the network and stores it in a database or file. Crawlers are widely used in data collection, information monitoring, content analysis, and other fields. Python is a popular language for writing crawlers because it is easy to learn, requires little code, and has a rich ecosystem of libraries.
We take "Douban Movie" as an example to introduce how to use Python to write crawler code. First, we use the requests library to get the HTML code of the web page, then treat the entire code as a long string, and use the capture group of the regular expression to extract the required content from the string.
The address of the Douban Movie Top250 page is https://movie.douban.com/top250?start=0, where the start parameter indicates which movie to start listing from. Each page displays 25 movies, so fetching the full Top250 requires visiting 10 pages, with addresses of the form https://movie.douban.com/top250?start=xxx. When xxx is 0 we get the first page, and when xxx is 100 we get the fifth page, as the snippet below illustrates.
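A one-loop sketch of that arithmetic, printing the ten URLs the crawler will visit:

for page in range(1, 11):
    start = (page - 1) * 25
    print(f'https://movie.douban.com/top250?start={start}')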
We take extracting each movie's title and rating as an example. The code is as follows:
import random
import re
import time

import requests

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    )
    # Match span tags whose class is "title" and whose body does not start
    # with "&", using a capturing group to extract the movie title
    pattern1 = re.compile(r'<span class="title">([^&]*?)</span>')
    titles = pattern1.findall(resp.text)
    # Match span tags whose class is "rating_num", capturing the rating
    pattern2 = re.compile(r'<span class="rating_num".*?>(.*?)</span>')
    ranks = pattern2.findall(resp.text)
    # Zip the two lists together and iterate over all title/rating pairs
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # Sleep for a random 1-5 seconds to avoid crawling pages too frequently
    time.sleep(random.random() * 4 + 1)
In the above code, regular expressions match the span tags that hold the title and the rating, and capturing groups extract the tag contents. zip then pairs up the two lists so we can loop over all movie titles and ratings together. The results can also be saved to a file, as sketched below.
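Since crawled data is usually stored in a database or file, here is a minimal sketch that writes the title/rating pairs to a CSV file; the sample lists stand in for the titles and ranks collected by the crawler above.

import csv

# Sample data standing in for the lists collected by the crawler above
titles = ['The Shawshank Redemption', 'Farewell My Concubine']
ranks = ['9.7', '9.6']

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])
    for title, rank in zip(titles, ranks):
        writer.writerow([title, rank])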
Using an IP proxy
Many websites are hostile to crawlers, because crawlers consume a lot of their network bandwidth and generate a lot of invalid traffic. To hide your identity, you usually need to access such websites through an IP proxy. Commercial IP proxies (such as Mushroom Proxy, Sesame Proxy, Fast Proxy, etc.) are a good choice: they prevent the crawled website from learning the real IP address of the crawler, so the crawler cannot simply be blocked by IP address.
Taking Mushroom Proxy as an example, you can register an account on its website and purchase a package to obtain commercial proxy access. Mushroom Proxy offers two access modes: an API private proxy, where you obtain the proxy server address by requesting its API interface, and an HTTP tunnel proxy, where you directly use a unified proxy server IP and port.
The code for using an IP proxy is as follows:
import requests

proxies = {
    'http': 'http://username:password@ip:port',
    'https': 'https://username:password@ip:port'
}
response = requests.get('http://www.example.com', proxies=proxies)
Here, username and password are the username and password of your Mushroom Proxy account, and ip and port are the IP address and port number of the proxy server. Note that different proxy providers may use different access methods, so the code needs to be adapted to the provider you actually use.
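For the API private proxy mode mentioned above, a minimal sketch of the flow; the API URL and its plain-text 'ip:port' response format here are entirely hypothetical placeholders, not Mushroom Proxy's real interface.

import requests

# Hypothetical endpoint that returns one 'ip:port' proxy address as plain text;
# substitute your provider's real API URL and parse its actual response format.
api_url = 'https://proxy-provider.example.com/api/get_proxy'
proxy_addr = requests.get(api_url).text.strip()

proxies = {
    'http': f'http://{proxy_addr}',
    'https': f'http://{proxy_addr}',
}
response = requests.get('http://www.example.com', proxies=proxies)
print(response.status_code)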