How to obtain network data using Python web crawler
Obtaining data from the Internet with Python is a very common task. Python's requests library is an HTTP client library used to make HTTP requests to web servers.
We can use the requests library to send an HTTP request to a specified URL with the following code:
import requests

response = requests.get('http://www.example.com')
The response object contains the response returned by the server; use response.text to get its text content.
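For example, here is a minimal sketch (example.com is only a placeholder address) that checks the status code before printing the page text:

import requests

response = requests.get('http://www.example.com')
# 200 means the request succeeded; other codes indicate redirects or errors
if response.status_code == 200:
    print(response.text)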
In addition, we can also use the following code to obtain binary resources:
import requests

response = requests.get('http://www.example.com/image.png')
with open('image.png', 'wb') as f:
    f.write(response.content)
Use response.content to obtain the binary data returned by the server.
A crawler is an automated program that fetches web page data over the network and stores it in a database or file. Crawlers are widely used in data collection, information monitoring, content analysis, and other fields. Python is a popular language for writing crawlers because it is easy to learn, requires little code, and has a rich ecosystem of libraries.
We take "Douban Movie" as an example to introduce how to use Python to write crawler code. First, we use the requests library to get the HTML code of the web page, then treat the entire code as a long string, and use the capture group of the regular expression to extract the required content from the string.
The address of the Douban Movie Top250 page is https://movie.douban.com/top250?start=0, where the start parameter indicates which movie the page starts from. Each page displays 25 movies, so fetching the full Top250 requires visiting 10 pages, whose addresses have the form https://movie.douban.com/top250?start=xxx. When xxx is 0 we get the first page; when xxx is 100 we get the fifth page.
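A quick way to see the mapping between pages and the start parameter is to generate all 10 addresses (a small illustrative snippet):

# Each page shows 25 movies, so page n corresponds to start=(n-1)*25
for page in range(1, 11):
    start = (page - 1) * 25  # 0, 25, 50, ..., 225
    print(f'https://movie.douban.com/top250?start={start}')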
As an example, we will extract each movie's title and rating. The code is as follows:
import re
import requests
import time
import random

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    )
    # Match span tags whose class is "title" and whose body does not start with "&",
    # then extract the tag content with a capturing group
    pattern1 = re.compile(r'<span class="title">([^&]*?)</span>')
    titles = pattern1.findall(resp.text)
    # Match span tags whose class is "rating_num" and extract the tag content with a capturing group
    pattern2 = re.compile(r'<span class="rating_num".*?>(.*?)</span>')
    ranks = pattern2.findall(resp.text)
    # Zip the two lists and iterate over all movie titles and ratings
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # Sleep for a random 1-5 seconds to avoid requesting pages too frequently
    time.sleep(random.random() * 4 + 1)
In the above code, we use regular expressions to match the span tags that contain the movie title and the rating, and use capturing groups to extract the tag content. We then use zip to pair up the two lists and loop through all movie titles and ratings.
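As mentioned earlier, crawled data is usually stored in a file or database. Here is a minimal sketch of writing the results to a CSV file (the file name douban_top250.csv is arbitrary, and the sample lists stand in for the titles and ranks extracted above):

import csv

titles = ['Movie A', 'Movie B']  # placeholder for the titles list extracted above
ranks = ['9.7', '9.6']           # placeholder for the ranks list extracted above

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])      # header row
    for title, rank in zip(titles, ranks):
        writer.writerow([title, rank])        # one movie per row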
Many websites dislike crawlers because they consume a lot of network bandwidth and generate a lot of invalid traffic. To hide the crawler's identity, you usually need to access the target website through an IP proxy. Commercial IP proxies (such as Mushroom Proxy, Sesame Proxy, and Fast Proxy) are a good choice: they prevent the crawled website from learning the real IP address of the crawler program, so the crawler cannot be blocked simply by its IP address.
Taking Mushroom Proxy as an example, we can register an account on its website and purchase a package to obtain commercial IP proxies. Mushroom Proxy provides two ways to access the proxy: an API private proxy and an HTTP tunnel proxy. The former obtains a proxy server address by calling Mushroom Proxy's API, while the latter directly uses a fixed proxy server IP and port.
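For the API private proxy mode, the general flow is to request the provider's API, read a proxy address from the response, and pass it to requests. The sketch below uses a made-up endpoint and response format for illustration only; it is not Mushroom Proxy's actual API:

import requests

# Hypothetical API endpoint that returns a proxy address as plain text, e.g. "1.2.3.4:8080"
api_url = 'https://proxy-provider.example.com/api/get_proxy'
proxy_addr = requests.get(api_url).text.strip()

proxies = {
    'http': f'http://{proxy_addr}',
    'https': f'http://{proxy_addr}',
}
response = requests.get('http://www.example.com', proxies=proxies)
print(response.status_code)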
The code for using an IP proxy is as follows:
import requests

proxies = {
    'http': 'http://username:password@ip:port',
    'https': 'https://username:password@ip:port'
}
response = requests.get('http://www.example.com', proxies=proxies)
Here, username and password are the username and password of the Mushroom Proxy account, and ip and port are the IP address and port number of the proxy server. Note that different proxy providers may use different access methods, so the code needs to be adjusted to the actual situation.
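As a small convenience (purely illustrative), the proxy URL can be assembled from variables so that only these four values need to be edited when switching providers:

import requests

username = 'your_username'  # account credentials issued by the proxy provider
password = 'your_password'
ip = '1.2.3.4'              # proxy server IP address
port = 8080                 # proxy server port

proxy_url = f'http://{username}:{password}@{ip}:{port}'
proxies = {'http': proxy_url, 'https': proxy_url}
response = requests.get('http://www.example.com', proxies=proxies)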