How to use the Python web crawler requests library-Python Tutorial-php.cn

WBOY

May 15, 2023 am 10:34 AM

python, requests

1. What is a web crawler

Simply put, a web crawler is a program built to download, parse, and organize data from the Internet automatically.

When we browse the web, we copy and paste the content we are interested in into our notes so that we can read it again later. A web crawler does this work for us automatically.

And when a website does not allow copying and pasting, a web crawler shows its power even more.

Why we need web crawlers

When we do data analysis, the data we need is often stored in web pages, and downloading it by hand takes too long. This is when we need a web crawler to collect the data for us automatically (filtering out, of course, the parts of the page that are of no use to us).

Applications of web crawlers

Accessing and collecting network data has a very wide range of applications, many of which belong to the field of data science. Let’s take a look at the following examples:

Taobao sellers need to find useful positive and negative information in a mass of reviews, so that they can better win over customers and analyze their shopping psychology.
Some scholars crawl information from social media such as Twitter and Weibo to build data sets for predictive models that identify depression and suicidal thoughts, so that more people in need can get help. Privacy-related issues need to be considered, of course, but it's cool, isn't it?
Artificial intelligence engineers crawl the pictures that volunteers liked on Instagram (Ins) to train deep learning models that predict whether a given image will be liked by those volunteers. Mobile phone manufacturers build these models into their picture apps and push recommendations to you.
Data scientists at e-commerce platforms crawl information about the products users browse, then analyze it and make predictions, so as to push the products users most want to learn about and buy.

Yes! Web crawlers are widely used, from everyday batch downloading of high-definition wallpapers and pictures to supplying data for artificial intelligence, deep learning, and business strategy.

This is the era of data, and data is the "new oil".

2. Network transmission protocol HTTP

When it comes to web crawlers, one thing that cannot be avoided is HTTP. We don't need to understand every detail of the protocol definition the way network engineers do, but as an introduction we still need a basic understanding of it.

The International Organization for Standardization (ISO) maintains the Open Systems Interconnection (OSI) reference model, which divides computer communication into seven layers:

Physical layer: includes the Ethernet protocol, USB protocol, Bluetooth protocol, etc.
Data link layer: includes the Ethernet protocol
Network layer: includes the IP protocol
Transport layer: includes the TCP and UDP protocols
Session layer: contains protocols for opening, closing, and managing sessions
Presentation layer: contains protocols for formatting and translating data
Application layer: contains network service protocols such as HTTP and DNS

Now let's take a look at what an HTTP request and response look like (we will be defining request headers later on). A typical request message consists of the following parts:

Request line
Multiple request headers
Empty line
Optional message body

Specific request message:

GET https://www.baidu.com/?tn=80035161_1_dg HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: zh-Hans-CN,zh-Hans;q=0.8,en-GB;q=0.5,en;q=0.3
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362
Accept-Encoding: gzip, deflate, br
Host: www.baidu.com
Connection: Keep-Alive

This is the request sent when accessing Baidu. Of course, we don't need to know many of its details, because Python's requests package will build the request for us when we crawl.
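To make the four parts of a request message concrete, here is a minimal sketch that splits a raw request into its request line, headers, and body. The function name `parse_request` is hypothetical and introduced only for illustration, and the sample message is a shortened version of the Baidu request above; real HTTP parsers handle many more edge cases.

```python
def parse_request(raw):
    """Split a raw HTTP request message into its parts (illustrative only)."""
    # The empty line separates the header section from the optional body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, request target, protocol version.
    method, target, version = lines[0].split(" ", 2)
    # Every remaining line is a "Name: value" request header.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, target, version, headers, body


# Shortened copy of the request shown above; lines end with CRLF.
raw_request = (
    "GET https://www.baidu.com/?tn=80035161_1_dg HTTP/1.1\r\n"
    "Accept-Encoding: gzip, deflate, br\r\n"
    "Host: www.baidu.com\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)

method, target, version, headers, body = parse_request(raw_request)
print(method, target, version)
print(headers)
print(repr(body))  # a GET request normally has no message body
```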

We can also view the response the server returns for our request:

HTTP/1.1 200 OK // the status code 200 here means our request succeeded
Bdpagetype: 2
Cache-Control: private
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html;charset=utf-8
Date: Sun, 09 Aug 2020 02:57:00 GMT
Expires: Sun, 09 Aug 2020 02:56:59 GMT
X-Ua-Compatible: IE=Edge,chrome=1
Transfer-Encoding: chunked
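The status code on the first line of the response is the part a crawler checks most often. As a small sketch using only Python's standard library, the snippet below looks up the standard reason phrase for a few common codes and groups them by their first digit. The helper `status_class` is a hypothetical name introduced for illustration.

```python
from http import HTTPStatus


def status_class(code):
    """Name the class of an HTTP status code from its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    # HTTPStatus supplies the standard reason phrase for each code.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```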

3. Requests library (Readers who don't care for theory can skip straight to here)

Python has other libraries for handling HTTP, such as the built-in urllib and the third-party urllib3, but the requests library is easier to learn, and its code is simpler and easier to understand. Once we have successfully crawled a web page and want to extract the things we are interested in, another very useful library comes in, Beautiful Soup, but more on that later.

1. Installing the requests library

We can download the .whl file of requests and install it, or install it directly with pip (pip install requests). If you use PyCharm, you can also install it from the project's interpreter settings.
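Before moving on, it can help to confirm that the installation worked. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical and introduced only for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("requests"):
    import requests

    print("requests version:", requests.__version__)
else:
    print("requests is missing; install it with: pip install requests")
```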

2. Hands-on practice

Now let's start crawling a web page for real.

The code is as follows:

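import requests  # load the requests library
target = 'https://www.baidu.com/'  # address of the page to crawl
get_url = requests.get(url=target)  # send a GET request and keep the response
print(get_url.status_code)  # HTTP status code of the response
print(get_url.text)  # response body decoded as text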

Output results

200 // status code 200 means the request succeeded
<!DOCTYPE html>// much of the content has been removed here; the actual output is far longer
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;
charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css 
src=//www.baidu.com/img/gs.gif> 
</p> </div> </div> </div> </body> </html>

These five lines of code do a lot: with them we can already crawl the entire HTML content of a web page.

The first line of code: loads the requests library.
The second line of code: gives the address of the website to crawl.
The third line of code: sends the request. The general format for making a request with requests is as follows:

response = requests.get(url='the address of the website you want to crawl')

The fourth line of code: prints the status code of the response.
The fifth line of code: prints the body of the response.
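In practice a request can fail or hang, so a slightly more defensive variant is often useful. The sketch below is one possible way to wrap the same `requests.get` call; the function name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        # The timeout keeps the crawler from waiting forever on a slow server.
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx/5xx responses instead of checking by hand.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and bad statuses.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html('https://www.baidu.com/')` would then return the same text as `get_url.text` above when the request succeeds, and `None` otherwise.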

Of course, we can also print more information:

import requests

target = 'https://www.baidu.com/'
get_url = requests.get(url=target)
# print(get_url.status_code)
# print(get_url.text)
print(get_url.reason)  # reason phrase that goes with the status code
print(get_url.headers)  # headers of the server's HTTP response (similar to those shown above)
print(get_url.request)  # the request object that was sent
print(get_url.request.headers)  # headers of the request we sent

OK
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 
'Connection': 'keep-alive', 
'Content-Encoding': 'gzip', 
'Content-Type': 'text/html', 
'Date': 'Sun, 09 Aug 2020 04:14:22 GMT',
'Last-Modified': 'Mon, 23 Jan 2017 13:23:55 GMT', 
'Pragma': 'no-cache', 
'Server': 'bfe/1.0.8.18', 
'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 
'Transfer-Encoding': 'chunked'}
<PreparedRequest [GET]>
{'User-Agent': 'python-requests/2.22.0', 
'Accept-Encoding': 'gzip, deflate', 
'Accept': '*/*', 
'Connection': 'keep-alive'}
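The request headers printed above are the defaults that requests fills in. As a hedged sketch of how custom request headers could be defined, the snippet below builds a request with `requests.Request(...).prepare()` so the result can be inspected without sending anything over the network. The header values are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical header values, chosen for illustration only.
custom_headers = {
    "User-Agent": "my-crawler/0.1",
    "Accept-Language": "en",
}

# Build the request without sending it, so the headers can be inspected offline.
request = requests.Request("GET", "https://www.baidu.com/", headers=custom_headers)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(dict(prepared.headers))
```

To actually send a request with these headers, the same dictionary can be passed as `requests.get(url=target, headers=custom_headers)`.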

Statement

This article is reproduced from 亿速云 (Yisu Cloud). If there is any infringement, please contact admin@php.cn to have it removed.