[Python] Web crawler (3): Exception handling and classification of HTTP status codes

Let's start with HTTP exception handling.
When urlopen cannot handle a response, it raises a URLError.
The usual Python API exceptions, such as ValueError and TypeError, may also be raised.
HTTPError is a subclass of URLError that is raised for HTTP URLs in specific cases.

1.URLError
URLError is usually raised when there is no network connection (no route to the specified server) or the server does not exist.

In this case the exception has a "reason" attribute, which carries an error number and an error message. It is usually described as a tuple (think of it as an immutable array); in practice it may also be an exception object that wraps the same pair.

Let's create urllib2_test06.py to try out exception handling:

    import urllib2

    # Request a host that cannot be reached
    req = urllib2.Request('http://www.baibai.com')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError, e:
        # reason holds the error number and message
        print e.reason

Press F5 to run it. The output is:

[Errno 11001] getaddrinfo failed

That is, the error number is 11001 and the message is "getaddrinfo failed".
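The two parts of "reason" can be inspected without a network connection. The sketch below is only an illustration and assumes Python 3, where urllib2 was split into urllib.request and urllib.error. It builds a URLError by hand around a socket.gaierror carrying the same number and message as above; the helper name describe_reason is hypothetical.

```python
import socket
from urllib.error import URLError


def describe_reason(exc):
    """Return (error number, message) for a URLError.

    If reason is a plain string rather than an OSError, the number is None.
    """
    reason = exc.reason
    if isinstance(reason, OSError):
        return reason.errno, reason.strerror
    return None, str(reason)


# Simulate the failure above instead of performing a real DNS lookup.
simulated = URLError(socket.gaierror(11001, "getaddrinfo failed"))
number, message = describe_reason(simulated)
print(number, message)
```

Splitting the pair this way lets a crawler decide by error number, for example whether a retry is worthwhile, while still logging the message.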


2.HTTPError
Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server cannot fulfill the request. The default handlers take care of some of these responses for you.

For example, if the response is a "redirect" that asks the client to fetch the document from a different address, urllib2 handles it for you.

For responses it cannot handle, urlopen raises an HTTPError.

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

The HTTP status code indicates the status of the response returned over the HTTP protocol.

For example, when a client sends a request and the requested resource is obtained successfully, the returned status code is 200, meaning the request succeeded.

If the requested resource does not exist, a 404 error is usually returned.

HTTP status codes are three-digit integers, usually divided into five classes by their first digit, 1 through 5:

----------------------------------------------------------------------

200: Request succeeded. Handling: read the response content and process it.

201: The request completed and a new resource was created; the URI of the new resource is given in the response entity. Handling: not encountered by a crawler.

202: The request was accepted, but processing has not finished. Handling: block and wait.

204: The server fulfilled the request but has no new information to return. A user agent does not need to update its document view. Handling: discard.

300: Multiple resources are available for the request. HTTP/1.0 applications do not use this code directly; it serves only as the default interpretation of 3XX responses. Handling: process further if the program can handle it, otherwise discard.

301: The requested resource has been assigned a permanent URL and should be accessed through that URL from now on. Handling: redirect to the assigned URL.

302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.

304: The requested resource has not been modified. Handling: discard.

400: Bad request. Handling: discard.

401: Unauthorized. Handling: discard.

403: Forbidden. Handling: discard.

404: Not found. Handling: discard.

5XX: Codes starting with "5" mean the server encountered an error and cannot continue with the request. Handling: discard.

----------------------------------------------------------------------
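The handling rules listed above could be encoded as a small lookup, sketched here under some assumptions: the action labels ("process", "wait", and so on) and the function names are hypothetical, and codes the list does not mention are reported as "unknown" rather than guessed at.

```python
# Action names are illustrative labels, not part of urllib2 or HTTP.
ACTIONS = {
    200: "process",              # read the body and process it
    201: "unexpected",           # not normally seen by a crawler
    202: "wait",                 # accepted but not finished
    204: "discard",
    300: "process-if-possible",
    301: "redirect",
    302: "redirect",
    304: "discard",
}


def status_class(status):
    """Return the first digit of a three-digit status code (1 through 5)."""
    return status // 100


def crawler_action(status):
    """Map a status code to the handling suggested in the list above."""
    if status in ACTIONS:
        return ACTIONS[status]
    if status_class(status) in (4, 5):
        return "discard"         # 4XX and 5XX are all discarded
    return "unknown"             # codes the list does not cover


for code in (200, 302, 404, 503):
    print(code, crawler_action(code))
```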

The HTTPError instance that is raised has an integer 'code' attribute, which is the error number sent by the server.

Error Codes
Because the default handlers take care of redirects (numbers in the 300 range), and numbers in the 100-299 range indicate success, you will usually see only error numbers in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a very useful dictionary of response numbers, showing all the response numbers used by the HTTP protocol.
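As a rough illustration of that dictionary, the sketch below assumes Python 3, where the BaseHTTPServer module became http.server. Each entry maps a status number to a (short phrase, longer description) pair; the helper describe_status is a hypothetical name.

```python
from http.server import BaseHTTPRequestHandler


def describe_status(code):
    """Return the short phrase for a status code, or 'Unknown' if it is not listed."""
    entry = BaseHTTPRequestHandler.responses.get(code)
    return entry[0] if entry else "Unknown"


for code in (200, 404, 999):
    print(code, describe_status(code))
```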

When an error occurs, the server returns an HTTP error number and an error page.

You can use the HTTPError instance as the response object returned for the page.

This means that, in addition to the code attribute, it also has the read, geturl, and info methods.
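To see those methods side by side, here is a minimal sketch, again assuming Python 3 (urllib.error instead of urllib2). It constructs an HTTPError by hand rather than contacting a server; the URL, headers, and page body are made-up placeholders.

```python
import io
from email.message import Message
from urllib.error import HTTPError

# Placeholder headers and body standing in for a real server reply.
headers = Message()
headers["Content-Type"] = "text/html"
page = io.BytesIO(b"<h1>Not Found</h1>")

# Normally urlopen raises this for you; here it is built directly.
error = HTTPError("http://example.com/missing", 404, "Not Found", headers, page)

status = error.code                           # integer status number
body = error.read()                           # the error page sent by the server
final_url = error.geturl()                    # URL the response came from
content_type = error.info()["Content-Type"]   # response headers
error.close()

print(status, final_url, content_type, body)
```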

Let's create urllib2_test07.py to try it out:

    import urllib2

    req = urllib2.Request('http://bbs.csdn.net/callmewhy')
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        # code is the HTTP status number sent by the server
        print e.code
        # print e.read()  # uncomment to see the error page

Press F5 to run it. The output is the error code 404, which means the page was not found.


3.Wrapping

To be prepared for either HTTPError or URLError, there are two basic approaches. The second one is recommended.

Let's create urllib2_test08.py to demonstrate the first approach:

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')

    try:
        response = urlopen(req)
    except HTTPError, e:
        # The server answered, but with an error status
        print "The server couldn't fulfill the request."
        print 'Error code: ', e.code
    except URLError, e:
        # No server could be reached at all
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        print 'No exception was raised.'

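As in other languages, the exception is caught after the try block and its content is printed.

Note that except HTTPError must come first; otherwise except URLError will also catch the HTTPError.
Because HTTPError is a subclass of URLError, a URLError clause placed first catches every URLError, including HTTPError.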
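The effect of the wrong ordering can be checked locally. This sketch assumes Python 3 (urllib.error) and raises a hand-built HTTPError instead of making a request; the function name and URL are hypothetical.

```python
import io
from urllib.error import HTTPError, URLError

# HTTPError really is a subclass of URLError.
print(issubclass(HTTPError, URLError))


def first_matching_clause(exc):
    """Raise exc and report which clause catches it when URLError is listed first."""
    try:
        raise exc
    except URLError:
        return "URLError clause"
    except HTTPError:
        # Never reached: the clause above already matches every HTTPError.
        return "HTTPError clause"


http_error = HTTPError("http://example.com/missing", 404, "Not Found",
                       None, io.BytesIO(b""))
print(first_matching_clause(http_error))
```

With the clauses in this order the HTTPError branch is dead code, which is why the listing above puts except HTTPError first.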



Let's create urllib2_test09.py to demonstrate the second approach:

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')

    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'code'):
            # Only HTTPError carries a status code
            print "The server couldn't fulfill the request."
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            # Plain URLError: no server was reached
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        # everything is fine
        print 'No exception was raised.'
