迷茫 2017-04-18 10:00:30
Thank God I just found the solution
The original text is here: A summary of some techniques for using Python crawlers to crawl websites - Python - Bole Online, http://python.jobbole.com/81997/
import httplib
import urlparse

def request(url, cookie='xxx', retries=5):
    # Parse the input URL and open a connection to its host
    ret = urlparse.urlparse(url)
    if ret.scheme == 'http':
        conn = httplib.HTTPConnection(ret.netloc)
    elif ret.scheme == 'https':
        conn = httplib.HTTPSConnection(ret.netloc)
    else:
        return None
    # Rebuild the request path separately so the original url is kept for retries
    path = ret.path
    if ret.query: path += '?' + ret.query
    if ret.fragment: path += '#' + ret.fragment
    if not path: path = '/'
    try:
        conn.request(method='GET', url=path, headers={'Cookie': cookie})
        res = conn.getresponse()
    except Exception as e:
        print e
        if retries > 0:
            # Retry with the full original URL and one attempt fewer
            return request(url=url, cookie=cookie, retries=retries - 1)
        print 'GET Failed'
        return ''
    if res.status != 200:
        return None
    return res.read()
The idea is to use a retries parameter to track the remaining number of retries: every time an exception is caught, the function calls itself recursively with the retry count decremented by 1. Once no retries remain, it stops, prints a failure log, and returns directly.
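For context, a call might look like this (the URL and cookie value are just placeholders):

# Hypothetical usage: fetch a page, sending a session cookie and
# allowing up to 5 retries when an exception is raised
html = request('http://example.com/', cookie='session=abc123', retries=5)
if html:
    print 'fetched %d bytes' % len(html)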
大家讲道理 2017-04-18 10:00:30
Having the function call itself recursively and using a retry count as the limit is the most direct approach.
But there is a problem:
If the remote address is only failing temporarily, for example because the service is being restarted, retrying immediately will still fail. Five immediate retries take very little time, so by the time the remote service is ready again, the request has already used up its 5 retries and been given up on.
The mechanism I use is to retry five times, waiting 30 seconds, 1 minute, 10 minutes, 30 minutes, and 1 hour before each attempt; if it still fails after that, it is treated as a failure.
Of course, this choice depends on the specific business logic; different businesses have different requirements for their requests.
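For what it's worth, here is a minimal sketch of that escalating-wait strategy, assuming the request() function from the code above; the request_with_backoff name and the exact wait schedule are illustrative, not part of the original post.

import time

# Illustrative wait schedule in seconds: 30s, 1min, 10min, 30min, 1h
WAIT_SCHEDULE = [30, 60, 600, 1800, 3600]

def request_with_backoff(url, cookie='xxx'):
    # First attempt, then one retry per entry in the schedule,
    # sleeping before each retry; an empty or None result counts as a failure
    result = request(url, cookie=cookie, retries=0)
    if result:
        return result
    for wait in WAIT_SCHEDULE:
        time.sleep(wait)
        result = request(url, cookie=cookie, retries=0)
        if result:
            return result
    # All retries exhausted: give up and treat the request as failed
    return None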