What modules do Python crawlers need to call?
Commonly used Python crawler modules:
Python standard library – urllib module
Function: opens URLs over HTTP and related protocols
Note: In Python 3.x, the urllib and urllib2 libraries of Python 2 were merged into the urllib package. urllib2.urlopen() became urllib.request.urlopen(), and urllib2.Request() became urllib.request.Request().
urllib.request.urlopen requests a URL and returns the web page
urllib.request.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
urllib.request.urlopen can open URLs that use the HTTP (the main use), HTTPS, and FTP protocols
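cafile, capath, cadefault: CA certificate settings used for HTTPS authentication
data: data to submit to the URL; when it is given, the request is sent in POST mode
url: the network address to request (it must be complete: the protocol name at the front and, where needed, the port at the end, e.g. http://192.168.1.1:80)
timeout: timeout setting, in seconds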
The object returned by the function has three additional methods:
geturl() returns the URL of the response; commonly used with URL redirection
info() returns the basic information of the response (its headers)
getcode() returns the response status code
Example:
# coding: utf-8
import os
import platform
import sys
import time
import urllib.error
import urllib.request


# Clear-screen helper (not essential; can be omitted)
def clear():
    print('Too much output; clearing the screen in 3 seconds')
    time.sleep(3)
    if platform.system() == 'Windows':
        os.system('cls')
    else:
        os.system('clear')


# Fetch the page, save it, and report on the response
def linkbaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
    except urllib.error.URLError:
        print('Network address error')
        sys.exit(1)
    # read() returns bytes, so the file is opened in binary mode
    with open('/home/ifeng/PycharmProjects/pachong/study/baidu.txt', 'wb') as fp:
        fp.write(response.read())
    print('URL, response.geturl():\n%s' % response.geturl())
    print('Status code, response.getcode():\n%s' % response.getcode())
    print('Response information, response.info():\n%s' % response.info())
    print('The fetched page has been saved to baidu.txt')


if __name__ == '__main__':
    linkbaidu()
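The data parameter described above can be sketched as follows. This is a minimal illustration, not part of the original example: the helper name build_post_request, the /login path, and the form fields are assumptions made up for the sketch. It only builds the request, so it runs without a network connection.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Return a Request that will send `fields` as a form-encoded POST body."""
    # urlencode builds "key=value&key=value"; data must be bytes, not str
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


# Hypothetical endpoint and form fields, for illustration only
req = build_post_request("http://192.168.1.1:80/login", {"user": "alice", "page": 1})
print(req.get_method())  # the method switches to POST because data is set
print(req.data)

# Sending it would look like this (needs a reachable server):
# with urllib.request.urlopen(req, timeout=3) as response:
#     print(response.getcode())
```

Passing the Request object to urllib.request.urlopen in place of a plain URL string sends the body; without data the same call would be an ordinary GET.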
Python standard library – logging module
The logging module can take over the role of the print function and write the output to a log file where it is saved. Using the logging module can also partly replace a debugger.
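As a rough sketch of writing to a log file instead of calling print, something like the following could be used. The logger name crawler_demo, the helper make_logger, the file name crawler.log, and the message format are all assumptions chosen for illustration.

```python
import logging
import os
import tempfile


def make_logger(log_path):
    """Return a logger that appends messages to the file at log_path."""
    logger = logging.getLogger("crawler_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# A temporary directory keeps the sketch from touching real project files
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
logger = make_logger(log_path)
logger.debug("requesting %s", "http://www.baidu.com")  # instead of print()
logger.warning("request timed out after %d seconds", 3)

# Close the handlers so the file is complete before reading it back
for handler in list(logger.handlers):
    handler.close()
    logger.removeHandler(handler)

with open(log_path, encoding="utf-8") as fp:
    log_text = fp.read()
print(log_text)
```

Each line in the file carries a timestamp and a level, which is what makes the log usable for after-the-fact debugging in a way bare print output is not.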
re module
Regular expressions, for matching and extracting text from a fetched page
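A small sketch of how re might be applied to downloaded page text. The HTML fragment and the pattern are assumptions made up for illustration; a simple pattern like this only handles double-quoted href attributes.

```python
import re

# Hypothetical page fragment standing in for a downloaded page
html = '''
<a href="http://www.baidu.com/news">News</a>
<a href="http://www.baidu.com/map">Map</a>
<a class="nolink">No address</a>
'''

# Capture whatever sits between the quotes of each href attribute
link_pattern = re.compile(r'href="([^"]+)"')
links = link_pattern.findall(html)
print(links)
```

Because the pattern has one capture group, findall returns only the captured addresses rather than the whole match.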
sys module
System-related functions
sys.argv (a list containing all the command-line arguments)
sys.exit (exits the program)
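One way these two might be combined in a crawler script is sketched below. The helper name parse_url_argument and the script name crawler.py are assumptions for illustration; in a real script the list passed in would be sys.argv.

```python
import sys


def parse_url_argument(argv):
    """Return the URL given as the first command-line argument.

    argv[0] is the script name, so the first real argument is argv[1].
    """
    if len(argv) < 2:
        print("usage: %s URL" % argv[0], file=sys.stderr)
        sys.exit(2)  # a non-zero exit status signals failure
    return argv[1]


# Stand-in for sys.argv after running: python crawler.py http://www.baidu.com
url = parse_url_argument(["crawler.py", "http://www.baidu.com"])
print("would fetch:", url)
```

sys.exit works by raising SystemExit, so code that needs to clean up before the program stops can still do so in a finally block.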
Scrapy framework
Combining urllib and re by hand is a dated approach. The mainstream choice now is the Scrapy framework.