
What modules do Python crawlers need to call?

尚
Original
2019-07-11 09:13:04


Commonly used Python crawler modules:


Python standard library – urllib module

Function: opens URLs and handles the HTTP protocol

Note: in Python 3.x, the urllib and urllib2 libraries from Python 2 were merged into a single urllib package. urllib2.urlopen() became urllib.request.urlopen(), and urllib2.Request() became urllib.request.Request().
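For reference, a minimal sketch of the renamed calls under Python 3 (the www.example.com URL and the User-Agent header are only placeholders, not part of the original article):

import urllib.request

# Python 2: urllib2.Request(...) and urllib2.urlopen(...)
# Python 3: urllib.request.Request(...) and urllib.request.urlopen(...)
req = urllib.request.Request('http://www.example.com',
                             headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=3) as response:
    html = response.read().decode('utf-8')
    print(len(html))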

urllib.request fetches and returns web pages.

urllib.request.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

urllib.request.urlopen can open URLs using the HTTP (primary), HTTPS, and FTP protocols.

url: the network address to request (the full form needs a protocol name at the front and may include a port at the end, e.g. http://192.168.1.1:80)

data: data to submit with the request; supplying it switches the request to POST

timeout: timeout setting, in seconds

cafile, capath, cadefault, context: CA certificate options for HTTPS verification

The returned response object has three additional methods:

geturl() returns the URL of the response (commonly used to detect URL redirection)

info() returns the basic information (headers) of the response

getcode() returns the HTTP status code of the response

Example:

#coding:utf-8
import os
import platform
import time
import urllib.error
import urllib.request


# Clear-screen helper (optional, the crawler works without it)
def clear():
    print(u"Too much output, clearing the screen in 3 seconds")
    time.sleep(3)
    OS = platform.system()
    if OS == u'Windows':
        os.system('cls')
    else:
        os.system('clear')

# Fetch function
def linkbaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
    except urllib.error.URLError:
        print(u'Invalid network address')
        exit()
    with open('/home/ifeng/PycharmProjects/pachong/study/baidu.txt', 'w') as fp:
        # read() returns bytes, so decode before writing to a text-mode file
        fp.write(response.read().decode('utf-8'))
    print(u'URL of the response, response.geturl()\n:%s' % response.geturl())
    print(u'Return code, response.getcode()\n:%s' % response.getcode())
    print(u'Response info, response.info()\n:%s' % response.info())
    print(u"The fetched page has been saved to baidu.txt")


if __name__ == '__main__':
    linkbaidu()
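The data parameter described above switches urlopen to a POST request. A minimal sketch, assuming http://httpbin.org/post as a throwaway test endpoint (not part of the original article):

import urllib.parse
import urllib.request

# urlencode the form fields, then encode to bytes; passing data makes this a POST
data = urllib.parse.urlencode({'keyword': 'python'}).encode('utf-8')
with urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=3) as response:
    print(response.getcode())
    print(response.read().decode('utf-8'))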

Python standard library – logging module

The logging module can take over the role of the print function: it saves output to a log file instead of standard output, and using logging can partially replace step-by-step debugging.
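A minimal sketch of replacing print with logging (the crawler.log file name is just an example):

import logging

# Send INFO-level and above messages to a log file instead of the console
logging.basicConfig(filename='crawler.log',
                    level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info('Fetching %s', 'http://www.baidu.com')
logging.warning('Request timed out, retrying')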

re module

Regular expressions, commonly used to extract data from fetched pages
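A minimal sketch of pulling links out of a downloaded page with re (the pattern below is a rough simplification, not a real HTML parser):

import re
import urllib.request

with urllib.request.urlopen('http://www.baidu.com', timeout=3) as response:
    html = response.read().decode('utf-8')

# Grab the targets of href="http..." attributes; crude but enough for a quick crawl
links = re.findall(r'href="(http[^"]+)"', html)
print(links[:10])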

sys module

System-related module

sys.argv (returns a list of the command-line arguments)

sys.exit (exits the program)
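A minimal sketch of using sys.argv and sys.exit in a crawler entry script (the crawler.py name and URL argument are hypothetical):

import sys

# Expect the target URL as the first command-line argument
if len(sys.argv) < 2:
    print('Usage: python crawler.py <url>')
    sys.exit(1)

url = sys.argv[1]
print('Crawling %s' % url)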

Scrapy framework

Combining urllib and re by hand is considered outdated; the current mainstream choice is the Scrapy framework.
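A minimal Scrapy spider sketch, assuming Scrapy is installed; the spider name and the quotes.toscrape.com practice site are only illustrative:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # CSS selectors replace hand-written regular expressions
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

Save it as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json.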


