Home  >  Article  >  Backend Development  >  Why choose to use python as a crawler?

Why choose to use python as a crawler?

尚
Original
2019-07-06 14:43:113669browse

Why choose to use python as a crawler?

What is a web crawler?

A web crawler is a program that automatically extracts web pages. It downloads web pages from the World Wide Web for search engines and is an important component of search engines. The traditional crawler starts from the URL of one or several initial web pages and obtains the URL on the initial web page. During the process of crawling the web page, it continuously extracts new URLs from the current page and puts them into the queue until certain stopping conditions of the system are met

What is the use of crawlers?

As a general search engine web page collector. (google, baidu) is a vertical search engine. Scientific research: online human behavior, online community evolution, human dynamics research, econometric sociology, complex networks, data mining, and other fields require a large amount of data. Web crawlers are A powerful tool for collecting relevant data. Peeping, hacking, sending spam...

Crawler is the first and easiest step for search engines

Why choose to use python as a crawler?

What language should you use to write a crawler?

C,C. Highly efficient and fast, suitable for general search engines to crawl the entire web. Disadvantages: development is slow and writing is stinky and long, for example: Skynet search source code. Scripting language: Perl, Python, Java, Ruby. Simple, easy to learn, and good text processing can facilitate the detailed extraction of web content, but the efficiency is often not high. Is C# suitable for focused crawling of a small number of websites? (Seems like a language that people in information management prefer)

Reasons for choosing Python as a crawler:

Cross-platform, with good support for Linux and windows.

Scientific computing, numerical fitting: Numpy, Scipy

Visualization: 2d: Matplotlib (the drawing is very beautiful), 3d: Mayavi2

Complex network: Networkx statistics: and R language interface: Rpy

Interactive terminal

Rapid development of website

A simple Python crawler

import urllib
import urllib.request

def loadPage(url,filename):
    """
    作用:根据url发送请求,获取html数据;
    :param url:
    :return:
    """
    request=urllib.request.Request(url)
    html1= urllib.request.urlopen(request).read()
    return  html1.decode('utf-8')

def writePage(html,filename):
    """
    作用将html写入本地

    :param html: 服务器相应的文件内容
    :return:
    """
    with open(filename,'w') as f:
        f.write(html)
    print('-'*30)
def tiebaSpider(url,beginPage,endPage):
    """
    作用贴吧爬虫调度器,负责处理每一个页面url;
    :param url:
    :param beginPage:
    :param endPage:
    :return:
    """
    for page in range(beginPage,endPage+1):
        pn=(page - 1)*50
        fullurl=url+"&pn="+str(pn)
        print(fullurl)
        filename='第'+str(page)+'页.html'
        html= loadPage(url,filename)

        writePage(html,filename)



if __name__=="__main__":
    kw=input('请输入你要需要爬取的贴吧名:')
    beginPage=int(input('请输入起始页'))
    endPage=int(input('请输入结束页'))
    url='https://tieba.baidu.com/f?'
    kw1={'kw':kw}
    key = urllib.parse.urlencode(kw1)
    fullurl=url+key
    tiebaSpider(fullurl,beginPage,endPage)

For more Python related technical articles, please Visit the Python Tutorial column to learn!

The above is the detailed content of Why choose to use python as a crawler?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn