What does crawler python mean?

藏色散人
Original, 2019-06-25 10:13:53

A crawler, also known as a web crawler, is a script or program that collects data from the Internet. The data it gathers is the basis for data analysis and data mining.

In other words, a crawler fetches the information that is useful to us from a given URL (website). Code makes it possible to acquire large volumes of data, and later sorting and calculation over that data can reveal patterns, industry trends, and similar information.

A Python crawler architecture consists mainly of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that was crawled).

Scheduler:

Comparable to a computer's CPU, the scheduler is mainly responsible for coordinating the URL manager, the downloader, and the parser.
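To make the coordination role concrete, here is a minimal sketch of a scheduler loop. It is only an illustration: the function name crawl, the max_pages limit, and the idea of passing the downloader and parser in as plain functions are assumptions made for this example, not part of any particular framework.

```python
from collections import deque


def crawl(seed_url, download, parse, max_pages=10):
    """Coordinate the crawl: take a URL, download it, parse it, queue new links.

    `download(url)` is assumed to return the page (or None on failure) and
    `parse(url, page)` is assumed to return a (links, data) pair.
    """
    to_crawl = deque([seed_url])   # URLs waiting to be crawled
    seen = {seed_url}              # every URL ever queued, to avoid loops
    results = []

    while to_crawl and len(results) < max_pages:
        url = to_crawl.popleft()
        page = download(url)
        if page is None:           # download failed, move on
            continue
        links, data = parse(url, page)
        results.append((url, data))
        for link in links:
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return results


# A tiny in-memory "site" stands in for the network in this sketch.
fake_site = {
    "a": (["b", "c"], "A"),
    "b": (["a"], "B"),
    "c": ([], "C"),
}
crawled = crawl("a", fake_site.get, lambda url, page: page)
```

Because the downloader and parser are passed in, the same loop can be tried against fake data first and a real downloader later.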

URL manager:

Holds the URLs waiting to be crawled and the URLs that have already been crawled, which prevents repeated crawling and crawl loops. A URL manager is usually implemented in one of three ways: in memory, in a database, or in a cache database.
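The in-memory variant can be sketched with a queue and a set. The class name UrlManager and its method names are hypothetical, chosen for this illustration; a database or cache-backed version would expose the same operations over persistent storage.

```python
from collections import deque


class UrlManager:
    """In-memory URL manager: tracks pending URLs and every URL seen so far."""

    def __init__(self):
        self._pending = deque()  # URLs waiting to be crawled, in arrival order
        self._known = set()      # pending + already crawled URLs

    def add(self, url):
        # Ignore empty values and anything queued or crawled before.
        if url and url not in self._known:
            self._known.add(url)
            self._pending.append(url)

    def add_many(self, urls):
        for url in urls:
            self.add(url)

    def has_pending(self):
        return bool(self._pending)

    def next_url(self):
        # The URL stays in _known, so it can never be queued again.
        return self._pending.popleft()


manager = UrlManager()
manager.add_many(["http://example.com/", "http://example.com/about", "http://example.com/"])
first = manager.next_url()
manager.add(first)  # already crawled, so this is ignored
```

Keeping crawled URLs in the same set as pending ones is what prevents both duplicate downloads and loops between pages that link to each other.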

Web page downloader:

Downloads a web page from a given URL and converts the page into a string. Downloader options include urllib2 (a Python 2 standard library module; in Python 3 its functionality lives in urllib.request), which can handle login, proxies, and cookies, and requests (a third-party package).
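A minimal downloader sketch using only the Python 3 standard library (urllib.request) might look like the following. The function name download, the default timeout, and the User-Agent string are assumptions for this example; login, proxy, and cookie handling are left out.

```python
import urllib.request


def download(url, timeout=10.0, user_agent="example-crawler/0.1"):
    """Fetch `url` and return its body as a string, or None on failure."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return None
```

Calling download("http://example.com/") would return the page's HTML as a string if the request succeeds. Returning None on failure lets the scheduler skip a bad URL instead of stopping the whole crawl.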

Web page parser:

Parses the web page string. It can extract the useful information according to our requirements, or it can work from the page's DOM tree. Parser options include regular expressions (the intuitive approach: treat the page as a string and extract valuable information by fuzzy matching, which becomes very difficult when the document is complex), html.parser (included with Python), BeautifulSoup (a third-party library that can parse with Python's built-in html.parser or with lxml, and is more powerful than the other options), and lxml (a third-party library that can parse both XML and HTML). Unlike regular expressions, html.parser, BeautifulSoup, and lxml all follow the page's tag structure: BeautifulSoup and lxml build a DOM-style tree, while html.parser reports tags to your code as events.
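As a small illustration of the built-in option, the sketch below uses html.parser to pull the page title and all link targets out of an HTML string. The class name LinkAndTitleParser and the helper parse are hypothetical names for this example; BeautifulSoup or lxml would achieve the same result with less code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collect absolute link URLs and the <title> text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(base_url, html):
    """Return (links, title) for the given HTML string."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.title.strip()


sample_html = (
    "<html><head><title> Demo </title></head><body>"
    '<a href="/a">A</a><a href="http://other.example/b">B</a><a>no href</a>'
    "</body></html>"
)
links, title = parse("http://example.com/index.html", sample_html)
```

The links go back to the URL manager and the title is the extracted data, which matches the division of work described above.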

Application:

The application is built from the useful data extracted from the web pages.
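What the application does with the data depends entirely on the project. As one simple possibility, the sketch below writes extracted records to CSV so they can be analyzed later. The record fields url and title and the function name write_records are assumptions for this example.

```python
import csv
import io


def write_records(records, out_file):
    """Write crawled records (dicts with 'url' and 'title') as CSV rows."""
    writer = csv.DictWriter(out_file, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(records)


records = [
    {"url": "http://example.com/", "title": "Home"},
    {"url": "http://example.com/about", "title": "About"},
]

# An in-memory buffer is used here; a real run would pass an open file,
# e.g. open("pages.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_records(records, buffer)
```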
