Hello everyone. I have been learning Python recently, and along the way I have run into some problems and picked up some experience, so I will organize my notes here systematically. If you are interested in learning crawlers, these articles may serve as a reference, and everyone is welcome to share their own learning experiences.
Python version: 2.7; if you use Python 3, please look for other blog posts.
First of all, what is a crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.
In my experience, to learn Python crawling we need to cover the following points:
Basic knowledge of Python
Usage of urllib and urllib2 libraries in Python
Python regular expressions
Python crawler framework Scrapy
More advanced Python crawler features
1. Python basic learning
First of all, if you want to use Python to write a crawler, you must understand the basics of Python; a tall building rises from its foundation. Haha, so let me share some Python tutorials I have gone through, which friends can use as a reference.
1) MOOC.com Python Tutorial
I read through some basic syntax on MOOC.com; exercises are attached so you can practice after each lesson, and I felt the results were quite good. The only slight pity is that the content covers just the basics, but if you want to get started, this is the place.
Learning URL: MOOC.com Python Tutorial
2) Liao Xuefeng Python Tutorial
Later I discovered Teacher Liao's Python tutorial, which is very easy to understand and feels very well written. If you want to learn more about Python, take a look at this one.
Learning URL: Liao Xuefeng Python Tutorial
3) Concise Python Tutorial
Another one I have read is the Concise Python Tutorial, which I also think is pretty good.
Learning URL: Concise Python Tutorial
4) Wang Hai’s Laboratory
This is the blog of a senior from my undergraduate laboratory. I referred to his articles when I started and re-summarized them myself; later, this series of articles added some content on top of his work.
Learning URL: Wang Hai’s Laboratory
2. Usage of Python urllib and urllib2 libraries
The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers. With these libraries, we can fetch the content of a web page and then use regular expressions to extract and analyze that content to get the results we want. I will share more about this during the learning process.
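As a first taste, here is a minimal sketch (Python 2.7, matching the version used in this series) of fetching a page with urllib2; the URL is just a placeholder for illustration.

```python
# -*- coding: utf-8 -*-
# Minimal page fetch with urllib2 (Python 2.7); the URL is a placeholder.
import urllib2

url = 'http://www.example.com'           # hypothetical target page
request = urllib2.Request(url)           # build the request object
response = urllib2.urlopen(request)      # send the request, get the response
html = response.read()                   # raw page content as a string
print html[:200]                         # peek at the first 200 characters
```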
3. Python regular expressions
Regular expressions in Python are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match"; otherwise, the string is invalid. This will be covered in a later blog post.
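For a quick flavor, here is a small sketch that pulls links out of a piece of HTML with the re module; the pattern and sample string are illustrative only, and real pages usually call for more robust handling.

```python
# -*- coding: utf-8 -*-
# Extracting links from an HTML snippet with a regular expression.
# The sample string and pattern are illustrative only.
import re

html = '<a href="http://example.com/page1">Page 1</a><a href="http://example.com/page2">Page 2</a>'
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')   # non-greedy capture groups
for link, text in pattern.findall(html):
    print link, '->', text
```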
4. Crawler framework Scrapy
If you are already comfortable with Python and have mastered the basic crawler knowledge, it is time to look at a Python framework. The framework I chose is Scrapy. What powerful features does this framework offer? The following is from its official introduction (a minimal spider sketch follows the feature list):
Built-in support for selecting and extracting data from HTML and XML sources
Provides a series of reusable filters (i.e. Item Loaders) shared between spiders, with built-in support for intelligently processing crawled data.
Provides built-in support for exporting in multiple formats (JSON, CSV, XML) to multiple storage backends (FTP, S3, local filesystem) through feed exports
Provides a media pipeline that can automatically download images (or other resources) found in the crawled data.
Highly extensible. You can customize and implement your own functionality using signals and well-designed APIs (middlewares, extensions, pipelines).
Built-in middleware and extensions provide support for the following functions:
cookies and session handling
HTTP compression
HTTP authentication
HTTP caching
user-agent simulation
robots.txt
Crawling depth limit
Provides automatic detection and robust encoding support for non-English, non-standard, or incorrect encoding declarations.
Supports generating spiders from templates, keeping your code more consistent across large projects while speeding up spider creation. See the genspider command for details.
Provides an extensible stats collection facility for performance evaluation and failure detection across multiple spiders.
Provides an interactive shell console, which is very convenient for testing XPath expressions and writing and debugging spiders.
Provides a system service, simplifying deployment and operation in production environments.
Built-in web service, allowing you to monitor and control your crawler.
Built-in Telnet console that hooks into a Python console inside the Scrapy process, so you can inspect and debug the crawler.
Logging makes it convenient to catch errors during the crawling process.
Supports Sitemaps crawling
DNS resolver with cache
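To give a concrete picture of what working with Scrapy looks like, here is a minimal spider sketch; the site, spider name, and CSS selectors are hypothetical and not taken from the official introduction above.

```python
# -*- coding: utf-8 -*-
# A minimal Scrapy spider sketch; the site, spider name and selectors are
# placeholders for illustration only.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract fields with Scrapy's built-in CSS selectors
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the "next page" link so Scrapy schedules the next request
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Running it with something like scrapy crawl quotes -o quotes.json would use the feed export mentioned above to save the items as JSON, and the genspider command can generate a spider skeleton like this one from a template.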