How to use Python to make a crawler-Python Tutorial-php.cn

How to use Python to make a crawler

高洛峰

Nov 23, 2016 pm 01:23 PM

python

"Getting started" is a good motivation, but it can be slow. If you have a project in hand or in mind, then in practice you will be driven by the goal instead of learning slowly, module by module.

In addition, if each knowledge point in a body of knowledge is a node in a graph and each dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B. Therefore, you do not need to learn how to "get started", because such a "getting started" point does not exist! What you need to learn is how to build something larger; in the process, you will quickly learn what you need to learn. Of course, you can argue that you need to know Python first, otherwise how could you use Python to make a crawler? But in fact, you can learn Python in the process of making this crawler :D

Many previous answers talk about the "technique": what software to use and how to crawl. Let me talk about both the "Tao" and the "technique": how a crawler works and how to implement it in Python.

Let's summarize briefly. You need to learn:

the basic working principles of a crawler

a basic HTTP scraping tool, scrapy

Bloom Filter: Bloom Filters by Example

If you need to crawl web pages on a large scale, you need to learn the concept of distributed crawlers. It is not that mysterious: you only need to learn how to maintain a distributed queue that all machines in the cluster can share. The simplest implementation is python-rq: https://github.com/nvie/rq

The combination of rq and Scrapy: darkrho/scrapy-redis · GitHub

Follow-up processing: web page content extraction (grangier/python-goose · GitHub), storage (MongoDB)

The following is the longer version of that summary:

Let me describe the experience of writing a cluster that crawled the whole of Douban.

1) First, you have to understand how crawlers work.
Imagine you are a spider, and you have been put onto the Internet. So, what should you do? No problem, just start somewhere, for example, the home page of the People's Daily. This is called the initial page, represented by $.

On the home page of the People's Daily, you see various links leading off that page, so you happily crawl to the "Domestic News" page. Great, now you have finished crawling two pages (the home page and Domestic News)! For now, don't worry about how to process the pages you have crawled. Just imagine that you copied each page in full as HTML and carry it with you.

Suddenly you find that the Domestic News page has a link back to the home page. You know you don't have to crawl back, because you have already seen it. So you need to use your brain to remember the addresses of the pages you have viewed. That way, every time you see a new link that may need to be crawled, you first check in your mind whether you have already visited that address. If you have, don't go.

Okay, in theory, if all pages can be reached from the initial page, then it can be proved that you can definitely crawl all web pages.

So how do you implement it in Python?
Very simple:

import queue

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until there is nothing left to crawl
    if url_queue.qsize() > 0:
        current_url = url_queue.get()  # take the first URL in the queue
        store(current_url)  # store the web page this URL points to
        for next_url in extract_urls(current_url):  # follow the links found on that page
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break

This is already pretty much pseudocode: store() and extract_urls() are left undefined.
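To make the loop above concrete, here is a minimal runnable sketch of the same breadth-first idea. It is not the author's code: the PAGES dictionary is a hypothetical stand-in for the network (so nothing is actually downloaded), and extract_urls is one possible implementation built on the standard library's html.parser and urllib.parse.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the web: URL -> HTML body.
PAGES = {
    "http://example.test/": '<a href="/news">News</a> <a href="/about">About</a>',
    "http://example.test/news": '<a href="/">Home</a> <a href="/news/1">Story</a>',
    "http://example.test/about": '<a href="/">Home</a>',
    "http://example.test/news/1": '<a href="/news">Back</a>',
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return absolute URLs for all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(initial_page, fetch):
    """Breadth-first crawl; returns {url: html} for every reachable page."""
    stored = {}
    seen = {initial_page}
    url_queue = deque([initial_page])
    while url_queue:
        current_url = url_queue.popleft()
        html = fetch(current_url)
        if html is None:  # page could not be fetched; skip it
            continue
        stored[current_url] = html
        for next_url in extract_urls(current_url, html):
            if next_url not in seen:
                seen.add(next_url)
                url_queue.append(next_url)
    return stored


result = crawl("http://example.test/", PAGES.get)
print(sorted(result))
```

Passing the fetch function in as a parameter keeps the crawl logic separate from the download step, so a real HTTP client could be swapped in for PAGES.get later.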

The backbone of all crawlers is here. Let’s analyze why crawlers are actually a very complicated thing - search engine companies usually have a whole team to maintain and develop them.

2) Efficiency
If you tidy up the above code and run it as is, it will take you a whole year to crawl all of Douban's content. Not to mention that search engines like Google need to crawl the entire web.

What's the problem? There are too many web pages to crawl, and the above code is too slow. Suppose there are N web pages in the entire network; then the complexity of the duplicate check is N*log(N), because every page has to be traversed once, and each check against the set costs log(N). OK, OK, I know that Python's set is implemented as a hash table, so a lookup is O(1) on average rather than log(N), but this is still too slow, or at least not memory-efficient.

What is the usual way to detect duplicates? A Bloom Filter. Simply put, it is still a hashing method, but its defining feature is that it uses fixed memory (which does not grow with the number of URLs) to decide in O(1) whether a URL is already in the set. Unfortunately, there is no such thing as a free lunch. The catch is this: if the BF reports that a URL is not in the set, it is 100% certain that the URL has not been seen. But if it reports that the URL is in the set, what it is really telling you is: this URL should have appeared already, but I have 2% uncertainty. Note that the uncertainty can become very small when the memory you allocate is large enough. A simple tutorial: Bloom Filters by Example

Notice what this means. If a URL has already been viewed, it will never be viewed again, because the filter never forgets a URL it has recorded. But if a URL has not been viewed, there is a small probability that the filter reports it as already seen and the page gets skipped, so a few pages may be missed. Allocating more memory makes this less likely.
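A minimal Bloom filter can be sketched in a few lines to show where the fixed memory and the one-sided error come from. This is an illustration only, not a tuned implementation: the class name, the default sizes (8192 bits, 4 hashes), and the choice of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter backed by a bytearray."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function k times.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(b"%d:" % i + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every bit for this item is set (may be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
bf.add("http://example.test/")
print("http://example.test/" in bf)  # True: an added URL is always reported
print(len(bf.bits))                  # 1024 bytes, no matter how many URLs are added
```

Because add only ever sets bits and never clears them, a URL that was added is always found again. A false positive happens when other URLs happen to have set all of the bits that a new URL maps to.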

Okay, now we are close to the fastest way to handle the duplicate check. Next bottleneck: you only have one machine. No matter how large your bandwidth is, as long as the speed at which your machine downloads web pages is the bottleneck, all you can do is speed that up. If one machine isn't enough, use many! Of course, we assume that each machine is already running at maximum efficiency, using multi-threading (for Python, multi-process).
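As a rough illustration of overlapping downloads on a single machine, the sketch below uses a thread pool from the standard library's concurrent.futures. The fetch function is hypothetical: it only sleeps to simulate network latency, so no real requests are made. ProcessPoolExecutor offers the same executor interface if you want the multi-process variant mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical downloader: sleeps to simulate network latency."""
    time.sleep(0.05)
    return url, "<html>%s</html>" % url


urls = ["http://example.test/page/%d" % i for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order even though the fetches overlap in time
    pages = dict(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(len(pages), "pages fetched in %.2f seconds" % elapsed)
```

With ten workers the twenty simulated downloads overlap instead of running one after another, which is the whole point when waiting on the network dominates the running time.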

3) Cluster crawling
When crawling Douban, I used a total of more than 100 machines to run around the clock for a month. Imagine if you only use one machine, you will have to run it for 100 months...

So, assuming you have 100 machines available now, how to use python to implement a distributed crawling algorithm?

We call the 99 machines with less computing power slaves, and the one larger machine the master. Looking back at url_queue in the code above: if we put this queue on the master, all slaves can communicate with the master over the network. Whenever a slave finishes downloading a web page, it requests a new page to crawl from the master. And every time a slave fetches a new page, it sends all the links on that page to the master's queue. Similarly, the Bloom filter also lives on the master, so the master only sends slaves URLs that have not been visited. The Bloom filter is kept in the master's memory, and the visited URLs are kept in Redis running on the master, which ensures that all operations are O(1). (At least amortized O(1). For the access efficiency of Redis, please see: LINSERT – Redis)

Consider how to implement it in python:

Install scrapy on each slave, so that each machine becomes a slave capable of crawling, and install Redis and rq on the master to serve as the distributed queue.

The code is written as

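#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()  # pseudocode: a queue reachable over the network
bf = BloomFilter()                      # pseudocode: any Bloom filter implementation

initial_pages = "www.renmingribao.com"
bf.put(initial_pages)
distributed_queue.put(initial_pages)

while True:
    request = wait_for_request()  # pseudocode: next message from a slave
    if request.method == 'GET':
        # a slave asks for a page to crawl
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request.method == 'POST':
        # a slave reports the links it found; enqueue only unseen ones
        for url in request.urls:
            if url not in bf:
                bf.put(url)
                distributed_queue.put(url)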
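The master/slave split can be tried out on one machine before any cluster exists. In the sketch below, threads stand in for the slave machines, a queue.Queue stands in for the Redis-backed distributed queue, and a plain set stands in for the Bloom filter. The LINKS graph and the class and function names are hypothetical, chosen only for this illustration.

```python
import queue
import threading

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


class Master:
    """Owns the shared queue and the set of URLs already handed out."""

    def __init__(self, initial_page):
        self.url_queue = queue.Queue()
        self.seen = {initial_page}
        self.lock = threading.Lock()
        self.url_queue.put(initial_page)

    def submit(self, urls):
        # Called by workers: enqueue only URLs that were never seen.
        with self.lock:
            for url in urls:
                if url not in self.seen:
                    self.seen.add(url)
                    self.url_queue.put(url)


def worker(master, stored):
    while True:
        url = master.url_queue.get()
        if url is None:            # sentinel: no more work
            master.url_queue.task_done()
            return
        stored.append(url)         # list.append is thread-safe in CPython
        master.submit(LINKS.get(url, []))
        master.url_queue.task_done()


master = Master("a")
stored = []
threads = [threading.Thread(target=worker, args=(master, stored)) for _ in range(3)]
for t in threads:
    t.start()
master.url_queue.join()            # blocks until every queued URL is processed
for _ in threads:
    master.url_queue.put(None)     # tell each worker to exit
for t in threads:
    t.join()
print(sorted(stored))
```

Each worker submits the links it found before marking its own URL as done, so the queue's unfinished-task count cannot drop to zero while new work is still about to arrive.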

Okay, in fact, as you can imagine, someone has already written what you need: darkrho/scrapy-redis · GitHub

4) Outlook and post-processing

Although the word "simple" appears a lot above, actually implementing a commercial-scale crawler is not easy. The above code can crawl an entire website without much trouble.

But it gets harder once you add follow-up processing such as:

effective storage (how the database should be laid out)

effective duplicate detection (here meaning duplicate web pages; we don't want to crawl both the People's Daily and the Damin Daily that plagiarized it)

effective information extraction (for example, how to extract all the addresses on a web page, such as "Zhonghua Road, Fenjin Road, Chaoyang District"); search engines usually do not need to store all the information, for example, why would I save pictures...

timely updates (predicting how often a page will be updated)
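As a small taste of page-level duplicate detection, the sketch below fingerprints each page by hashing its text after lowercasing and collapsing whitespace. This is a deliberately naive assumption: it only catches copies that are identical after that normalization, and a plagiarized page with even one edited word would need a near-duplicate technique instead. The function names and sample pages are made up for the example.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicate_pages(pages):
    """Keep the first URL seen for each distinct page body."""
    first_url_by_fingerprint = {}
    for url, text in pages:
        first_url_by_fingerprint.setdefault(content_fingerprint(text), url)
    return list(first_url_by_fingerprint.values())


pages = [
    ("http://a.test/story", "Breaking  news:\nsomething happened."),
    ("http://b.test/copy", "breaking news: something happened."),
    ("http://a.test/other", "A different article."),
]
print(drop_duplicate_pages(pages))
# ['http://a.test/story', 'http://a.test/other']
```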
