
How to use Python to make a crawler

Nov 23, 2016 pm 01:23 PM
python

Getting Started" is a good motivation, but it may be slow. If you have a project in your hands or in your mind, then in practice you will be driven by the goal, instead of learning slowly like a learning module.

In addition, if each knowledge point in a knowledge system is a node in a graph and each dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B. Therefore, you do not need to learn how to "get started", because such a "getting started" point does not exist! What you need to learn is how to build something larger, and in the process you will quickly pick up whatever you need to know. Of course, you could argue that you need to know Python first, otherwise how could you use Python to make a crawler? But in fact, you can learn Python in the process of making this crawler :D

I saw the "technique" mentioned in many previous answers - what to use? How does the software crawl? Let me talk about the "Tao" and "Technology" - how the crawler works and how to implement it in python

Let's summarize it briefly:
You need to learn

the basic working principles of the crawler

a basic HTTP scraping tool: Scrapy (see the minimal spider sketch after this list)

Bloom Filter: Bloom Filters by Example

If you need to crawl web pages on a large scale, you need to learn the concept of distributed crawlers. It is not that mysterious: you only need to learn how to maintain a distributed queue that all machines in the cluster can share. The simplest implementation is python-rq: https://github.com/nvie/rq

The combination of rq and Scrapy: darkrho/scrapy-redis · GitHub

follow-up processing: web page content extraction (grangier/python-goose · GitHub), storage (MongoDB)
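
As a taste of Scrapy before the longer story below, here is a minimal spider sketch; the spider name, the start URL, and the follow-every-link rule are illustrative choices, not part of the original answer:

import scrapy

class RenminribaoSpider(scrapy.Spider):
    name = "renminribao"                            # illustrative spider name
    start_urls = ["http://www.renminribao.com"]     # the initial page

    def parse(self, response):
        # Follow every link found on the page and parse it the same way.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Save this to a file and run it with scrapy runspider; Scrapy then takes care of scheduling, request deduplication, and politeness settings for you.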


The following is the long version of the story:

Let me share the experience of crawling all of Douban with a cluster I once wrote.

1) First, you have to understand how a crawler works.
Imagine you are a spider, and now you have been placed on the Internet. So, what should you do? No problem: just start somewhere, for example the home page of the People's Daily. This is called the initial page, represented by $.

On the home page of the People's Daily, you see various links leading out of that page, so you happily crawl to the "Domestic News" page. Great, now you have crawled two pages (the homepage and domestic news)! For now, don't worry about how to process the pages you crawl down; just imagine that you copy each page in full as an HTML document and keep it with you.

Suddenly you find that the domestic news page has a link back to the "Home Page". As a smart spider, you surely know you don't have to crawl back, because you have already seen that page. So, you need to use your brain to remember the addresses of the pages you have already viewed. Then, every time you see a new link that might need to be crawled, you first check whether that address is already in your head. If you have been there, don't go again.

Okay, in theory, if all pages can be reached from the initial page, then it can be proved that you can definitely crawl all web pages.

So how to implement it in python?
Very simple
import queue  # Python 3; in Python 2 this was the Queue module

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until the seas run dry and the rocks crumble
    if not url_queue.empty():
        current_url = url_queue.get()               # take the first url out of the queue
        store(current_url)                          # store the page this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break

This is already written pretty much as pseudocode: store() and extract_urls() are left for you to fill in.

The backbone of every crawler is right there. Now let's analyze why a crawler is actually a very complicated thing: search engine companies usually have an entire team to maintain and develop it.

2) Efficiency
If you tidy up the above code a bit and run it directly, it would take you a whole year to crawl down all of Douban's content. Not to mention that search engines like Google need to crawl the entire web.

What's the problem? There are too many web pages to crawl, and the above code is too slow. Suppose the whole web has N websites; then the complexity of the deduplication check is N*log(N), because every page has to be traversed once, and checking membership in the set each time costs log(N). OK, OK, I know that Python's set is implemented with hashing, but this is still too slow, and at the very least the memory usage is not efficient.

What is the usual way to do deduplication? A Bloom Filter. Simply put, it is still a hashing approach, but its distinguishing feature is that it uses a fixed amount of memory (which does not grow with the number of URLs) to decide, in O(1) time, whether a URL is already in the set. Unfortunately, there is no free lunch. The only problem is this: if the URL is not in the set, the BF can be 100% sure that the URL has not been seen. But if the URL is in the set, it will tell you: this URL should have appeared already, though I have 2% uncertainty. Note that this uncertainty can become very small when the memory you allocate is large enough. A simple tutorial: Bloom Filters by Example
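
To make this concrete, here is a toy Bloom filter sketch written from scratch; the bit-array size, the number of hash functions, and the class name are all illustrative choices, and a real crawler would use a tested library instead:

import hashlib

class BloomFilter:
    # Fixed memory, O(1) add and lookup, no false negatives, a small false-positive rate.
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)   # 1 MiB of bits with the default size

    def _positions(self, url):
        # Derive several bit positions from salted hashes of the url.
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, url)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

bf = BloomFilter()
bf.add("http://www.renminribao.com")
print("http://www.renminribao.com" in bf)   # True: a seen url is always reported as seen
print("http://example.com/unseen" in bf)    # almost certainly False; tiny chance of a false positive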

Note this property: if a URL has been seen, then with some small probability it may be looked at again (that's fine; a few extra looks won't wear you out). But if it has not been seen, it will definitely be looked at (this is important, otherwise we would miss some web pages!). [IMPORTANT: there is a problem with this paragraph, please skip it for now]


Okay, now we are close to the fastest way of handling deduplication. But there is another bottleneck: you have only one machine. No matter how big your bandwidth is, as long as the speed at which your machine downloads web pages is the bottleneck, that is the only speed you can improve. If one machine isn't enough, use many! Of course, we assume each machine is already running at maximum efficiency, using multithreading (for Python, multiprocessing).
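
Here is a minimal sketch of that single-machine parallelism, assuming a plain urllib download function and an arbitrary pool of 8 worker processes (both are my assumptions, not part of the original answer):

from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url):
    # Download one page; return (url, html) or (url, None) if the request fails.
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.read()
    except Exception:
        return url, None

if __name__ == "__main__":
    urls = ["http://www.renminribao.com"]   # in practice this list comes from the url queue
    with Pool(processes=8) as pool:         # 8 workers is an arbitrary choice
        for url, html in pool.imap_unordered(fetch, urls):
            print(url, "failed" if html is None else "%d bytes" % len(html))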

3) Cluster crawling
When crawling Douban, I used more than 100 machines in total, running around the clock for a month. Imagine that with only one machine you would have to run it for 100 months...

So, assuming you have 100 machines available right now, how do you implement a distributed crawling algorithm in Python?

Let's call the 99 machines with less computing power slaves, and the one larger machine the master. Looking back at the url_queue in the code above: if we can put this queue on the master, then all the slaves can communicate with the master over the network. Whenever a slave finishes downloading a web page, it asks the master for a new page to crawl; and every time a slave grabs a new page, it sends all the links on that page to the master's queue. Similarly, the Bloom filter lives on the master, and the master now sends the slaves only the URLs that have not been visited yet. The Bloom filter sits in the master's memory, while the visited URLs are kept in Redis running on the master, which keeps all operations O(1). (At least amortized O(1); for the access efficiency of Redis, see: LINSERT – Redis)

Consider how to implement it in python:

Install Scrapy on each slave, so that each machine becomes a slave capable of crawling; install Redis and rq on the master to use as a distributed queue.
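
A minimal sketch of what the rq side can look like; the crawl_page job function, the queue name, and the Redis host are illustrative assumptions, not the original author's code:

# master side: push crawl jobs onto a Redis-backed rq queue
from redis import Redis
from rq import Queue

from crawl_tasks import crawl_page   # hypothetical module shared with the slaves

q = Queue("crawl", connection=Redis(host="master-host", port=6379))
q.enqueue(crawl_page, "http://www.renminribao.com")

Each slave then runs an rq worker process against the same Redis instance, so jobs placed on the queue by the master are picked up and executed on whichever slave is free.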


In code, this looks roughly like:

#slave.py
current_url = request_from_master()            # ask the master for the next url to crawl
to_send = []
for next_url in extract_urls(current_url):     # collect every link found on this page
    to_send.append(next_url)
store(current_url)                             # store the downloaded page
send_to_master(to_send)                        # report the newly found links to the master

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()

initial_page = "www.renminribao.com"
distributed_queue.put(initial_page)            # seed the queue with the initial page

while True:
    request = wait_for_slave_request()         # a slave asks for work or reports links
    if request.type == 'GET':                  # the slave wants a new url to crawl
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request.type == 'POST':               # the slave reports a discovered url
        if request.url not in bf:              # only keep urls the Bloom filter has not seen
            bf.put(request.url)
            distributed_queue.put(request.url)

Okay, in fact, as you can imagine, someone has already written what you need: darkrho/scrapy-redis · GitHub

4) Outlook and post-processing

Although the word "simple" appears a lot above, actually implementing a crawler usable at commercial scale is not easy. The code above can crawl an entire website without much trouble.

But if you also need follow-up processing such as:

effective storage (how should the database be laid out?)

effective deduplication (here meaning deduplication of web pages; we don't want to crawl both the People's Daily and the Damin Daily that plagiarized it)

effective information extraction (for example, how to extract all the addresses on a web page, such as "Zhonghua Road, Fenjin Road, Chaoyang District"; search engines usually do not need to store all the information, for example, why would I save the images...) (see the extraction sketch after this list)

timely updates (predicting how often a page will be updated)
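
As a small sketch of the extraction step, here is how the goose3 fork of the python-goose library mentioned above can pull the title and main text out of a page; the choice of the fork and the article URL are my assumptions:

from goose3 import Goose    # maintained Python 3 fork of grangier/python-goose

g = Goose()
article = g.extract(url="http://www.renminribao.com/some-article.html")   # illustrative url
print(article.title)                 # the extracted headline
print(article.cleaned_text[:200])    # the first 200 characters of the extracted body text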

