We currently run a multi-threaded crawler on Windows and parse pages with BeautifulSoup using the lxml parser.
N crawling threads -> parsing queue -> 1 parsing thread -> storage queue -> 1 storage thread
The throughput of the whole program is bottlenecked by the CPU-bound parsing thread. Simply adding more parsing threads only increases thread-switching overhead and slows things down.
Is there any way to significantly improve the parsing efficiency?
Following the advice of the two experts below, I plan to switch to:
asynchronous crawling -> parsing queue -> N parsing processes -> storage queue -> 1 storage thread
Ready to get to work (a rough sketch of this pipeline follows).
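A minimal sketch of that pipeline, under stated assumptions: aiohttp is assumed as the async HTTP client, Python 3.7+ for asyncio.run, and URLS, the parsing logic, and the print-based storage are all placeholders, not anything from the thread.

```python
import asyncio
import multiprocessing as mp
import threading

import aiohttp                     # assumed async HTTP client
from bs4 import BeautifulSoup

URLS = ["https://example.com/page/%d" % i for i in range(100)]  # placeholder
N_PARSERS = 4                      # tune to the number of CPU cores

async def fetch_all(urls, parse_q):
    # Asynchronous crawling: one coroutine per URL, no thread-switching cost.
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as resp:
                parse_q.put(await resp.text())
        await asyncio.gather(*(fetch(u) for u in urls))
    for _ in range(N_PARSERS):     # one end-of-work sentinel per parser process
        parse_q.put(None)

def parser_worker(parse_q, store_q):
    # CPU-bound parsing runs in separate processes, so the GIL is not a bottleneck.
    while True:
        html = parse_q.get()
        if html is None:
            store_q.put(None)
            break
        soup = BeautifulSoup(html, "lxml")
        store_q.put(soup.title.string if soup.title else "")

def storage_worker(store_q):
    done = 0
    while done < N_PARSERS:        # wait for a sentinel from every parser
        item = store_q.get()
        if item is None:
            done += 1
            continue
        print("store:", item)      # placeholder for real persistence

if __name__ == "__main__":         # required for multiprocessing on Windows
    parse_q, store_q = mp.Queue(), mp.Queue()
    parsers = [mp.Process(target=parser_worker, args=(parse_q, store_q))
               for _ in range(N_PARSERS)]
    for p in parsers:
        p.start()
    storer = threading.Thread(target=storage_worker, args=(store_q,))
    storer.start()
    asyncio.run(fetch_all(URLS, parse_q))
    for p in parsers:
        p.join()
    storer.join()
```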
为情所困 2017-06-12 09:22:36
Actually, I think the N crawling threads at the front can be replaced by coroutines or a thread pool, because frequently creating threads carries a performance cost. A thread pool reduces the creation cost, but context switching is still unavoidable, so coroutines should be the better fit. Replace the single parsing thread with a process pool, opening a few extra processes for the CPU-bound work. The rest shouldn't need to change. If you want to go further still, rewrite the core parsing in C/C++ (a small sketch of the process-pool swap is below). I hope this helps.
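A minimal sketch of the "swap the parsing thread for a process pool" idea, assuming concurrent.futures as the pool API; parse_html and the sample pages are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def parse_html(html):
    # CPU-bound work runs in a worker process, outside the main interpreter's GIL.
    soup = BeautifulSoup(html, "lxml")
    return [a.get("href") for a in soup.find_all("a")]

if __name__ == "__main__":         # required for multiprocessing on Windows
    pages = ["<html><a href='/x'>x</a></html>"] * 10  # placeholder pages
    with ProcessPoolExecutor(max_workers=4) as pool:
        for links in pool.map(parse_html, pages):
            print(links)
```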
怪我咯 2017-06-12 09:22:36
My approach is multi-process. The advantage of multi-process is that when a single machine is no longer enough, you can switch to a distributed crawler at any time.
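A minimal sketch of that multi-process approach; the worker count and URLs are placeholders. The design point is that workers pull from a shared queue, which could later be swapped for something like a Redis list so the same workers run across machines:

```python
import multiprocessing as mp
import urllib.request

def crawl_worker(url_q):
    while True:
        url = url_q.get()
        if url is None:            # sentinel: no more work
            break
        with urllib.request.urlopen(url) as resp:
            print(url, len(resp.read()))

if __name__ == "__main__":         # required for multiprocessing on Windows
    url_q = mp.Queue()
    workers = [mp.Process(target=crawl_worker, args=(url_q,)) for _ in range(4)]
    for w in workers:
        w.start()
    for u in ["https://example.com"] * 8:   # placeholder URLs
        url_q.put(u)
    for _ in workers:
        url_q.put(None)
    for w in workers:
        w.join()
```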