
How to improve the parsing efficiency of a Python crawler?

We currently run multi-threaded crawling on Windows and use BeautifulSoup with the lxml parser for parsing.

N crawling threads -> parsing queue -> 1 parsing thread -> storage queue -> 1 storage thread

The whole program is bottlenecked on the CPU-intensive parsing thread. Simply adding more parsing threads only increases thread-switching overhead and makes things slower.

Is there any way to significantly improve the parsing efficiency?

Update: following the advice of the two experts below, I plan to switch to:
Asynchronous crawling -> parsing queue -> N parsing processes -> storage queue -> storage thread
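
A minimal sketch of that pipeline, assuming aiohttp as the async HTTP client (any async fetcher would do) and a `parse()` placeholder standing in for the real BeautifulSoup work:

```python
import asyncio
import queue
import threading
from concurrent.futures import ProcessPoolExecutor

import aiohttp                      # assumed async HTTP client, not from the original post
from bs4 import BeautifulSoup


def parse(html):
    # CPU-bound parsing runs in a worker process, so it is not limited by the GIL
    soup = BeautifulSoup(html, "lxml")
    return soup.title.string if soup.title else None


def store_worker(store_queue):
    # single storage consumer, as in the original design
    while True:
        item = store_queue.get()
        if item is None:            # sentinel: shut down
            break
        print("stored:", item)


async def crawl(url, session, pool, store_queue):
    async with session.get(url) as resp:
        html = await resp.text()
    loop = asyncio.get_running_loop()
    # hand the page off to the "N parsing processes" stage
    result = await loop.run_in_executor(pool, parse, html)
    store_queue.put(result)


async def main(urls):
    store_queue = queue.Queue()
    storer = threading.Thread(target=store_worker, args=(store_queue,))
    storer.start()
    with ProcessPoolExecutor(max_workers=4) as pool:
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(crawl(u, session, pool, store_queue) for u in urls))
    store_queue.put(None)
    storer.join()


if __name__ == "__main__":
    asyncio.run(main(["https://example.com"]))
```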

Ready to start work

世界只因有你 · 2708 days ago

All replies (3)

  • 为情所困 · 2017-06-12 09:22:36

    Actually, I think the N crawling threads at the front can be replaced by coroutines or a thread pool, because frequently creating threads costs performance. A thread pool reduces that loss, but context switching is still unavoidable, so coroutines are the better fit.
    Replace the single parsing thread with a process pool and open a few more processes for the CPU-intensive work; the rest shouldn't need to change. If you want to go further, rewrite the core parsing in C/C++. Hope this helps.
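
    A minimal sketch of the "parse with a process pool" idea; `parse_page()` is a hypothetical stand-in for whatever BeautifulSoup/lxml work the parser already does:

```python
from concurrent.futures import ProcessPoolExecutor

from bs4 import BeautifulSoup


def parse_page(html):
    # stand-in for the real parsing logic
    soup = BeautifulSoup(html, "lxml")
    return [a.get("href") for a in soup.find_all("a")]


def parse_batch(pages, workers=4):
    # each worker is a separate process, so CPU-bound parsing scales with cores
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_page, pages, chunksize=8))
```

    On Windows the pool must be created under an `if __name__ == "__main__":` guard. Also, related to the C/C++ remark: using lxml's own API directly, instead of through the BeautifulSoup wrapper, already pushes most of the parsing work into C.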

  • 怪我咯 · 2017-06-12 09:22:36

    My approach is multi-process. The advantage of multi-process is that when a single machine's performance is no longer enough, you can switch to a distributed crawler at any time.
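
    A minimal sketch of such a multi-process layout; `requests` and the URLs are illustrative assumptions, not from the original reply:

```python
import multiprocessing as mp

import requests                  # assumed fetch library, for illustration only
from bs4 import BeautifulSoup


def fetch_and_parse(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    return url, (soup.title.string if soup.title else None)


def worker(task_queue, result_queue):
    while True:
        url = task_queue.get()
        if url is None:          # sentinel: shut down
            break
        result_queue.put(fetch_and_parse(url))


if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for p in procs:
        p.start()
    urls = ["https://example.com", "https://example.org"]
    for u in urls:
        tasks.put(u)
    for _ in procs:
        tasks.put(None)
    for _ in urls:
        print(results.get())
    for p in procs:
        p.join()
```

    Swapping the in-process `mp.Queue` for an external queue service is what makes the later jump to a distributed crawler straightforward.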

  • 淡淡烟草味 · 2017-06-12 09:22:36

    You can find examples of Tornado asynchronous crawlers online; that's what I'm using.
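
    A minimal Tornado async-fetch sketch (not the commenter's actual code; the URLs are placeholders):

```python
from tornado import gen, httpclient, ioloop


async def fetch(url):
    client = httpclient.AsyncHTTPClient()
    response = await client.fetch(url)
    return url, len(response.body)


async def main():
    urls = ["https://example.com", "https://example.org"]
    # the fetches run concurrently on Tornado's event loop
    results = await gen.multi([fetch(u) for u in urls])
    for url, size in results:
        print(url, size)


if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)
```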
