
What an artifact! An efficient Python crawler framework that is easier to use than requests!

WBOY · 2023-04-13

Recently, the company's Python back-end project was restructured, and the back-end logic was largely rewritten around "asynchronous" coroutines. Staring at a screen full of code decorated with async and await (Python's coroutine syntax), I suddenly felt confused and at a loss.

Although I had learned what a "coroutine" is before, I had never explored it in depth, so I took this opportunity to study it properly.

Let's go

What is a coroutine?

Simply put, coroutines are built on top of threads but are even more lightweight. They are invisible to the system kernel: this kind of lightweight thread, scheduled entirely by the programmer's own code, is often called a "user-space thread".
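To make this concrete, here is a minimal sketch (my own addition, not from the original article) of two coroutines interleaving on a single thread; because both sleeps overlap, the whole thing finishes in about one second rather than two:

import asyncio

async def worker(name):
    print(f'{name} starts')
    await asyncio.sleep(1)  # yield control to the event loop while "blocked"
    print(f'{name} resumes')

async def main():
    # Run both coroutines concurrently on the same thread.
    await asyncio.gather(worker('A'), worker('B'))

asyncio.run(main())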

What are the advantages of coroutines over multi-threads?

1. Threads are scheduled by the operating system, while coroutines are scheduled entirely by the user's own code. Using coroutines therefore reduces context switching at runtime and effectively improves execution efficiency.

2. When a thread is created, the system allocates it a stack of about 1 MB by default, while a coroutine is much lighter, closer to 1 KB, so far more coroutines fit in the same amount of memory.

3. Since coroutines are in essence single-threaded rather than multi-threaded, no locking mechanism is needed: with only one thread, there is no conflict from writing variables at the same time, so controlling a shared resource only requires checking its state. This makes coroutines execute much more efficiently than multiple threads and avoids the contention that threads suffer from (see the sketch below).
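As a quick illustration of point 3, here is a sketch (my own addition) in which ten thousand coroutines mutate a shared counter with no lock; switches can only happen at await points, so the increments never interleave:

import asyncio

counter = 0

async def increment():
    global counter
    counter += 1            # safe: only one thread, no parallel writes
    await asyncio.sleep(0)  # yield so other coroutines get a turn

async def main():
    await asyncio.gather(*(increment() for _ in range(10_000)))
    print(counter)  # always prints 10000

asyncio.run(main())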

Applicable & Inapplicable Scenarios of Coroutines

Suitable scenarios: coroutines fit IO-blocking workloads that need a large amount of concurrency.

Unsuitable scenarios: coroutines do not fit computation-heavy workloads (because the essence of a coroutine is switching back and forth within a single thread). If you run into that situation, you should reach for other means, as the sketch below illustrates.
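A small illustrative sketch (my own, not the author's) of why CPU-bound code gains nothing from coroutines: a coroutine that never awaits monopolizes the thread, so gathering two heavy coroutines takes roughly twice as long as running one:

import asyncio
import time

async def cpu_heavy():
    # Pure computation with no await inside: the event loop never gets control.
    total = 0
    for i in range(10_000_000):
        total += i
    return total

async def main():
    start = time.time()
    # These run one after the other, not concurrently: there is no IO to overlap.
    await asyncio.gather(cpu_heavy(), cpu_heavy())
    print(time.time() - start)  # roughly double the single-call time

asyncio.run(main())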

Initial exploration of the asynchronous HTTP framework httpx

By now we should have a general understanding of coroutines, but at this point I believe some friends are still full of doubts: how do coroutines help with interface testing? Don't worry, the answer is below.

I believe friends who have used Python for interface testing are all familiar with the requests library. HTTP requests in requests are synchronous, but in fact, given the IO-blocking nature of HTTP requests, coroutines are a natural fit for implementing "asynchronous" HTTP requests and improving testing efficiency.

Surely someone noticed this long ago. Sure enough, after some exploration on GitHub, I found an open source library that supports coroutine-based "asynchronous" HTTP calls: httpx.

What is httpx

httpx is an open source library that inherits almost all the features of requests while also supporting "asynchronous" HTTP requests. To put it simply, httpx can be considered an enhanced version of requests.

Now follow along with me to see the power of httpx.

Installation

Installing httpx is very simple; it requires Python 3.6 or above.

pip install httpx
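Once installed, a quick sanity check (my own example) shows that the synchronous API mirrors requests almost one-to-one:

import httpx

# The synchronous API looks just like requests.
response = httpx.get('http://www.baidu.com')
print(response.status_code)              # e.g. 200
print(response.headers['content-type'])  # e.g. text/html
print(response.text[:100])               # first 100 characters of the body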

Best Practice

As the saying goes, efficiency determines success or failure. I used httpx in both asynchronous and synchronous modes to compare the time consumed by a batch of HTTP requests. Let's take a look at the results~

First, let's look at how long synchronous HTTP requests take:

import threading
import time

import httpx

def sync_main(url, sign):
    # A plain synchronous GET: each call blocks until the response arrives.
    response = httpx.get(url).status_code
    print(f'sync_main: {threading.current_thread()}: {sign}: {response}')

sync_start = time.time()
for i in range(200):
    sync_main(url='http://www.baidu.com', sign=i)
sync_end = time.time()
print(sync_end - sync_start)

The code is quite simple: sync_main performs a synchronous HTTP request to Baidu, and it is called 200 times.

The output after running is as follows (only the key part is shown...):

sync_main: <_MainThread(MainThread, started 4471512512)>: 192: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 193: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 194: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 195: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 196: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 197: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 198: 200
sync_main: <_MainThread(MainThread, started 4471512512)>: 199: 200
16.56578803062439

You can see from the output that the main thread never switches (it is a single thread, after all!) and the requests execute in order (because they are synchronous requests).

The program took a total of 16.6 seconds to run.

Next, let's try the "asynchronous" HTTP requests:

import asyncio
import threading
import time

import httpx

client = httpx.AsyncClient()

async def async_main(url, sign):
    # `await` yields to the event loop while the request is in flight.
    response = await client.get(url)
    print(f'async_main: {threading.current_thread()}: {sign}: {response.status_code}')

async def main():
    tasks = [async_main(url='http://www.baidu.com', sign=i) for i in range(200)]
    await asyncio.gather(*tasks)
    await client.aclose()  # release the connection pool when finished

async_start = time.time()
asyncio.run(main())
async_end = time.time()
print(async_end - async_start)

The code above uses the async and await keywords in async_main to implement the "asynchronous" HTTP request, uses asyncio (Python's asynchronous IO library) to request the Baidu homepage 200 times, and prints the time taken.

After running the code, you can see the following output (only the key part is shown...).

async_main: <_MainThread(MainThread, started 4471512512)>: 56: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 99: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 67: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 93: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 125: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 193: 200
async_main: <_MainThread(MainThread, started 4471512512)>: 100: 200
4.518340110778809

You can see that although the order is scrambled (56, 99, 67...) because the program keeps switching between coroutines, the main thread never switches (the essence of coroutines is still a single thread).

The program took a total of 4.5 seconds.

Compared with the 16.6 seconds taken by the synchronous requests, that is a reduction of nearly 73%!
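One practical note (my addition, not part of the original benchmark): firing 200 requests at a server all at once can be unfriendly. A common pattern is to cap the number of in-flight requests with asyncio.Semaphore; the sketch below also uses the async with form of AsyncClient, which closes the connection pool automatically:

import asyncio

import httpx

async def fetch(client, sem, url, sign):
    async with sem:  # at most 20 requests in flight at any moment
        response = await client.get(url)
        print(f'{sign}: {response.status_code}')

async def main():
    sem = asyncio.Semaphore(20)
    # `async with` closes the client (and its connection pool) on exit.
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *(fetch(client, sem, 'http://www.baidu.com', i) for i in range(200))
        )

asyncio.run(main())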

As the saying goes, one fast step makes every step fast. In terms of time consumption, "asynchronous" httpx is indeed far faster than synchronous HTTP. Of course, coroutines can empower interface testing beyond raw request efficiency: once you have mastered them, I believe your technical level will rise a notch as well, letting you design better testing frameworks.

Okay, that is all the content shared today. If you liked it, please give it a like~

Statement: this article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.