A Python Explanation of the Network I/O Models and the select Model
Network I/O Model
Problems come with crowds. When the web first appeared, very few people used it; in recent years the scale of network applications has grown enormously, and application architectures have had to change with it. The C10K problem forces engineers to think hard about service performance and application concurrency.

Network applications deal with essentially two broad kinds of work: network I/O and computation on data. Compared with computation, network I/O latency is the bigger performance bottleneck for an application. The network I/O models are roughly as follows.
The essence of network I/O is reading from a socket. In Linux a socket is abstracted as a stream, and I/O can be understood as an operation on that stream. The operation is divided into two stages:
Waiting for the data to be ready.
Copying the data from the kernel to the process.
For a socket stream specifically:
The first step usually means waiting for a data packet to arrive over the network and be copied into a buffer in the kernel.
The second step is copying the data from the kernel buffer into the application process's buffer.
I/O model:
Let's use a simple metaphor to understand these models. Network I/O is like fishing: waiting for a fish to bite is waiting for the data to become ready on the network; once the fish bites, pulling it ashore is the kernel copying the data; and the person fishing is the application process.
Blocking I/O
Blocking I/O is the most common I/O model, and it matches the way people naturally think. Blocking means the process is put to "rest" while the CPU is busy running other processes. During network I/O the process issues a recvfrom system call and is then blocked, doing nothing, until the data is ready and has been copied from the kernel into the user process; only then does the process get to handle the data. Through both stages, waiting for the data and copying the data, the whole process is blocked and cannot handle any other network I/O.

This is like casting the rod and then waiting on the shore until a fish bites, then casting again and waiting for the next bite, doing nothing in the meantime except perhaps letting the mind wander.

The characteristic of blocking I/O is that it blocks in both stages of the I/O operation.
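As a minimal sketch of what this looks like in Python (the host, port, and request below are placeholders, not part of the article's later example), both stages happen inside ordinary blocking calls:

import socket

# Minimal blocking-I/O sketch; 'example.com' and port 80 are placeholder values.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('example.com', 80))   # blocks until the TCP handshake completes

sock.sendall(b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')

# recv blocks through both stages: waiting for the response to arrive on the
# network, and copying it from the kernel buffer into this process's buffer.
data = sock.recv(1024)
print(data)
sock.close()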
Non-blocking I/O
During network I/O, non-blocking I/O also issues a recvfrom system call to check whether the data is ready. Unlike blocking I/O, non-blocking I/O splits one long block into many small ones, so the process constantly gets chances to be scheduled by the CPU.

That is, after a non-blocking recvfrom call the process is not blocked: the kernel returns to it immediately. If the data is not ready, an error is returned instead. Having gotten control back, the process can do something else and then issue recvfrom again, repeating the cycle. Making recvfrom calls in a loop like this is usually called polling: the process keeps checking the kernel until the data is ready, and the data is then copied to the process for handling. Note that during the copy itself the process is still blocked.

In fishing terms: after casting the rod we glance at the float to see whether anything is moving. If no fish is biting we go do something else, say dig up a few more earthworms, and come back shortly to check the float again, checking and leaving back and forth until a fish is hooked, and only then dealing with it.

The characteristic of non-blocking I/O is that the user process must keep actively asking the kernel whether the data is ready.
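A rough sketch of that polling pattern in Python follows; the address ('localhost', 5000) and the b'ping\r\n\r\n' payload are illustrative placeholders (they happen to match the echo server shown later in this article):

import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('localhost', 5000))
sock.sendall(b'ping\r\n\r\n')
sock.setblocking(False)          # recv now returns immediately instead of blocking

while True:
    try:
        # Ask the kernel whether the data is ready; if it is, the copy into
        # user space happens here and recv returns the data.
        data = sock.recv(1024)
        break
    except BlockingIOError:
        # Not ready yet: go do something else, then poll again.
        time.sleep(0.1)

print(data)
sock.close()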
Multiplexing I/O
As we have seen, with non-blocking calls the polling takes up a large share of the process's time and consumes a lot of CPU. Combining the two previous modes, it would be better if the polling were done not in user space by the process itself but by someone on its behalf. Multiplexing addresses exactly this problem.

Multiplexing relies on two special system calls, select and poll, whose polling happens at the kernel level. The difference from non-blocking polling is that select can wait on many sockets at once: as soon as any one of them has data ready, select returns it as readable, and the process can then issue a recvfrom call to copy the data from the kernel into the user process; that copy is, of course, blocking. There are two kinds of blocking in multiplexing. After select or poll is called, the process blocks, but unlike the blocking in the first model, select does not wait for the data of every socket to arrive before returning; it wakes the user process as soon as some data is ready. How does it know some data has arrived? The monitoring has been handed over to the kernel, which is responsible for detecting data arrival, so this part can also be understood as "non-blocking".

In terms of fishing, multiplexing means polling several sockets at once. We hire a helper who can cast many rods at the same time; as soon as a fish bites on any rod, he pulls that rod in. He only helps us land the fish, not clean them, so we wait alongside him and deal with each fish ourselves once he brings it ashore. Because multiplexing can handle several I/O streams, it also brings a new problem: the order among the streams becomes uncertain, and the number of streams being handled can of course vary as well.

The characteristic of multiplexing is that, through a single mechanism, one process can wait on many I/O file descriptors at the same time. The kernel monitors these file descriptors (socket descriptors); as soon as any of them becomes read-ready, the select, poll, or epoll call can return. Depending on the monitoring mechanism used, there are three variants: select, poll, and epoll.
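The core of the mechanism is a single call that waits on several file objects at once. Below is a self-contained sketch, using a local socket pair purely so the example can run on its own; real code would pass in network sockets, as the echo server later in this article does:

import select
import socket

# Two local byte streams stand in for two network connections.
a_recv, a_send = socket.socketpair()
b_recv, b_send = socket.socketpair()
a_send.sendall(b'hello from stream a')   # pretend data arrives on the first stream

# One process waits on both streams at once; select returns as soon as any of
# them becomes readable (the last argument is an optional timeout in seconds).
readable, writable, exceptional = select.select([a_recv, b_recv], [], [], 5.0)
for sock in readable:
    # The copy from the kernel buffer into user space still blocks here.
    print(sock.recv(1024))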
Looking back over the three models: when the user process makes its system call, they differ in how they wait for the data to arrive, waiting outright, polling, or using select/poll. In the first stage some block, some do not, and some can do either; in the second stage they all block. Viewed over the whole I/O operation they execute sequentially, so they can all be classed as synchronous models (as opposed to asynchronous): in every case the process actively checks with the kernel.
Asynchronous I/O
Unlike synchronous I/O, asynchronous I/O does not execute sequentially. After the user process issues the aio_read system call, the call returns to the user process immediately, whether or not the kernel data is ready, and the user-space process is free to do other things. When the socket data is ready, the kernel copies the data directly into the process and then sends the process a notification. In both I/O stages the process is non-blocking.

The fishing changes again: this time we hire a fishing expert who not only catches the fish but also texts us once a fish has been landed. We simply delegate the casting to him and run off to do other things until the text arrives, then come back to deal with the fish that are already ashore.
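Python's standard library has no direct wrapper for aio_read, but asyncio gives the application the same shape of experience: hand off the read, go do other work, and get resumed when the data is there. The sketch below only illustrates that calling pattern, not kernel asynchronous I/O (under the hood asyncio's default event loop is built on readiness mechanisms such as epoll); the address is a placeholder matching the echo server later in this article:

import asyncio

async def fetch():
    reader, writer = await asyncio.open_connection('localhost', 5000)
    writer.write(b'ping\r\n\r\n')
    await writer.drain()
    # The coroutine is suspended here without blocking the process; the event
    # loop resumes it (the "notification") once the data has been delivered.
    data = await reader.read(1024)
    print('got:', data)
    writer.close()
    await writer.wait_closed()

async def other_work():
    # While fetch() is waiting for its data, the process is free to do this.
    await asyncio.sleep(0.1)
    print('did something else while waiting')

async def main():
    await asyncio.gather(fetch(), other_work())

asyncio.run(main())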
The difference between synchronous and asynchronous
The models above call for distinguishing two pairs of concepts: blocking versus non-blocking, and synchronous versus asynchronous. The first pair is fairly easy to tell apart; the second is often confused with the first. In my view, synchronous means that somewhere in the I/O operation, in particular during the data copy, the process is blocked and the application keeps checking with the kernel itself. Asynchronous means the user process is non-blocking through the entire I/O operation, and when the copy is done the kernel sends the user process a notification.

The synchronous models differ mainly in how they handle the first stage, while the asynchronous model differs in both stages (the signal-driven model is ignored here). The terminology is still confusing: blocking versus non-blocking only matters within the synchronous models, because asynchronous I/O is always non-blocking, which makes the phrase "asynchronous non-blocking" feel redundant.
Select model
In the synchronization model, using multiplexed I/O can improve server performance.
Among the multiplexing models, select and poll are the most commonly used. Both are system interfaces provided by the operating system; Python's select module wraps them in a somewhat higher-level interface. select and poll work on similar underlying principles. At last we reach the main topic: the focus of this article is the select model.
1. The select principle
Unix abstracts network communication as reading and writing files; underneath sits a device, handled by a device driver, and the driver knows whether its own data is available. Device drivers that support blocking operations usually implement their own wait queues, such as read/write wait queues, to support the blocking or non-blocking operations required by the upper (user) layer. If the device's file resource is available (readable or writable), the process is notified; otherwise the process is put to sleep and woken up when data becomes available.

The file descriptors of these devices are placed in an array, and the array is traversed when select is called. If a file descriptor is readable, the ready descriptor is returned. If, after the traversal, no device file descriptor is available, select puts the user process to sleep until some resource becomes available and wakes it, and the monitored array is traversed again. Every traversal is linear.
2. A select echo server
select involves system calls and operating-system internals, so reading about its principle in the abstract is rather dry; nothing beats demonstrating it in code. Using Python's select module, it is easy to write the following echo server:
import select
import socket
import sys

HOST = 'localhost'
PORT = 5000
BUFFER_SIZE = 1024

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(5)

inputs = [server, sys.stdin]
running = True

while running:
    try:
        # Call select and block until one of the monitored objects is ready
        readable, writeable, exceptional = select.select(inputs, [], [])
    except select.error:
        break
    # Data has arrived; loop over the readable objects
    for sock in readable:
        # Establish the connection
        if sock == server:
            conn, addr = server.accept()
            # Add the connection to the sockets monitored by select
            inputs.append(conn)
        elif sock == sys.stdin:
            junk = sys.stdin.readlines()
            running = False
        else:
            try:
                # Read the data sent by the client connection
                data = sock.recv(BUFFER_SIZE)
                if data:
                    sock.send(data)
                    if data.endswith(b'\r\n\r\n'):
                        # Remove the socket from the sockets monitored by select
                        inputs.remove(sock)
                        sock.close()
                else:
                    # Remove the socket from the sockets monitored by select
                    inputs.remove(sock)
                    sock.close()
            except socket.error:
                inputs.remove(sock)
server.close()
Run the code above and access http://localhost:5000 with curl; the HTTP request that curl sent is echoed back on the command line.
The principle of the above code is analyzed in detail below.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(5)
The code above uses the socket module to create a TCP socket, binds it to the host address and port, and then puts the server into listening mode.
inputs = [server, sys.stdin]
This defines the list of objects that select will monitor (they correspond to the file descriptors watched by the system): the listening socket and standard input.

The code then enters the server's main loop, effectively an infinite loop that runs until running is set to False.
try:
    # Call select and block until one of the monitored objects is ready
    readable, writeable, exceptional = select.select(inputs, [], [])
except select.error:
    break
The select function is called and starts checking the list inputs that was passed in. If nobody has curled the server yet, no TCP client connection has been established, so none of the objects in the list has data available; select therefore blocks and does not return.

Once a client runs curl http://localhost:5000, a socket conversation begins. The first object in inputs, the server socket, goes from unavailable to available, so the select call returns, and readable now contains one socket object (its file descriptor is readable).
for sock in readable:
    # Establish the connection
    if sock == server:
        conn, addr = server.accept()
        # Add the connection to the sockets monitored by select
        inputs.append(conn)
After select returns, the readable objects are traversed. At this point there is only the one listening socket, so its accept() method is called to complete the TCP three-way handshake, and the resulting connection object is appended to the inputs monitoring list, meaning we want to watch that connection for data I/O.

Since readable held only that one object, the traversal ends and we return to the main loop to call select again. This time select not only watches for new connections to accept but also monitors the newly added connection. If curl's data has arrived, select returns it in readable and the for loop runs again. For a readable object that is not the listening socket, the following code executes:
try:
    # Read the data sent by the client connection
    data = sock.recv(BUFFER_SIZE)
    if data:
        sock.send(data)
        if data.endswith(b'\r\n\r\n'):
            # Remove the socket from the sockets monitored by select
            inputs.remove(sock)
            sock.close()
    else:
        # Remove the socket from the sockets monitored by select
        inputs.remove(sock)
        sock.close()
except socket.error:
    inputs.remove(sock)
Call the recv function through the socket connection to obtain the data sent by the client. When the data transmission is completed, the connection is removed from the monitored inputs list. Then close the connection.
That is the whole network interaction. Of course, if the user types something on the command line, the sys.stdin object monitored in inputs also causes select to return, and the following code runs:
elif sock == sys.stdin:
    junk = sys.stdin.readlines()
    running = False
Some readers may wonder: while the program is handling a sock connection, what happens if another curl request reaches the server? There is no doubt that the server socket in inputs becomes available. When the current for loop finishes and select is called again, it returns the server socket; if the conn connection from the earlier request is still in inputs and has data, it is returned as well. The loop then accepts the new socket, adds it to inputs for monitoring, and goes on to process the earlier conn connection. Everything proceeds in an orderly way until the for loop ends and the main loop calls select again.

Whenever any object monitored in inputs has data, the next select call returns it in readable. Whenever select returns, the for loop runs over readable, and only after that loop ends is select called again.

The main point to note is that establishing a socket connection is one I/O event, and the arrival of data on an established connection is another.
3. select's shortcomings
select is convenient to use and works across platforms, but it still has problems.

select has to traverse the monitored file descriptors, and the descriptor array has a hard upper limit on its size (typically 1024). As the number of file descriptors grows, the cost of copying the descriptor set between user space and the kernel also grows linearly, and even descriptors that have been idle for a long time are still scanned linearly on every call.
To address these problems the operating system provides poll, but the poll model is roughly the same as select, with only some of the limits relaxed. On Linux, the most advanced mechanism today is the epoll model.
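Python exposes these mechanisms through the higher-level selectors module, whose DefaultSelector picks the best implementation available on the platform (epoll on Linux, kqueue on BSD/macOS, falling back to poll or select elsewhere). As a hedged sketch rather than a drop-in replacement for the example above, the same echo idea looks roughly like this (same host, port, and buffer size; the stdin handling and the '\r\n\r\n' check are omitted for brevity):

import selectors
import socket

HOST = 'localhost'
PORT = 5000
BUFFER_SIZE = 1024

sel = selectors.DefaultSelector()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(5)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    # Block until any registered file object is ready, like select.select above.
    for key, events in sel.select():
        sock = key.fileobj
        if sock is server:
            conn, addr = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = sock.recv(BUFFER_SIZE)
            if data:
                sock.send(data)
            else:
                sel.unregister(sock)
                sock.close()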
Much high-performance software, such as nginx and Node.js, builds its asynchronous behavior on epoll.