
Nginx thread pool and performance analysis

WBOY
2016-08-08 09:19:43
As we know, NGINX uses an asynchronous, event-driven approach to handling connections. Instead of creating a dedicated process or thread for each request (as servers with a traditional architecture do), it handles multiple connections and requests in a single worker process. To achieve this, NGINX works with sockets in non-blocking mode and uses efficient methods such as epoll and kqueue. Because the number of full-weight processes is small (usually only one per CPU core) and constant, much less memory is consumed and CPU cycles are not wasted on task switching. The advantages of this approach are well known through the example of NGINX itself, which handles millions of concurrent requests and scales very well.
Each process consumes additional memory, and each switch between processes consumes CPU cycles and discards data in the CPU cache.
However, the asynchronous, event-driven approach still has a problem. Or, as I like to think of it, an enemy, and the name of the enemy is blocking. Unfortunately, many third-party modules use blocking calls, and users (and sometimes even the module developers) are unaware of the drawbacks. Blocking operations can ruin NGINX performance and must be avoided at all costs. Even in the current official NGINX code, blocking cannot be avoided in every case. The thread pool mechanism implemented in NGINX 1.7.11 was introduced to solve this problem. We will describe what this thread pool is and how to use it later. For now, let's meet our enemy face to face.

2. The Problem

First, to better understand the problem, a few words about how NGINX works. In general, NGINX is an event handler: a controller that receives information about all connection events from the kernel and then tells the operating system what to do. In effect, NGINX does the hard work of orchestrating the operating system, while the operating system does the routine work of reading and sending bytes. It is therefore very important for NGINX to respond quickly and in a timely manner.
The worker process listens for and processes events from the kernel
Events can be timeouts, notifications that a socket is ready to read or write, or notifications of errors. NGINX receives a batch of events and then processes them one by one, performing the necessary actions. Thus all processing is done in a simple loop over a queue in one thread. NGINX takes an event from the queue and reacts to it, for example by reading from or writing to a socket. In most cases this is extremely quick (perhaps requiring just a few CPU cycles to copy some data in memory), and NGINX gets through all the events in the queue in an instant.

All processing is done in a simple loop by one thread
But what happens if some long and heavy operation has to be performed? The entire event processing loop gets stuck waiting for that operation to finish. So by a "blocking operation" we mean any operation that stops the event processing loop for a significant amount of time. Operations can be blocking for various reasons. For example, NGINX might be busy with lengthy, CPU-intensive processing, or it might have to wait for access to a resource (such as a hard disk, a mutex, or a library function call that fetches a response from a database synchronously). The key point is that while processing such an operation, the worker process cannot do anything else and cannot handle other events, even if more system resources are available and some events in the queue could make use of them.

Let's use an analogy. A salesperson in a store faces a long queue of customers. The first customer asks for an item that is not in the store but in the warehouse. The salesperson goes to the warehouse to fetch it. Now the entire queue has to wait for hours for this delivery, and everyone in the queue is unhappy. You can imagine people's reactions, right? The waiting time of every person in the queue grows by those hours, even though the items they want to buy might be right there in the store.
Everyone in the queue has to wait for the first person's order
Almost the same thing happens in NGINX when it needs to read a file that is not cached in memory and must be read from disk. Reading from a disk (especially a spinning one) is slow, and the other requests waiting in the queue are forced to wait even though they may not need disk access at all. As a result, latency increases and system resources are underutilized.

One blocking operation is enough to significantly delay all subsequent operations
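The effect can be sketched with a few lines of Python. This is only a simulation under simple assumptions, not NGINX code: the handling costs are made-up numbers, and completion_times is a hypothetical helper that models one thread handling queued events strictly in order.

```python
from itertools import accumulate

def completion_times(durations):
    """Return when each queued event finishes if a single thread
    handles the events strictly one after another."""
    return list(accumulate(durations))

# Hypothetical handling costs in seconds: one 2-second disk read
# followed by three events that need about a millisecond each.
blocked = completion_times([2.0, 0.001, 0.001, 0.001])

# The same three fast events with no slow read in front of them.
unblocked = completion_times([0.001, 0.001, 0.001])

# Each fast event behind the slow read finishes more than 2 seconds late,
# although it needs only about a millisecond of work itself.
print([round(t, 3) for t in blocked])
print([round(t, 3) for t in unblocked])
```

The fast events cost the same in both runs; only their position behind the slow read changes when they complete.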
Some operating systems provide an asynchronous interface for reading and writing files, and NGINX can use such an interface (see the aio directive). FreeBSD is a good example. Unfortunately, we can't say the same about Linux. Although Linux provides an asynchronous interface for reading files, it has significant drawbacks. One is the alignment requirement for file access and buffers, but NGINX handles that well. The second is worse: the asynchronous interface requires the O_DIRECT flag to be set on the file descriptor, which means that any access to the file bypasses the in-memory cache and increases the load on the disk. In many scenarios that is definitely not the best option.

To solve this problem specifically, thread pools were introduced in NGINX 1.7.11. They are not yet included in NGINX+ by default, but you can contact sales if you would like to try a build of NGINX+ R6 that has thread pools enabled.

Now let's look at what thread pools are and how they work.

3. Thread Pools

Let's return to our poor salesperson who has to fetch goods from a faraway warehouse. This time he has become smarter (or maybe he became smarter after a lecture from a crowd of angry customers?) and hired a delivery service. Now, when someone asks for an item that is in the faraway warehouse, he no longer goes to the warehouse himself. He just passes the order to the delivery service, which handles it, while he continues to serve other customers. Only the customers whose goods are in the warehouse wait for delivery; everyone else is served immediately.
Passing an order to the delivery service does not block the queue
For NGINX, the thread pool plays the role of the delivery service. It consists of a task queue and a number of threads that handle the queue. When a worker process needs to perform a potentially long operation, it does not perform the operation itself but puts a task in the pool's queue, from which any free thread can take and execute it.

The worker process offloads the blocking operation to the thread pool
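The mechanism described above (a task queue plus a set of threads that drain it) can be sketched in Python. This is an illustration of the idea only, not how NGINX implements it: slow_read is a hypothetical stand-in for a blocking disk read, and the pool size of 4 is arbitrary.

```python
import queue
import threading
import time

def slow_read(name):
    """Hypothetical stand-in for a blocking disk read (it just sleeps)."""
    time.sleep(0.05)
    return f"contents of {name}"

tasks = queue.Queue()      # task queue filled by the main loop
completed = queue.Queue()  # finished tasks handed back to the main loop

def pool_thread():
    # Each pool thread takes tasks from the queue until it sees None.
    while True:
        name = tasks.get()
        if name is None:
            break
        completed.put((name, slow_read(name)))

pool = [threading.Thread(target=pool_thread) for _ in range(4)]
for t in pool:
    t.start()

# The main loop only enqueues the slow reads; it does not wait for them here.
start = time.monotonic()
for name in ["a.bin", "b.bin", "c.bin"]:
    tasks.put(name)
enqueue_time = time.monotonic() - start

# Later, the main loop picks up the results the pool threads produced.
results = dict(completed.get() for _ in range(3))

for _ in pool:
    tasks.put(None)  # one stop marker per pool thread
for t in pool:
    t.join()

print(f"enqueueing took {enqueue_time:.4f}s, results for {sorted(results)}")
```

Enqueueing three tasks typically takes far less time than a single read, which is the whole point: the loop that accepts work stays free while the pool threads absorb the waiting.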

So it seems we have another queue. That's right, but in this case the queue is limited by a specific resource. We can't read from a disk faster than the disk can produce data. At least now the disk no longer delays the processing of other events; only the requests that need to access files wait.

Reading from disk is the most common example of a blocking operation, but the thread pool implementation in NGINX can be used for any task that is not appropriate to process in the main loop. At the moment, only two essential operations are offloaded to thread pools: the read() system call on most operating systems and sendfile() on Linux. We will continue to test and benchmark the implementation, and we may offload other operations to thread pools in future versions if there is a clear benefit.

4. Benchmarking

Now let's move from theory to practice. To demonstrate the effect of using thread pools, we run a synthetic benchmark that simulates the worst mix of blocking and non-blocking operations. It requires a data set that is guaranteed not to fit in memory. On a machine with 48 GB of RAM, we generated 256 GB of random data in 4 MB files and then configured NGINX 1.9.0 to serve it. The configuration is simple:

```
worker_processes 16;

events {
    accept_mutex off;
}

http {
    include mime.types;
    default_type application/octet-stream;

    access_log off;
    sendfile on;
    sendfile_max_chunk 512k;

    server {
        listen 8000;

        location / {
            root /storage;
        }
    }
}
```

As shown above, several parameters are tuned for better performance: logging and accept_mutex are disabled, sendfile is enabled, and sendfile_max_chunk is set. This last directive reduces the maximum time spent in a blocking sendfile() call, because NGINX does not try to send the whole file at once but sends it in 512 KB chunks.
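A quick back-of-the-envelope calculation shows why this data set cannot stay in memory and what the chunk size means for a single file. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes), which the article does not state, and it ignores memory used by anything other than the page cache.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

file_size = 4 * MIB      # size of each generated file
data_set = 256 * GIB     # total amount of generated data
ram = 48 * GIB           # memory installed in the test server
max_chunk = 512 * 1024   # sendfile_max_chunk 512k

files = data_set // file_size             # number of files in the data set
chunks_per_file = file_size // max_chunk  # pieces one file is sent in
cacheable_share = ram / data_set          # upper bound on what RAM can hold

print(files, chunks_per_file, cacheable_share)  # 65536 8 0.1875
```

Under these assumptions at most about 19% of the data can be cached at any moment, so most random requests have to go to disk, and each 4 MB file is sent in 8 pieces instead of one long blocking call.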
The test server has two Intel Xeon E5645 processors (12 cores and 24 hardware threads in total) and a 10 Gbps network interface. The disk subsystem consists of four Western Digital WD1003FBYX hard drives arranged in a RAID10 array. All of this hardware runs Ubuntu Server 14.04.1 LTS.

Configuring load generator and NGINX for benchmarking

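The clients are two machines with the same specifications. On one of them, wrk generates load using a Lua script. The script requests files from the server in random order over 200 parallel connections, and each request is likely to miss the cache and cause a blocking read from disk. We call this load the random load.

On the second client machine we run another copy of wrk that requests the same file repeatedly over 50 parallel connections. Because this file is accessed frequently, it stays in memory the whole time. Under normal circumstances NGINX would serve these requests very quickly, but performance drops if the worker processes are blocked by other requests. We call this load the constant load.

Performance is measured by the server's throughput, as reported by ifstat, and by the wrk results from the second client.

The first run, without thread pools, does not give us very exciting results:

```
% ifstat -bi eth2
eth2
Kbps in  Kbps out
5531.24  1.03e+06
4855.23  812922.7
5994.66  1.07e+06
5476.27  981529.3
6353.62  1.12e+06
5166.17  892770.3
5522.81  978540.8
6208.10  985466.7
6370.79  1.12e+06
6123.33  1.07e+06
```

As shown above, with this configuration the server produces about 1 Gbps of traffic in total. In the top output below, we can see that the worker processes spend most of their time in blocking I/O (they are in the D state):

```
top - 10:40:47 up 11 days,  1:32,  1 user,  load average: 49.61, 45.77 62.89
Tasks: 375 total,  2 running, 373 sleeping,  0 stopped,  0 zombie
%Cpu(s):  0.0 us,  0.3 sy,  0.0 ni, 67.7 id, 31.9 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  49453440 total, 49149308 used,   304132 free,    98780 buffers
KiB Swap: 10474236 total,    20124 used, 10454112 free, 46903412 cached Mem

  PID USER     PR  NI    VIRT    RES     SHR S  %CPU %MEM    TIME+ COMMAND
 4639 vbart    20   0   47180  28152     496 D   0.7  0.1  0:00.17 nginx
 4632 vbart    20   0   47180  28196     536 D   0.3  0.1  0:00.11 nginx
 4633 vbart    20   0   47180  28324     540 D   0.3  0.1  0:00.11 nginx
 4635 vbart    20   0   47180  28136     480 D   0.3  0.1  0:00.12 nginx
 4636 vbart    20   0   47180  28208     536 D   0.3  0.1  0:00.14 nginx
 4637 vbart    20   0   47180  28208     536 D   0.3  0.1  0:00.10 nginx
 4638 vbart    20   0   47180  28204     536 D   0.3  0.1  0:00.12 nginx
 4640 vbart    20   0   47180  28324     540 D   0.3  0.1  0:00.13 nginx
 4641 vbart    20   0   47180  28324     540 D   0.3  0.1  0:00.13 nginx
 4642 vbart    20   0   47180  28208     536 D   0.3  0.1  0:00.11 nginx
 4643 vbart    20   0   47180  28276     536 D   0.3  0.1  0:00.29 nginx
 4644 vbart    20   0   47180  28204     536 D   0.3  0.1  0:00.11 nginx
 4645 vbart    20   0   47180  28204     536 D   0.3  0.1  0:00.17 nginx
 4646 vbart    20   0   47180  28204     536 D   0.3  0.1  0:00.12 nginx
 4647 vbart    20   0   47180  28208     532 D   0.3  0.1  0:00.17 nginx
 4631 vbart    20   0   47180    756     252 S   0.0  0.1  0:00.00 nginx
 4634 vbart    20   0   47180  28208     536 D   0.0  0.1  0:00.11 nginx
 4648 vbart    20   0   25232   1956    1160 R   0.0  0.0  0:00.08 top
25921 vbart    20   0  121956   2232    1056 S   0.0  0.0  0:01.97 sshd
25923 vbart    20   0   40304   4160    2208 S   0.0  0.0  0:00.53 zsh
```

In this case throughput is limited by the disk subsystem, while the CPU is idle most of the time. The results from wrk are also very low:

```
Running 1m test @ http://192.0.2.1:8000/1/1/1
  12 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.42s     5.31s   24.41s    74.73%
    Req/Sec     0.15      0.36     1.00     84.62%
  488 requests in 1.01m, 2.01GB read
Requests/sec:      8.08
Transfer/sec:     34.07MB
```

Remember, this file is served from memory! The random load created by the 200 connections from the first client keeps all the worker processes busy reading files from disk, which causes excessive latency and prevents our requests from being handled in a reasonable time.

Now it is time to bring in the thread pool. All we need to do is add the aio threads directive to the location block:

```
location / {
    root /storage;
    aio threads;
}
```

Then we ask NGINX to reload its configuration and repeat the test:

```
% ifstat -bi eth2
eth2
Kbps in   Kbps out
60915.19  9.51e+06
59978.89  9.51e+06
60122.38  9.51e+06
61179.06  9.51e+06
61798.40  9.51e+06
57072.97  9.50e+06
56072.61  9.51e+06
61279.63  9.51e+06
61243.54  9.51e+06
59632.50  9.50e+06
```

Now the server produces 9.5 Gbps, compared with about 1 Gbps without thread pools! In theory it could produce even more, but this already reaches the maximum network capacity of the machine, so in this test NGINX is limited by the network interface. The worker processes spend most of their time sleeping and waiting for new events (they are in the S state in top):

```
top - 10:43:17 up 11 days,  1:35,  1 user,  load average: 172.71, 93.84, 77.90
Tasks: 376 total,  1 running, 375 sleeping,  0 stopped,  0 zombie
%Cpu(s):  0.2 us,  1.2 sy,  0.0 ni, 34.8 id, 61.5 wa,  0.0 hi,  2.3 si,  0.0 st
KiB Mem:  49453440 total, 49096836 used,   356604 free,    97236 buffers
KiB Swap: 10474236 total,    22860 used, 10451376 free, 46836580 cached Mem

  PID USER     PR  NI    VIRT    RES     SHR S  %CPU %MEM    TIME+ COMMAND
 4654 vbart    20   0  309708  28844     596 S   9.0  0.1  0:08.65 nginx
 4660 vbart    20   0  309748  28920     596 S   6.6  0.1  0:14.82 nginx
 4658 vbart    20   0  309452  28424     520 S   4.3  0.1  0:01.40 nginx
 4663 vbart    20   0  309452  28476     572 S   4.3  0.1  0:01.32 nginx
 4667 vbart    20   0  309584  28712     588 S   3.7  0.1  0:05.19 nginx
 4656 vbart    20   0  309452  28476     572 S   3.3  0.1  0:01.84 nginx
 4664 vbart    20   0  309452  28428     524 S   3.3  0.1  0:01.29 nginx
 4652 vbart    20   0  309452  28476     572 S   3.0  0.1  0:01.46 nginx
 4662 vbart    20   0  309552  28700     596 S   2.7  0.1  0:05.92 nginx
 4661 vbart    20   0  309464  28636     596 S   2.3  0.1  0:01.59 nginx
 4653 vbart    20   0  309452  28476     572 S   1.7  0.1  0:01.70 nginx
 4666 vbart    20   0  309452  28428     524 S   1.3  0.1  0:01.63 nginx
 4657 vbart    20   0  309584  28696     592 S   1.0  0.1  0:00.64 nginx
 4655 vbart    20   0  30958   28476     572 S   0.7  0.1  0:02.81 nginx
 4659 vbart    20   0  309452  28468     564 S   0.3  0.1  0:01.20 nginx
 4665 vbart    20   0  309452  28476     572 S   0.3  0.1  0:00.71 nginx
 5180 vbart    20   0   25232   1952    1156 R   0.0  0.0  0:00.45 top
 4651 vbart    20   0   20032    752     252 S   0.0  0.0  0:00.00 nginx
25921 vbart    20   0  121956   2176    1000 S   0.0  0.0  0:01.98 sshd
25923 vbart    20   0   40304   3840    2208 S   0.0  0.0  0:00.54 zsh
```

As shown above, plenty of CPU resources are still left. The wrk results:

```
Running 1m test @ http://192.0.2.1:8000/1/1/1
  12 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   226.32ms  392.76ms   1.72s    93.48%
    Req/Sec    20.02     10.84    59.00     65.91%
  15045 requests in 1.00m, 58.86GB read
Requests/sec:    250.57
Transfer/sec:      0.98GB
```

The average time to serve a 4 MB file dropped from 7.42 seconds to 226.32 milliseconds (about 33 times less), and the number of requests per second increased 31 times (250 vs 8)!

The explanation is that requests no longer sit in the event queue waiting while worker processes are blocked on reading; they are handled by worker processes that are now free. While the disk subsystem does its best to serve the random load from the first client, NGINX uses the remaining CPU resources and network capacity to serve the requests of the second client from memory.

5. Still Not a Silver Bullet

After all our fears about blocking operations and some exciting results, most of you are probably about to configure thread pools on your servers. Don't hurry.

The truth is that, fortunately, most read and send file operations do not deal with slow hard drives. If there is enough RAM to hold the data set, the operating system is clever enough to cache frequently used files in the so-called page cache.

The page cache works well and allows NGINX to demonstrate great performance in almost all common use cases. Reading from the page cache is quick, and no one would call such an operation blocking. On the other hand, offloading a task to a thread pool has some overhead. So if you have a reasonable amount of RAM and your working data set is not very large, NGINX already works in the most optimal way without thread pools.

Offloading read operations to a thread pool is a technique for very specific tasks. It is most useful when the volume of frequently requested content does not fit into the operating system's page cache. This may be the case, for example, with a heavily loaded NGINX-based streaming media server. That is exactly the situation we simulated in our benchmark.

It would be great if we could improve the offloading of read operations to thread pools. All we need is a way to know whether the required file data is in memory; only if it is not should the read be offloaded to a separate thread.

Going back to the sales analogy: at the moment the salesperson cannot know whether the requested item is in the store, so he must either always pass all orders to the delivery service or always handle them himself.

The problem is that operating systems lack this feature. The first attempt to add it to Linux, as the fincore() system call, was made in 2010, but it did not succeed. Later there were several attempts to implement it as the preadv2() system call with the RWF_NONBLOCK flag (see "Non-blocking buffered file read operations" and "Asynchronous buffered read operations" at LWN.net for details). The fate of all these patches is still unclear. Sadly, the main reason they have not yet been accepted into the kernel seems to be continuous bikeshedding.

FreeBSD users, on the other hand, don't need to worry at all. FreeBSD already has a sufficiently good asynchronous interface for reading files, and you should use it instead of thread pools.

6. Configuring Thread Pools

So if you are sure that thread pools can benefit your use case, it is time to dive into the configuration.

The configuration is simple and flexible. First, get the source code of NGINX 1.7.11 or later and build it with the --with-threads configure argument. In the simplest case, the configuration looks very plain. All you need is to include the aio threads directive in the http, server, or location context:

```
aio threads;
```

This is the minimal thread pool configuration. In fact, it is a short version of the following:

```
thread_pool default threads=32 max_queue=65536;
aio threads=default;
```

It defines a thread pool named default with 32 threads and a task queue that holds at most 65536 tasks. If the task queue is overloaded, NGINX rejects the request and logs this error:

```
thread pool "NAME" queue overflow: N tasks waiting
```

The error means that the threads may be handling work more slowly than it is added to the queue. You can try increasing the maximum queue size, but if that does not help, your system is not capable of serving that many requests.

As you have already noticed, the thread_pool directive lets you configure the number of threads, the maximum queue length, and the name of the thread pool. The last of these means that you can configure several independent thread pools and use them in different parts of the configuration for different purposes:

```
# thread_pool is valid only in the main context
thread_pool one threads=128 max_queue=0;
thread_pool two threads=32;

http {
    server {
        location /one {
            aio threads=one;
        }

        location /two {
            aio threads=two;
        }
    }
    # …
}
```

If the max_queue parameter is not specified, the default value of 65536 is used. As shown above, max_queue can be set to 0. In that case the thread pool can handle only as many tasks at once as there are threads configured; no tasks wait in the queue.

Now imagine a server with three hard drives that we want to use as a caching proxy that caches all responses from the backend servers. The expected amount of cached data far exceeds the available RAM. It is effectively a caching node for our personal CDN. In this case the most important thing, of course, is to get the maximum performance out of the drives.

One option is to configure a RAID array. This approach has its pros and cons. Now, with NGINX, we have another option:

```
# We assume that each hard drive is mounted on one of these directories:
# /mnt/disk1, /mnt/disk2, /mnt/disk3

proxy_cache_path /mnt/disk1 levels=1:2 keys_z                  use_temp_path=off;
proxy_cache_path /mnt/disk2 levels=1:2 keys_z                  use_temp_path=off;
proxy_cache_path /mnt/disk3 levels=1:2 keys_z                  use_temp_path=off;

# thread_pool is valid only in the main context
thread_pool pool_1 threads=16;
thread_pool pool_2 threads=16;
thread_pool pool_3 threads=16;

split_clients $request_uri $disk {
    33.3%     1;
    33.3%     2;
    *         3;
}

location / {
    proxy_pass http://backend;
    proxy_cache_key $request_uri;
    proxy_cache cache_$disk;
    aio threads=pool_$disk;
    sendfile on;
}
```

This configuration uses three independent caches, each dedicated to one disk, and three independent thread pools, also each dedicated to one disk.

The split_clients module balances the load between the caches (and, as a result, between the disks), a task it suits very well.

The use_temp_path=off parameter of the proxy_cache_path directive tells NGINX to save temporary files in the same directory as the cache data. This avoids copying response data between disks when the cache is updated.

Together, these settings get the maximum performance out of the disk subsystem, because NGINX interacts with each disk in parallel and independently through a separate thread pool. Each disk is served by 16 independent threads with a dedicated task queue for reading and sending files.

I bet your clients like this custom-tailored approach. Make sure your hard drives like it too.

This example is a good demonstration of how flexibly NGINX can be tuned for specific hardware. It is like giving NGINX instructions on the best way to interact with the machine and the data set. Moreover, by fine-tuning NGINX in user space, we can ensure that the software, operating system, and hardware work together in the most optimal mode and use system resources as efficiently as possible.

7. Summary

To sum up, thread pools are a great feature that pushes NGINX to a new level of performance by eliminating one of its well-known, long-standing enemies, blocking, especially when we are dealing with really large volumes of content. And there is more to come.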
As mentioned earlier, this new interface potentially allows any long, blocking operation to be offloaded without a loss of performance. It opens up new horizons for NGINX in terms of a large number of new modules and features. Many popular libraries still do not provide an asynchronous, non-blocking interface, which previously made them incompatible with NGINX. We could spend a lot of time and resources developing our own non-blocking prototype of such a library, but would it always be worth the effort? Now, with thread pools, such libraries can be used relatively easily, and modules built on them need not hurt performance.
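The same pattern exists outside NGINX, and a small Python sketch may make the last point concrete. It is an analogy only, not NGINX module code: blocking_lookup is a hypothetical stand-in for a library call that has no asynchronous interface, and the heartbeat task represents the other events an event loop must keep serving.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_lookup(key):
    """Hypothetical library call that offers no asynchronous API."""
    time.sleep(0.1)
    return key.upper()

async def main():
    loop = asyncio.get_running_loop()
    ticks = 0

    async def heartbeat():
        # Stands for other events the loop keeps serving in the meantime.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The blocking call runs in a pool thread, so the loop is not stalled.
        result = await loop.run_in_executor(pool, blocking_lookup, "nginx")
    beat.cancel()
    return result, ticks

result, ticks = asyncio.run(main())
print(result, ticks)
```

The heartbeat keeps ticking while the blocking call is in progress, which is the behavior thread pools make possible for modules built on blocking libraries.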




