How to solve the problem of 499 and failover mechanism failure caused by improper nginx configuration

PHPz
2023-06-02 19:54:24

    The meaning and possible causes of 499

    499 is not a standard HTTP status code but a custom nginx status code, and I could not find a clear explanation of it in the official nginx documentation. Here is a more professional-sounding explanation from a blog post:

    HTTP error 499 simply means that the client shut off in the middle of processing the request through the server. The 499 error code puts better light that something happened with the client, that is why the request cannot be done. So don't fret: HTTP response code 499 is not your fault at all.

    The gist is that 499 generally means the client actively ended the exchange, by closing the underlying network connection, while the HTTP request was still being processed. In other words, 499 generally indicates that something happened on the client side and has nothing to do with the server.
    The following is the comment in the nginx source code:

    /*
    * HTTP does not define the code for the case when a client closed
    * the connection while we are processing its request so we introduce
    * own code to log such situation when a client has closed the connection
    * before we even try to send the HTTP header to it
    */
    #define NGX_HTTP_CLIENT_CLOSED_REQUEST     499

    That is, nginx introduced the custom code 499 to log the case where the client closes the connection before nginx has finished processing its request.
    When I first ran into 499 many years ago, I found similar answers while searching online, so I always assumed that 499 had little to do with the server and was entirely caused by the client.

    An example of active client behavior causing 499

    I once worked with a search-suggestion API whose 499 ratio was dozens of times higher than that of any other API, far ahead of the pack. Looked at on its own, this API had been above the alarm threshold for a long time, so we tracked down the specific cause. Working with our client-side colleagues, we concluded that a high 499 ratio is normal for the search-suggestion API because:

    • This API is called while the user types in the search box: each time the user enters a character, the API is immediately called with the latest input and the returned suggestions are shown to the user, giving a near-real-time search-suggestion feature.

    • Since every new character triggers a fresh API call, the client should simply abort any earlier calls that are still in flight, because their results no longer matter. Those aborted requests show up in the nginx log as 499, meaning the client actively disconnected.

    So although the search-suggestion API has a much higher 499 ratio than ordinary APIs, this is entirely reasonable. The client is responsible for the disconnects but is doing nothing wrong, and there is no problem on the server side.

    An example of passive client behavior causing 499

    Another case where I previously attributed 499 to client behavior is the push-notification peak. Some users open the app from a push notification and then kill it almost immediately. During a push peak the server is usually under heavier load and responds more slowly than off-peak, so some API requests may still be in progress when the user kills the app. The app has no say in the matter, and the OS tears down and reclaims its connections, which also results in 499. In this scenario there is no problem on the server side either.

    Can server-side problems cause 499?

    From the two examples above, 499 looks at first glance like something caused by the client, whether actively or passively. These two examples are exactly what reinforced my belief that the server bears no responsibility for 499.
    To summarize, the nginx error codes that server-side problems can produce fall mainly into the following scenarios:

    • 500: Internal error. Usually the request parameters cause the upstream worker to hit an error while executing code, and the business code or framework returns Internal Error directly.

    • 502: Usually the upstream server is down and cannot be connected to. nginx cannot reach the upstream, so it returns Bad Gateway.

    • 503: The upstream load is too high, but the server is not down, and it returns Service Unavailable directly.

    • 504: The upstream takes too long to process the request, and nginx times out while waiting for the upstream response, so it returns Gateway Timeout.

    So whether the code fails, the service is down, the service is too busy, or the request takes too long, a failed HTTP request should come back as 5XX and never trigger 499.
    In general this is indeed the case, but the 499s that recently started appearing during off-peak hours are not the general case. While searching online, I found people suggesting that nginx 499 may be caused by the server taking so long that the client times out and actively disconnects. But according to the list above, shouldn't that fall under scenario 4, where the upstream takes too long and nginx returns 504?
    So slow server-side processing can lead either to the client actively disconnecting (499) or to nginx returning Gateway Timeout (504). What is the key factor that decides between the two?
    Put simply, if the client disconnects first and nginx detects it, the result is 499; if the upstream takes too long and nginx hits its own timeout first, the result is 504. The key is therefore nginx's timeout setting for the upstream. A quick look at our nginx timeout configuration showed that the relevant timeouts had never been set explicitly.

    Timeout settings in nginx that determine 504

    Since the API and nginx communicate over the uwsgi protocol, the key timeout parameters are as follows:

    Syntax:   uwsgi_connect_timeout time;
    Default:  uwsgi_connect_timeout 60s;
    Context:  http, server, location
    Defines a timeout for establishing a connection with a uwsgi server. It should be noted that this timeout cannot usually exceed 75 seconds.

    Syntax:   uwsgi_send_timeout time;
    Default:  uwsgi_send_timeout 60s;
    Context:  http, server, location
    Sets a timeout for transmitting a request to the uwsgi server. The timeout is set only between two successive write operations, not for the transmission of the whole request. If the uwsgi server does not receive anything within this time, the connection is closed.

    Syntax:   uwsgi_read_timeout time;
    Default:  uwsgi_read_timeout 60s;
    Context:  http, server, location
    Defines a timeout for reading a response from the uwsgi server. The timeout is set only between two successive read operations, not for the transmission of the whole response. If the uwsgi server does not transmit anything within this time, the connection is closed.
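
    To show where these three directives sit, here is a minimal sketch of a uwsgi-backed location with the default values written out explicitly. The upstream name api_backend, the addresses, and the port are placeholders chosen for illustration, not values from the setup described in this article.

    ```nginx
    # Goes inside the http {} block.
    # Hypothetical upstream group; name and addresses are placeholders.
    upstream api_backend {
        server 10.0.0.11:9000;
        server 10.0.0.12:9000;
    }

    server {
        listen 80;

        location / {
            include    uwsgi_params;
            uwsgi_pass api_backend;

            # These are the values nginx applies when the directives are omitted.
            uwsgi_connect_timeout 60s;  # establishing the connection to the uwsgi server
            uwsgi_send_timeout    60s;  # gap between two successive writes of the request
            uwsgi_read_timeout    60s;  # gap between two successive reads of the response
        }
    }
    ```

    Writing the defaults out changes nothing in behavior; it only makes visible the thresholds that nginx would otherwise apply silently.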

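    When not set explicitly, all of these timeouts default to 60s. Put simply (the real behavior is somewhat more complex, but that is not explored further here), nginx can only judge a request as Gateway Timeout and handle it as 504 once the upstream has taken more than 60s to process it. The HTTP request timeout configured on the client, however, is only 15s, and that includes the time spent on data transfer over the public network. Here lies the problem: for every request that the server takes more than 15s to process, nginx has not yet reached its 60s threshold and will not report 504, while the client exceeds its local 15s timeout and drops the connection, so nginx logs it as 499.
    Checking back through the nginx log confirmed that, during the off-peak 499 alarm periods, a single upstream was indeed processing requests slowly and taking too long, which can lead to:

    • The user, waiting on a page that blocks on the request, loses patience before 15s have passed and ends the current request by switching pages or by killing and restarting the app.

    • The user waits patiently for the full 15s, or a non-blocking background HTTP request runs past 15s, so the client hits its timeout threshold and actively disconnects, ending the current request.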
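
    This kind of log check is easier when the access log records timing and upstream information on every line. The following is a sketch of one possible log format, not the format actually used in this setup; the format name upstream_timing and the log path are placeholders, while the variables are standard nginx ones.

    ```nginx
    # Goes inside the http {} block.
    # "upstream_timing" and the log path are placeholder names.
    log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                               'rt=$request_time '
                               'ua=$upstream_addr us=$upstream_status '
                               'urt=$upstream_response_time';

    access_log /var/log/nginx/access.log upstream_timing;
    ```

    With such a format, 499 lines whose rt value clusters around the client's 15s timeout would point to client-side timeouts against a slow upstream, while 499 lines with much smaller rt values would more likely be users abandoning the request. The ua field shows which upstream server handled each request.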

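    499 caused by the server taking too long

    We now know that the recent occasional 499s on a single upstream are caused by slow responses. Since they arise because the client timeout (15s) is far smaller than the nginx upstream timeout (60s), this is a clear misconfiguration, and it leads to three obvious problems:

    • 499s caused by users quickly and actively disconnecting for whatever reason (such as killing the app) are mixed together with 499s caused by the client reaching its timeout (15s here), so 499s that are the client's responsibility cannot be told apart from those that are the server's.

    • A request that nginx classifies as 499 is treated as a client-initiated disconnect, so it is not regarded as an unsuccessful attempt caused by the server and is not counted toward max_fails, which drives the failover decision. Even if an upstream triggers a large number of 499s, nginx will not take it out of the pool of available upstreams. The removal of unavailable nodes is effectively disabled, and an upstream that is producing 499s because of excessive load keeps receiving more and more requests, which may eventually cause a bigger problem.

    • Likewise, because a request classified as 499 is not regarded as an unsuccessful attempt, the uwsgi_next_upstream setting does not take effect either. When the first upstream to handle a request takes too long, nginx does not try to pass the request to the next upstream and return its response; the request simply fails.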
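
    For reference, the failover counters mentioned above are parameters of the server directive inside an upstream block. The sketch below writes out the nginx defaults; the upstream name and addresses are placeholders, not taken from the setup described here.

    ```nginx
    # Goes inside the http {} block.
    # Hypothetical upstream group; name and addresses are placeholders.
    upstream api_backend {
        # max_fails=1 and fail_timeout=10s are the nginx defaults, written out here.
        # A server is considered unavailable for fail_timeout once max_fails
        # unsuccessful attempts occur within fail_timeout.
        server 10.0.0.11:9000 max_fails=1 fail_timeout=10s;
        server 10.0.0.12:9000 max_fails=1 fail_timeout=10s;
    }
    ```

    An upstream timeout counts as an unsuccessful attempt, but with a 60s upstream timeout and a 15s client timeout that condition is never reached: the client has already gone, the request is logged as 499, and the counter stays untouched.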

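    So should we increase the client timeout, or reduce the nginx upstream timeout?
    Increasing the client timeout is clearly unreasonable: something is certainly wrong with any user request that has received no response after 15s. The correct approach is therefore to reduce the upstream timeout. Generally the server should process a client request in tens to hundreds of milliseconds, and anything over 1s already counts as an extremely long request, so not only is the default 60s unsuitable, but the client's 15s cannot serve as the upstream timeout either.
    After weighing the processing times of the various server-side APIs, we settled on an upstream timeout of 5s to begin with. Having no prior experience, we did not want to take too big a step with the first change, and we will keep adjusting after observing for a while. This is already enough to largely solve the three problems above:

    • 499s caused by users quickly and actively disconnecting for whatever reason (such as killing the app) are now distinguished from 504s, which nginx issues itself when the upstream timeout is reached.

    • 504s count toward max_fails and trigger nginx's logic for removing failed nodes. When a single machine is faulty and responding slowly, it can be identified and temporarily taken out of the list of available nodes, which keeps its load from growing further and ensures that subsequent requests are handled by healthy nodes.

    • When nginx has waited 5s for an upstream and the timeout fires, it follows the uwsgi_next_upstream setting and tries to pass the request (by default only idempotent requests such as GET) to the next upstream and return its response. When a single upstream times out because of abnormally high load, the other healthy upstreams can thus act as a backup for its timed-out requests. A request that would previously have left the client waiting until its 15s timeout can now generally be answered by the fallback within 5~10s.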
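
    A configuration along these lines might look like the sketch below. The article only states that the upstream timeout was set to 5s; applying 5s to all three timeout directives, the uwsgi_next_upstream_tries limit, and the upstream name and addresses are assumptions made for illustration.

    ```nginx
    # Goes inside the http {} block.
    # Hypothetical upstream group; name and addresses are placeholders.
    upstream api_backend {
        server 10.0.0.11:9000 max_fails=1 fail_timeout=10s;
        server 10.0.0.12:9000 max_fails=1 fail_timeout=10s;
    }

    server {
        listen 80;

        location / {
            include    uwsgi_params;
            uwsgi_pass api_backend;

            # Assumption: the 5s value is applied to all three timeouts.
            # It stays well below the client's 15s timeout.
            uwsgi_connect_timeout 5s;
            uwsgi_send_timeout    5s;
            uwsgi_read_timeout    5s;

            # "error timeout" is the default, written out for clarity.
            # Non-idempotent requests such as POST are not passed on
            # unless "non_idempotent" is added.
            uwsgi_next_upstream error timeout;

            # Assumption: cap at two tries in total (first attempt plus one retry),
            # so the worst case stays around 10s.
            uwsgi_next_upstream_tries 2;
        }
    }
    ```

    With two tries of at most 5s each, a retried request finishes or fails before the client's 15s timeout, which is consistent with the 5~10s fallback time mentioned above.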

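    Solving the 499 problem with proxy_ignore_client_abort?

    While searching online, I also found the suggestion that nginx 499 can be eliminated by setting the proxy_ignore_client_abort parameter. It defaults to off; when it is set to on, nginx ignores the client's disconnect and uses the status actually returned by the upstream. The official nginx documentation describes it as follows: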

    Syntax:   proxy_ignore_client_abort on | off;
    Default:  proxy_ignore_client_abort off;
    Context:  http, server, location
    Determines whether the connection with a proxied server should be closed when a client closes the connection without waiting for a response.

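    When the client actively disconnects, however, all this setting does is make the status code recorded in the nginx log follow whatever the upstream returned, rather than the 499 that indicates a client disconnect. It does nothing to solve the actual problem and feels rather like an ostrich burying its head in the sand. I do not know in what scenario this setting would actually be useful.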
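
    Note that proxy_ignore_client_abort belongs to the proxy module and applies to proxy_pass. For a backend reached over the uwsgi protocol, the uwsgi module provides the equivalent uwsgi_ignore_client_abort directive. A minimal sketch of how it would be set, with api_backend standing in for a hypothetical upstream group defined elsewhere:

    ```nginx
    # Goes inside a server {} block.
    location / {
        include    uwsgi_params;
        uwsgi_pass api_backend;   # placeholder upstream name, defined elsewhere

        # Default is off. With on, nginx keeps the upstream connection open
        # after the client has gone and logs the upstream's status.
        uwsgi_ignore_client_abort on;
    }
    ```

    This is shown only to make the suggestion concrete; as discussed above, it changes what is logged rather than how slow upstreams are handled.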

    Why a single upstream occasionally responds slowly and times out during off-peak periods

    This is a good question. The problem only appeared recently, and I started investigating it after fixing the nginx misconfiguration described above. Judging from the symptoms, certain specific requests seem to trigger CPU spikes on an upstream, and the slow responses then affect the processing of subsequent requests, until eventually all requests respond slowly and trigger 499 on the client.
    With the nginx misconfiguration fixed, if a single upstream becomes slow and times out again, nginx quickly removes the faulty upstream through failover so that the situation does not get worse, and GET requests that timed out on their first attempt against the faulty upstream are passed to other available upstreams as a backup and then returned. This has greatly reduced the impact of such incidents.
    Finally, with the corrected configuration, the occasional fault on a single upstream now trips the 504 threshold alarm for a few POST APIs, with a small number of 504s once every few days. The root cause is still being investigated.

    Statement:
    This article is reproduced from yisu.com. In case of infringement, please contact admin@php.cn for removal.