Alarm troubleshooting guide

Overview

The WeChat public platform exposes its interfaces to developers and raises an alarm when the number of failed message pushes from the WeChat server to a developer reaches a predetermined threshold. The alarm message is sent to the designated WeChat alarm group (to set it up: Public Platform -> Development -> Operation and Maintenance Center -> Interface Alarm). Developers should watch for these alarms, fix faults promptly, and improve the service quality of their public accounts.

To troubleshoot from the samples at the end of each alarm (which provide the openid and timestamp), developers need to write detailed logs with key information at every layer, such as the access layer and the logic layer, so that problems can be located quickly.
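One possible way to make alarm samples traceable is to log the same few fields at every layer. The sketch below assumes the pushed body is the usual WeChat XML, with the sender's openid in FromUserName and the timestamp in CreateTime. The names describe_push and log_push and the log line layout are illustrative only and are not part of any WeChat SDK.

```python
import logging
import time
import xml.etree.ElementTree as ET

logger = logging.getLogger("wechat.push")


def describe_push(body: str) -> dict:
    """Extract the fields an alarm sample refers to from a pushed XML body."""
    root = ET.fromstring(body)
    return {
        "openid": root.findtext("FromUserName", default=""),
        "create_time": root.findtext("CreateTime", default=""),
        "msg_type": root.findtext("MsgType", default=""),
        "event": root.findtext("Event", default=""),
    }


def log_push(layer: str, body: str, started: float) -> str:
    """Log one line per request per layer so a request can be traced across layers."""
    info = describe_push(body)
    elapsed_ms = (time.monotonic() - started) * 1000
    line = (
        f"layer={layer} openid={info['openid']} create_time={info['create_time']} "
        f"msg_type={info['msg_type']} event={info['event']} elapsed_ms={elapsed_ms:.1f}"
    )
    logger.info(line)
    return line
```

Calling log_push("access", body, started) in the access layer and log_push("logic", body, started) in the logic layer gives two lines that can be matched on openid and create_time when an alarm arrives.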

There are currently 2 types of alarms:

1. General alarms, which all developers need to pay attention to.

2. Public account third-party platform alarms. You only need to pay attention to these alarms if you have applied on the WeChat Open Platform (open.weixin.qq.com) to become a developer of a public account third-party platform.

The following are examples of specific alarms and troubleshooting guidelines.

Alarm content description

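Each alarm contains the following fields:

a) appid: the appid of the public account
b) Nickname (昵称): the nickname of the public account
c) Time (时间): every alarm gives the time the anomaly first occurred (for example, the time of the first timeout or the first failed response)
d) Content (内容): a description of the error
e) Count (次数): the number of failures
f) Error sample (错误样例): information that helps locate the problem, such as the developer IP and the pushed message type at the first timeout. For a response failure, the sample also includes what the developer returned the first time the response failed.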

Under normal circumstances, the IP, time, and message type provided in the alarm are enough to quickly locate the cause of the problem on the developer's side.

Alarm example 1: Timeout alarm

Appid: wxxxxxx
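Nickname: WxNickName
Time: 2014-12-01 20:12:00
Content: After the WeChat server pushed a message or event to the public account, the developer did not return within 5 seconds
Count: 1272 times in 5 minutes
Error sample: [IP=203.205.140.29][Event=UnSubscribe]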

This alarm means: after the WeChat server pushed unfollow events to the developer, the developer did not return a result within 5 seconds. This happened 1272 times in the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00. The first timeout in that window occurred at 2014-12-01 20:12:00, the developer's IP was 203.205.140.29, and the event type was an unfollow event.

Alarm example 2: Response failure

Appid: wxxxx
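Nickname: WxNickName
Time: 2014-12-01 20:12:00
Content: After the WeChat server pushed a message or event to the public account, the response it received was invalid
Count: 1320 times in 5 minutes
Error sample: [Event=Click] [ip=58.248.9.218][response_length=10][response_content=Error 500:]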

This alarm means: when the WeChat server pushed custom menu click events to the developer, the result the developer returned was invalid. This happened 1320 times in the 5 minutes from 2014-12-01 20:12:00 to 2014-12-01 20:17:00. The first failed response in that window occurred at 2014-12-01 20:12:00, the developer's IP was 58.248.9.218, the event type was a menu click event, and the content returned by the developer was 10 bytes long: "Error 500:".

Alarm example 3: Connection timeout

Appid: wxxxx
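Nickname: WxNickName
Time: 2015-02-04 20:13:09
Content: The connection from the WeChat server to the public account developer's server timed out; the timeout is 5 seconds
Count: 7289 times in 5 minutes
Error sample: [IP=180.150.190.135][Msg=Text]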

This alarm means: when the WeChat server pushed text messages from followers to the developer, it could not connect to the server address filled in by the developer. This happened 7289 times in the 5 minutes from 2015-02-04 20:13:09 to 2015-02-04 20:18:00. The first connection timeout in that window occurred at 2015-02-04 20:13:09, the developer's IP was 180.150.190.135, and the message type was a text message sent by a user.
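The bracketed error samples in the three examples share one shape, so they can be split into fields mechanically before searching logs. The following is a minimal sketch that assumes every field is written as [key=value] and that values never contain a closing bracket. The function name parse_error_sample is hypothetical, and keys are lowercased because the examples mix IP and ip.

```python
import re

# One bracketed field: [key=value]; the value runs up to the next closing bracket.
_FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_error_sample(sample: str) -> dict:
    """Split an alarm error sample into a dict with lowercase keys."""
    return {key.strip().lower(): value for key, value in _FIELD.findall(sample)}
```

For the response failure example, the parsed dict gives the IP to look up in the access log and the returned content ("Error 500:") to search for in the application log.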

Troubleshooting methods for various alarms

1. DNS failure

This error means that DNS resolution failed when the WeChat server pushed a message to the developer. If you receive this alarm, please check:

a) whether the URL or domain name you filled in is wrong;
b) whether the domain name has changed, for example expired or been updated.

If neither of these is the problem, please contact the WeChat public platform.

2. DNS timeout

This error does not currently occur.

3. Connection timeout

This error means that the WeChat server could not establish a connection to the developer server within 3 seconds. The alarm message gives the time of the first connection failure and the IP address it tried to connect to. If you receive this alarm, please check:

a) whether the IP is wrong.
b) whether the machine at that IP is overloaded or has too many connections.
c) if the server is hosted by a third party, whether the hosting provider has an outage.
d) whether the network carrier has an outage.
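A quick way to reproduce a connection timeout from another machine is to attempt a plain TCP connection with the same time limit. The sketch below uses only the standard library. The function name can_connect is illustrative, and the 3-second default mirrors the limit stated in this item rather than any documented client behavior.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is established within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Running this against the IP from the alarm, ideally from a network outside your own data center, helps separate a machine that refuses connections from a network path that drops them.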

4. Request timeout

The WeChat server pushed a message or event to the developer server, but the developer did not return within 5 seconds. The alarm message gives the time of the first request timeout, the developer IP, and the message type. Please check:

a) whether the IP is wrong
b) whether that IP received a request of the message type given in the alarm
c) whether the request took too long to process

5. Response failure

A response failure alarm is raised when the developer does not reply in the reply message format described in the wiki, or when a network error occurs. The alarm message gives the time of the first failed response, the developer's IP, the message type, and the content of the response. Please check:

a) whether the IP is wrong
b) whether a network error occurred on that IP
c) whether the business logic failed to reply in the format the wiki specifies, or entered an error path.
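Response failures like the "Error 500:" sample above often come from an unhandled exception reaching the web server's error page. A minimal sketch of one defensive pattern follows. It assumes, as the asynchronous processing advice in FAQ 3 of this guide suggests, that the plain string success is an acceptable acknowledgment. The name safe_reply and the handler signature are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("wechat.reply")


def safe_reply(handler: Callable[[str], str], body: str) -> str:
    """Run the business handler; fall back to 'success' if it fails or returns nothing."""
    try:
        reply = handler(body)
    except Exception:
        # Keep the traceback so the error path can be fixed, but do not leak it to WeChat.
        logger.exception("handler failed; replying 'success' instead of an error page")
        return "success"
    return reply if reply else "success"
```

This only hides the failure from the WeChat server; the logged exception still needs to be fixed, since the user receives no reply.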

6. MarkFail (automatic blocking)

The WeChat backend counts each developer's failures in real time. When pushes to a developer fail in large numbers, the WeChat server automatically blocks the developer, stops pushing any messages for 1 minute, and sends an alarm to the WeChat group. This is the highest-level alarm. On receiving it, developers should fix the backend failure and restore service as soon as possible. Before this alarm arrives, developers will already have received connection timeout, request timeout, or response failure alarms. Resolving those immediately avoids being blocked by the WeChat server, which would seriously affect the public account's service.

7. Pushing component_verify_ticket timed out & 8. Pushing component_verify_ticket failed & 9. Pushing component message timed out & 10. Pushing component message failed

Only public account third-party platform developers receive these four alarms; other public account developers do not need to pay attention to them. Because a third-party platform serves many public accounts, its service quality needs stricter requirements and alarms, so these four events are reported separately. They are troubleshot in the same way as items 4 and 5, so the details are not repeated here. For how to apply for and develop a public account third-party platform, see the WeChat Open Platform (open.weixin.qq.com).

FAQ

1. How to troubleshoot DNS failure?

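1. Ping the domain name in the URL configured on the public platform (MP) and confirm that it resolves to the correct IP. If it does not resolve, or resolves to the wrong IP, check the configuration in your domain hosting provider's management system.
2. If step 1 returns the correct IP but you still receive DNS failure alarms, test again against the DNS server 182.254.116.116. On Linux: dig @182.254.116.116 <domain name>. On Windows: change the DNS server address in the network configuration, then ping the domain name. If the IP returned is wrong or nothing is returned, please contact the WeChat team.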
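The first check can also be scripted. The sketch below extracts the host name from a callback URL and resolves it with the system resolver, using only the standard library. It does not query 182.254.116.116 specifically, so it covers step 1 only. The function names and the example URL in the usage note are made up for illustration.

```python
import socket
from urllib.parse import urlsplit


def callback_host(url: str) -> str:
    """Extract the host name from the callback URL configured on the platform."""
    return urlsplit(url).hostname or ""


def resolve(host: str) -> list:
    """Return the sorted IP addresses the system resolver gives for host, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})
```

For example, resolve(callback_host("https://mp.example.com/wechat/callback")) should list the IPs of your access-layer servers. An empty list or unfamiliar addresses point to a problem in the domain configuration.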

2. How to solve the connection timeout problem?

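1. Check whether the network environment is the problem.
   (1) Use the public platform interface to get the IP addresses of the WeChat callback servers: https://api.weixin.qq.com/cgi-bin/getcallbackip?access_token=ACCESS_TOKEN
   (2) Ping those addresses from your servers to check the network quality between your servers and the WeChat callback servers. If there is a network problem, contact your server provider.
2. Check the access-layer servers' connection count and load, and the number of connections the nginx configuration allows. Check whether the nginx error log contains "Connection reset by peer" or "Connection timed out" entries; if it does, nginx has more connections than it can handle.
3. Consider building a test tool that runs heartbeat checks against the system and monitors load, connection count, request count, and processing time in real time, with alarms.
For nginx configuration, the official documentation is at http://nginx.org/en/docs/; pay particular attention to the connection limit and logging settings. Some important nginx settings, for reference:
worker_processes  16;          # number of CPU cores
error_log  logs/error.log  info;   # error log
worker_rlimit_nofile 102400;     # maximum number of open file handles
events {
    worker_connections  102400;   # maximum number of connections allowed
}
# Request log (log_format and access_log belong in the http block).
# Key fields: request_time is the total request time, upstream_response_time is the backend processing time
log_format  main  '$remote_addr  - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for" "$host"  "$cookie_ssl_edition" '
                 '"$upstream_addr"   "$upstream_status"  "$request_time"  '
                 '"$upstream_response_time" ';
   access_log  logs/access.log  main;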
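With the log format above, the last four quoted fields of each access log line are upstream_addr, upstream_status, request_time, and upstream_response_time. The sketch below reads them and flags requests that reached the 5-second limit. It assumes lines were written with exactly this log_format. The helper names are illustrative. Multiple upstream times in one field are summed, which is a simplification.

```python
import re

_QUOTED = re.compile(r'"([^"]*)"')


def _seconds(value: str):
    """Convert an nginx time field to seconds; '-' (no upstream) becomes None."""
    parts = [p for p in re.split(r"[,:]\s*", value.strip()) if p and p != "-"]
    return sum(float(p) for p in parts) if parts else None


def parse_timing(line: str) -> dict:
    """Read the last four quoted fields of a line written with the 'main' log_format."""
    fields = _QUOTED.findall(line)
    upstream_addr, upstream_status, request_time, upstream_time = fields[-4:]
    return {
        "upstream_addr": upstream_addr,
        "upstream_status": upstream_status,
        "request_time": _seconds(request_time),
        "upstream_response_time": _seconds(upstream_time),
    }


def is_slow(line: str, threshold: float = 5.0) -> bool:
    """True if the whole request took at least `threshold` seconds."""
    request_time = parse_timing(line)["request_time"]
    return request_time is not None and request_time >= threshold
```

Comparing request_time with upstream_response_time for the slow lines shows whether the time was spent in the backend or in nginx and the network.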

3. How to solve the request timeout problem?

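Each module needs complete logs that show how long every request spent in that module. Combined with the information in the WeChat alarm, this makes it easy to locate which server has the problem. Common causes are:

1) Machine load is too high, so processing takes longer
2) The machine fails while processing, so messages are lost
3) The machine itself is faulty

For processing failures, fix the bug as soon as possible. For faulty machines, take the affected machine out of service as soon as possible. For high machine load, two workable approaches are outlined here.

Option 1: optimize performance and add capacity. Check the load (CPU, memory, I/O, network; see the appendix) and choose the optimization according to the actual bottleneck.

Option 2: process asynchronously. If messages pushed by the WeChat server cannot be processed in real time, store the message first and return success to the WeChat server right away, then process the message later in the background. If you need to reply to the user, call the customer service message API to send the reply afterwards.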
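The asynchronous approach can be sketched as follows: acknowledge immediately, process later. This is a minimal in-process illustration using a standard library queue. A real deployment would store messages durably, since an in-memory queue loses them if the process exits. The names handle_push and worker are hypothetical.

```python
import logging
import queue
import threading

logger = logging.getLogger("wechat.async")
pending = queue.Queue()


def handle_push(body: str) -> str:
    """Store the pushed message and acknowledge at once, well inside the 5-second limit."""
    pending.put(body)
    return "success"


def worker(process, stop: threading.Event) -> None:
    """Drain the queue in the background; `process` is the slow business logic."""
    while not stop.is_set() or not pending.empty():
        try:
            body = pending.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            process(body)
        except Exception:
            # One bad message must not stop the worker.
            logger.exception("processing failed")
        finally:
            pending.task_done()
```

The process callback is where a reply through the customer service message API would be sent, once the slow work has finished.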

4. How to solve the access_token storage and usage problem?

Developers often report service interruptions caused by access_token. When the public platform investigates, it finds that in most cases the developer is refreshing access_token far too often, which exceeds the interface's frequency limit and leaves the access_token invalid. Here is a simple scheme for storing and using access_token.

1) A central control server calls the WeChat API on a timer (1 hour is suggested) to refresh the access_token, and saves the new access_token to MySQL (or other storage).
2) Every other worker server reads the access_token from MySQL (or other storage) when it calls the WeChat API, and may cache it in memory for a short time (1 minute is suggested).

The public platform guarantees that after the access_token is refreshed, the old access_token remains usable for 5 minutes, so that calls to the WeChat API do not fail while the access_token is being updated.
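The two-role scheme above can be sketched as follows. TokenStore stands in for MySQL or whatever shared storage is used, and fetch_token stands in for the real call to the WeChat API. Both are placeholders, as are the class and function names. The clock parameter exists only so the cache can be exercised without waiting.

```python
import threading
import time
from typing import Callable


class TokenStore:
    """Stands in for MySQL or another shared store written by the central server."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._token = ""

    def save(self, token: str) -> None:
        with self._lock:
            self._token = token

    def load(self) -> str:
        with self._lock:
            return self._token


def refresh_once(fetch_token: Callable[[], str], store: TokenStore) -> None:
    """Central server: call this on a timer (the guide suggests hourly)."""
    store.save(fetch_token())


class WorkerTokenCache:
    """Worker server: read from the store and cache in memory for a short time."""

    def __init__(self, store: TokenStore, ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._store = store
        self._ttl = ttl
        self._clock = clock
        self._token = ""
        self._expires = 0.0

    def get(self) -> str:
        now = self._clock()
        if not self._token or now >= self._expires:
            self._token = self._store.load()
            self._expires = now + self._ttl
        return self._token
```

Because only the central server ever calls the refresh API, the call rate stays fixed no matter how many workers run. A worker's cached token can be at most one TTL stale, which fits within the 5-minute window during which the old token stays valid.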

Appendix

Appendix 1: Message event list and response format pushed by WeChat

For details, please see: WeChat Push Messages and Event Descriptions

Appendix 2: Common Tools for Viewing Server Performance Load

This appendix briefly introduces commonly used tools for checking server performance load. For detailed usage, please consult each tool's own documentation.

1. Check the performance load of the CPU

a)uptime

uptime is used to observe the overall load of the server. The system load is the average length of the run queue over the last 1, 5, and 15 minutes; under normal circumstances it should be less than the number of CPUs.
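The rule of thumb for uptime can be checked programmatically. The sketch below compares the three load averages with the CPU count using the standard library. os.getloadavg() is only available on Unix-like systems, and the function names are illustrative.

```python
import os


def load_is_high(load_averages, cpu_count: int) -> bool:
    """True if any of the 1-, 5- or 15-minute load averages reaches the CPU count."""
    return any(load >= cpu_count for load in load_averages)


def check_local_load() -> bool:
    """Check this machine; os.getloadavg() is available on Unix-like systems only."""
    return load_is_high(os.getloadavg(), os.cpu_count() or 1)
```

A heartbeat or monitoring job could call check_local_load() periodically and raise its own alarm before the WeChat server starts reporting request timeouts.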

b)vmstat

vmstat stands for Virtual Memory Statistics. It monitors the operating system's virtual memory, processes, and CPU activity, and reports on the system as a whole. It is usually run as vmstat 5 5 (sample every 5 seconds, five times), which produces a summary that reflects the real state of the system.

c)top

top is one of the most popular Unix/Linux performance tools. System administrators can run it to monitor processes and overall Linux performance.

2. Check the memory performance load

a)free

The free command under Linux shows current system memory usage: free and used physical memory and swap, as well as shared memory and the buffers used by the kernel.

3. Check the performance load of the network

a)netstat

netstat is a console command and a very useful tool for monitoring TCP/IP networks. It can display routing tables, active network connections, and status information for each network interface, as well as statistics for the IP, TCP, UDP, and ICMP protocols. It is generally used to check the network connections on each port of the machine.

b)sar

sar (System Activity Reporter) is one of the most comprehensive system performance analysis tools on Linux. It reports system activity from many angles, including file reads and writes, system call usage, disk I/O, CPU efficiency, memory usage, process activity, and IPC-related activity.

4. Check the performance load of the disk

a)iostat

The iostat command under Linux reports central processing unit (CPU) statistics and input/output statistics for the whole system, adapters, tty devices, disks, and CD-ROMs.

Appendix 3: nginx configuration and troubleshooting guidelines

Troubleshooting methods for nginx problems

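When a connection timeout or slow response alarm occurs, the following checks on the nginx side can help:

1. Check the request log, tail -f logs/access.log, and look at the upstream_status field.

   200: normal.
   502/503/504: processing is slow, or the backend is down. Then check whether the time in upstream_response_time is really slow, for example hundreds of milliseconds or more; if so, the backend service has a problem.
   404: the requested path does not exist or is wrong, or the file is gone. Check that the URL path configured on the public platform is correct and that the files and programs exist on the server.
   403: access is not permitted. Check whether nginx.conf contains special access rules.
   499: a client-side problem; please contact the WeChat team. This error is rare.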
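When many log lines need to be triaged, the status list above can be turned into a lookup. The sketch below maps a status value from the access log to the corresponding hint and tallies the hints. The names HINTS, diagnose, and summarize are made up for illustration. If the field holds several codes separated by commas or colons, only the last one is used.

```python
import re
from collections import Counter

HINTS = {
    "200": "normal",
    "502": "backend slow or down; check upstream_response_time",
    "503": "backend slow or down; check upstream_response_time",
    "504": "backend slow or down; check upstream_response_time",
    "404": "path missing or wrong; check the configured URL and server files",
    "403": "access denied; check nginx.conf access rules",
    "499": "client-side problem; contact the WeChat team",
}


def diagnose(status: str) -> str:
    """Map a status value from the access log to the hint from the list above."""
    code = re.split(r"[,:]", status)[-1].strip()
    return HINTS.get(code, "not covered by this guide")


def summarize(statuses) -> Counter:
    """Count how often each hint applies across many log lines."""
    return Counter(diagnose(status) for status in statuses)
```

A summary dominated by the backend hint points at the application servers, while a summary dominated by 404 or 403 points at configuration.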

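2. Check the error log, tail -f logs/error.log, for entries such as connect() failed, Connection refused, or Connection reset by peer. If any appear, nginx may be overloaded with connections or in a similar state.

   (1) Check the system's network connection counts to confirm whether the number of connections is large
    # netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' 
    Meaning of each state:
	   CLOSED //no connection is active or pending
	   LISTEN //the server is waiting for an incoming call
	   SYN_RECV //a connection request has arrived; waiting for acknowledgment
	   SYN_SENT //the application has started to open a connection
	   ESTABLISHED //normal data transfer state; the current number of concurrent connections
	   FIN_WAIT1 //the application has said it is finished
	   FIN_WAIT2 //the other side has agreed to release
	   TIME_WAIT //waiting for all packets to die off
	   CLOSING //both sides tried to close at the same time
	   CLOSE_WAIT //the other side has initiated a release
	   LAST_ACK //waiting for the final acknowledgment of the close

   (2) Check the system's file handle limit with ulimit -n and confirm that it is not too small (smaller than the number of requests)
   (3) Check whether worker_rlimit_nofile and worker_connections are set too small (smaller than the number of requests)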