


Dear friends, take my advice: whenever you write a method for others to call, whether the caller is an internal system, an external system, or a passive trigger (such as MQ consumption or a callback), be sure to add the necessary condition checks. Don't believe colleagues who say that a condition will definitely be passed, will definitely have a value, will definitely not be empty, and so on. Just before Chinese New Year I fell for exactly that and caused a production incident, and my year-end bonus was cut roughly in half.
So I decided to trust the code itself rather than people, to keep the system highly available and stable. Here are a few small lessons that may help you too.
1. What happened
My business scenario: when business A changes, it triggers an MQ message; the application consumes the message, processes the data, and writes it to Elasticsearch.
(1) I received an exception alarm from business A. The alarm reported that a Redis connection could not be obtained.
(2) At first glance this seemed strange. Why a Redis exception? I connected to Redis and found no problem, then checked the Redis cluster and everything was normal. So I let it go, assuming it was a transient network problem.
(3) Then, in the technical issues group, customer service reported that some users were seeing errors. I immediately checked the system and confirmed that the problem occurred sporadically.
(4) So, out of habit, I looked at a few core components:
- Gateway status, load status of core business Pods, and load status of user center Pods.
- MySQL status: memory, CPU, slow SQL, deadlocks, number of connections, etc.
(5) I found slow SQL and long metadata lock times. A full table query on a large table returned a huge amount of data and ran slowly, which held the metadata lock for too long and exhausted the database connections.
SELECT xxx,xxx,xxx,xxx FROM a_large_table
(6) I immediately killed several slow sessions, but the system still did not fully recover. Why, when the database was back to normal? I went back to application monitoring and found that 2 of the 10 user center Pods were abnormal, with CPU and memory exhausted. No wonder users saw intermittent errors. I quickly restarted those Pods to restore the application first.
(7) With the immediate problem contained, I went on to investigate why the user center Pods had hung, starting from these suspicions:
- (a) Is something wrong with the code that synchronizes data to Elasticsearch? Why can't it connect to Redis?
- (b) Could there be so many exceptions that the thread pool queue for sending exception alarms filled up and caused an OOM?
- (c) Where could an unconditional full table query on business A's large table be coming from?
(8) I investigated suspicion (a) first. My initial theory was that failing to get a Redis connection pushed exceptions into the thread pool queue, which then overflowed and caused the OOM. I changed the code along those lines, released it, and kept observing, but the same slow SQL and user center blowup still occurred. Because no exceptions appeared this time, suspicion (b) could be ruled out as well.
(9) At this point it was almost certainly suspicion (c): something was calling the full table query on business A's large table. The result set took up too much memory in the user center, the JVM could not reclaim it in time, and the CPU was driven to exhaustion. At the same time, because the full table data was so large, the metadata lock was held too long during the query, connections could not be released in time, and they were eventually almost exhausted.
(10) So I added the necessary condition checks to the query on business A's large table, redeployed, and observed production. That finally pinned down the problem.
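The concern behind suspicions (a) and (b), an alarm thread pool whose queue grows until memory runs out, can be illustrated with a bounded executor. This is only a sketch and not the change the author actually made: the class name `AlarmDispatcher`, the pool and queue sizes, and the drop-and-count policy are all assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical alarm sender: a bounded queue keeps a burst of exceptions
// from piling up in memory; overflow is dropped and counted instead.
final class AlarmDispatcher {
    private final AtomicLong dropped = new AtomicLong();
    private final ThreadPoolExecutor executor;

    AlarmDispatcher(int threads, int queueCapacity) {
        // Invoked by the pool when all workers are busy and the queue is full.
        RejectedExecutionHandler countAndDrop =
                (task, pool) -> dropped.incrementAndGet();
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), countAndDrop);
    }

    // Queues one alarm-sending task, or drops it if the pool is saturated.
    void submit(Runnable sendAlarm) {
        executor.execute(sendAlarm);
    }

    long droppedCount() {
        return dropped.get();
    }

    // Stops accepting tasks and waits briefly for queued alarms to finish.
    void shutdown() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a bound in place, an alarm storm costs some dropped alarms rather than heap. As step (8) shows, though, this was not the root cause here, so a guard like this only removes one suspect.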
2. Cause of the problem
When business table B changes, an MQ message is sent (to synchronize business table A's data to ES). After receiving the message, the consumer queries the related data in business table A and then synchronizes it to Elasticsearch.
But the change to business table B did not pass the conditions required to query business table A, and I did not verify them either, which resulted in a full table scan of business A's large table. The reason:
Some colleague said, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty...", and I actually believed him!!!
Because business table B was changing frequently at the time, more MQ messages were sent and consumed, which triggered more full table scans of business A's large table. That in turn caused more overlong MySQL metadata locks and finally exhausted the database connections.
Meanwhile, every query returned the contents of business A's large table into the user center's memory, triggering JVM garbage collection that could not reclaim it. In the end, memory and CPU were exhausted.
As for the Redis exception about failing to get a connection, it was just a smokescreen: with so many MQ events being sent and consumed, a small number of threads momentarily could not get a Redis connection.
In the end, I added condition checks to the code that consumes MQ events, added the necessary condition checks to the query on business table A, redeployed, and the problem was solved.
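The consumer-side check might look something like the following. It is a minimal sketch under stated assumptions: the article does not name the required condition, so the field `businessAId` and the classes `SyncMessage`, `SyncMessageValidator`, and `SyncConsumer` are hypothetical, and the MQ client, the query, and the Elasticsearch write are left as comments.

```java
import java.util.Optional;

// Hypothetical MQ payload. The article does not name the required field,
// so "businessAId" is an assumption made for illustration.
final class SyncMessage {
    final Long businessAId;

    SyncMessage(Long businessAId) {
        this.businessAId = businessAId;
    }
}

final class SyncMessageValidator {
    // Returns the validated id, or empty when the message must be rejected.
    static Optional<Long> requiredId(SyncMessage message) {
        if (message == null || message.businessAId == null || message.businessAId <= 0) {
            return Optional.empty();
        }
        return Optional.of(message.businessAId);
    }
}

final class SyncConsumer {
    int processed = 0;
    int rejected = 0;

    // Called once per consumed message; invalid ones never reach the database.
    void onMessage(SyncMessage message) {
        Optional<Long> id = SyncMessageValidator.requiredId(message);
        if (!id.isPresent()) {
            rejected++;   // log and alert here instead of querying without a filter
            return;
        }
        processed++;      // query business A by id.get(), then write to Elasticsearch
    }
}
```

Rejecting at the consumer means a malformed message costs one log line instead of one full table scan, no matter what the producer promised to send.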
3. Lessons learned
After this incident, I summed up some lessons to share with you:
(1) Always stay alert to production problems. Once a problem appears, never let it go; investigate it quickly. Stop blaming network jitter: most problems have nothing to do with the network.
(2) A large business table must protect itself: queries against it must verify the necessary conditions.
(3) When consuming MQ messages, verify the necessary conditions and do not trust any information source.
(4) Never believe colleagues who say, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty," and so on. To keep the system highly available and stable, trust only the code, not the people.
(5) General troubleshooting order when problems occur:
- Database: CPU, deadlocks, slow SQL.
- Application: CPU, memory, and logs of the gateway and core components.
(6) Business observability and alarms are essential and must be comprehensive, so that problems can be discovered and solved faster.
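Lesson (2), protecting the large table at the query itself, can be sketched as a small guard that refuses to build a SELECT when no usable filter is present. Everything here is an assumption for illustration: the class `GuardedQuery`, the row cap, and the table and column names. It also assumes that null filter values are silently skipped when the SQL is built, which is one common way an "always passed" condition turns into a full table scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical guard in front of a large table: it refuses to build a
// SELECT with no usable filter and always appends a row cap.
final class GuardedQuery {
    static final int MAX_ROWS = 1000; // assumed cap, tune per table

    // filters maps column name -> value; null values count as missing.
    // Column names must come from code, never from user input.
    static String build(String table, Map<String, Object> filters) {
        List<String> predicates = new ArrayList<>();
        if (filters != null) {
            for (Map.Entry<String, Object> e : filters.entrySet()) {
                if (e.getValue() != null) {
                    predicates.add(e.getKey() + " = ?");
                }
            }
        }
        if (predicates.isEmpty()) {
            // Fail loudly rather than scan the whole table.
            throw new IllegalArgumentException(
                    "refusing unfiltered query on " + table);
        }
        return "SELECT * FROM " + table
                + " WHERE " + String.join(" AND ", predicates)
                + " LIMIT " + MAX_ROWS;
    }
}
```

Throwing here turns a silent full table scan into an immediate, visible error in the caller, which is far cheaper to debug than exhausted connections.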


