Whoa, the Linux server's CPU is at 100%!
Yesterday afternoon I suddenly received an email alert from the operations team: CPU utilization on the data platform server had reached 98.94%, and it had been sitting above 70% for some time. At first glance it looks as if the hardware has hit a bottleneck and needs to be scaled out. But thinking it over, our business system is neither highly concurrent nor CPU-intensive; a utilization figure this high is out of all proportion, and we should not be hitting a hardware bottleneck this quickly. There had to be a problem somewhere in the business code logic.
First, log in to the server and run top to confirm what is actually going on, then analyze and decide based on what it shows.
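In practice this is a single interactive command; a minimal sketch of what to run and what to look at (the observations in the next two paragraphs refer to this output):

```bash
# Check overall CPU usage and load interactively
top
# Inside top:
#   - compare the "load average" values against the core count (8 on this machine)
#   - press 1 to expand per-core CPU usage
#   - press Shift+P to sort processes by %CPU and spot the hungriest PID
```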
Comparing the load average against the evaluation baseline for this machine (8 cores), it is clear that the server is under heavy load: a load average persistently above the core count means runnable tasks are queuing for CPU.
Looking at per-process resource usage, the process with PID 682 clearly accounts for the largest share of CPU.
Here we can use the pwdx command to find the process's working directory from its PID, and from that identify the project and the person responsible for it:
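For example (the PID comes from the top output above; the path in the output is hypothetical):

```bash
# pwdx prints the current working directory of a process, given its PID
pwdx 682
# 682: /opt/app/data-platform-web   <- hypothetical output; the path identifies the deployed project
```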
It can be concluded that this process corresponds to the web service of the data platform.
The traditional approach generally takes four steps (a consolidated sketch follows the list):
1. top, then press Shift+P to sort by CPU and find the busiest process // e.g. PID 1040
2. top -Hp 1040 // find the busiest thread inside that process, e.g. thread PID 1073
3. printf "0x%x\n" 1073 // convert the thread PID to hex (0x431), because jstack reports thread ids (nid) in hex
4. jstack 1040 | vim +/0x431 - // dump the Java thread stacks and jump to the frame for that nid
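Strung together with the example PIDs from the list above (substitute the values from your own output), the sequence looks like this; grep is used in the last step as a non-interactive alternative to vim:

```bash
# 1. Find the busiest process (press Shift+P inside top to sort by CPU), e.g. PID 1040
top

# 2. List that process's threads sorted by CPU and note the hottest one, e.g. thread PID 1073
top -Hp 1040

# 3. Convert the thread PID to hex, because jstack prints native thread ids (nid) in hex
printf "0x%x\n" 1073    # prints 0x431

# 4. Dump the Java thread stacks and pull out the frames for that nid
jstack 1040 | grep -A 30 'nid=0x431'
```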
But when locating problems in production every second counts, and the four steps above are still too cumbersome and slow. oldratlee (from Taobao, mentioned earlier) has packaged the whole process into a single tool, show-busy-java-threads.sh, which makes it easy to pin down this kind of problem directly on a live server:
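A typical invocation looks roughly like the sketch below. The script comes from oldratlee's useful-scripts repository, and the exact options can differ between versions, so treat the flags shown here as assumptions and check the script's --help output:

```bash
# Make the downloaded script executable (file name as given above)
chmod +x show-busy-java-threads.sh

# With no arguments it prints the stack traces of the busiest Java threads on the machine
./show-busy-java-threads.sh

# In the versions I have seen, -p restricts the check to one process and -c sets how many threads to show
./show-busy-java-threads.sh -p 1040 -c 5
```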
From its output it can be seen that a time-utility method in the system is consuming a disproportionately large share of CPU. With the specific method located, the next step is to check its code logic for performance problems.
※ If the production issue is urgent, you can skip 2.1 and 2.2 and go straight to 2.3; the analysis here is done from several angles only to present a complete troubleshooting approach.
After the analysis and troubleshooting above, the problem was finally traced to this time-utility method, which was responsible for the excessive server load and CPU usage.
It follows that if the current time is 10 a.m., a single query performs 10 × 60 × 60 × n = 36,000n calculations, and the per-query cost grows linearly through the day, peaking just before midnight. Since modules such as real-time query and real-time alerting issue large numbers of requests that each call this method several times, a great deal of CPU is consumed and wasted.
Having located the problem, the first idea was to cut down the number of calculations by optimizing the offending method. Investigation showed that the logic layer never used the contents of the Set returned by this method; it only used the Set's size. Once that was confirmed, the calculation was simplified with a new method (current seconds minus the seconds at midnight of the same day), the original call was replaced, and the excessive computation was eliminated. After the change went live, server load and CPU usage dropped to roughly one-thirtieth of their levels during the abnormal period and returned to normal. With that, the problem was solved.
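The article does not show the service code itself, but the replacement logic boils down to a single subtraction, which can be sanity-checked from the shell (GNU date assumed):

```bash
# Seconds elapsed since midnight = current epoch seconds - epoch seconds at 00:00 today
now=$(date +%s)
midnight=$(date -d "$(date +%Y-%m-%d) 00:00:00" +%s)
echo $(( now - midnight ))
# Run at exactly 10:00 this prints 36000, matching the 36,000-per-call estimate above
```

Either way, the key point is replacing a per-second loop whose cost grows all day with a constant-time subtraction.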