Whoa, the Linux system CPU is 100% maxed out!

WBOY
2024-02-13 23:27:12

Yesterday afternoon I received an email alert from the operations team: CPU utilization on the data platform server had reached 98.94%, and it had been staying above 70% for some time. At first glance this looks like a hardware bottleneck that calls for more capacity. On reflection, though, our business system is neither highly concurrent nor CPU-intensive, so utilization this high is out of proportion and a hardware limit should not have been reached so quickly. The problem had to be somewhere in the business code logic.

2. Troubleshooting ideas

2.1 Locate the high-load process PID

First log in to the server and run the top command to see what the server is actually doing, then analyze from there.

Comparing the load average against the usual yardstick, the number of CPU cores (8 here), confirms that the server is under high load.
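The comparison of load average against core count can be sketched as a small shell helper. This is only an illustration: the function names `load_per_core` and `is_overloaded` are hypothetical, and "load greater than core count" is a common rule of thumb rather than a threshold stated by the author.

```shell
#!/bin/sh
# Sketch: judge a load average relative to the number of CPU cores.
# Both function names are hypothetical helpers, not standard tools.

# Usage: load_per_core <load_average> <cores>
# Prints the load average divided by the core count, to two decimals.
load_per_core() {
  LC_ALL=C awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage: is_overloaded <load_average> <cores>
# Succeeds (exit status 0) when the load average exceeds the core count.
is_overloaded() {
  LC_ALL=C awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}
```

On a Linux host the inputs could come from `cut -d' ' -f1 /proc/loadavg` (the 1-minute load average) and `nproc` (the core count).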

Looking at the resource usage of each process, the process with PID 682 stands out with a high share of CPU.
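For a non-interactive version of the same check, the busiest process can be picked out of `ps` output. The sketch below assumes a `ps` that accepts `-o pid= -o pcpu=` (header-less PID and CPU columns); `busiest_pid` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read "PID %CPU" pairs on stdin and print the PID with the highest %CPU.
# busiest_pid is a hypothetical helper name.
busiest_pid() {
  LC_ALL=C awk '
    NR == 1 || $2 + 0 > max { max = $2 + 0; pid = $1 }   # remember the largest %CPU seen so far
    END { if (NR > 0) print pid }                         # print nothing on empty input
  '
}
```

A typical invocation would be `ps -e -o pid= -o pcpu= | busiest_pid`.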

2.2 Locate the specific abnormal service

Here we can use the pwdx command to find the process's working directory from its PID, which in turn identifies the project and the person responsible for it.
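On Linux, pwdx reports a process's current working directory, which is the same information exposed by the `/proc/<pid>/cwd` symlink. A minimal sketch of that lookup, assuming a mounted `/proc`; `proc_cwd` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the current working directory of a process, as pwdx does.
# Assumes Linux with /proc mounted; proc_cwd is a hypothetical helper name.
# Usage: proc_cwd <pid>
proc_cwd() {
  readlink "/proc/$1/cwd"
}
```

For the process in this article, `proc_cwd 682` would print the directory the service was started from, provided the caller has permission to inspect that process.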

It can be concluded that this process corresponds to the web service of the data platform.

2.3 Locate the abnormal thread and specific code lines

The traditional approach generally takes four steps:

1. top, then press P to sort by CPU: 1040 // Sort processes by load and find the busiest process PID

2. top -Hp <process PID>: 1073 // Find the busiest thread ID within that process

3. printf "0x%x" <thread ID>: 0x431 // Convert the thread ID to hexadecimal for searching the jstack output

4. jstack <process PID> | vim +/<hex thread ID> - // For example: jstack 1040 | vim +/0x431 -
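Steps 3 and 4 can be combined into a small filter. The sketch below relies on the fact that each thread entry in a jstack dump carries a `nid=0x...` field holding the native thread ID in hexadecimal, and that entries are separated by blank lines; `to_hex_tid` and `stack_for_tid` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: extract one thread's stack from a jstack dump by its decimal thread ID.
# to_hex_tid and stack_for_tid are hypothetical helper names.

# Usage: to_hex_tid <decimal_thread_id>
to_hex_tid() {
  printf '0x%x\n' "$1"
}

# Usage: jstack <process PID> | stack_for_tid <decimal_thread_id>
# Prints the dump entry whose nid field equals the hex thread ID.
stack_for_tid() {
  awk -v nid="nid=$(to_hex_tid "$1")" '
    NF == 0   { printing = 0; next }                                        # blank line ends an entry
    !printing { for (i = 1; i <= NF; i++) if ($i == nid) printing = 1 }     # exact field match only
    printing  { print }
  '
}
```

With the example values above, `jstack 1040 | stack_for_tid 1073` would print only the stack of the thread that top reported as busiest.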

But when diagnosing a production problem every second counts, and the four steps above are still too cumbersome and time-consuming. oldratlee of Taobao, introduced previously, wrapped the whole process into a tool, show-busy-java-threads.sh, which makes it easy to locate this kind of problem in production.

The output shows that a time utility method in the system is consuming a relatively large share of CPU. Having located the specific method, the next step is to check its code logic for performance problems.

※ If the production problem is urgent, you can skip 2.1 and 2.2 and go straight to 2.3. The multi-angle analysis here is only meant to present the complete line of reasoning.

3. Root cause analysis

After the analysis and troubleshooting above, we traced the excessive server load and CPU usage to a time utility method.

  • Abnormal method logic: converts a timestamp into the corresponding date-time format;
  • Upper-layer call: computes every second from midnight to the current time, converts each one into the corresponding format, and returns them all in a set;
  • Logic layer: the query logic of the data platform's real-time reports. Real-time report requests arrive at a fixed interval, and a single query calls the method multiple (n) times.

It follows that at 10 a.m. a single query performs 10 × 60 × 60 × n = 36,000n calculations, and the number of calculations per query grows linearly as the day progresses toward midnight. Because a large number of query requests from modules such as real-time query and real-time alerting call this method multiple times, a great deal of CPU is consumed and wasted.
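The arithmetic can be written out as a worked equation. Here $t$ (seconds elapsed since midnight) and $C$ (conversions per query) are symbols introduced for illustration; $n$ is the per-query call count from the text.

```latex
\begin{align*}
C(t) &= n \cdot t, \qquad t = 3600\,h + 60\,m + s \\
C(\text{10:00:00}) &= n \cdot (10 \times 60 \times 60) = 36{,}000\,n \\
C(\text{23:59:59}) &= n \cdot (23 \times 3600 + 59 \times 60 + 59) = 86{,}399\,n
\end{align*}
```

So the cost of one query more than doubles between 10 a.m. and the end of the day, before accounting for how many queries arrive.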

4. Solution

After locating the problem, the first consideration was to reduce the number of calculations by optimizing the abnormal method. Investigation showed that the logic layer never used the contents of the set returned by this method; it only used the set's size. After confirming this, we replaced the call with a new method that computes the value directly (current time in seconds minus the time in seconds at midnight that day), which eliminated the excessive calculations. After release we observed the server load and CPU usage: compared with the abnormal period, both dropped by a factor of 30 and returned to normal. At this point the problem was solved.
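The difference between the two approaches can be sketched in shell. This is an illustration of the shape of the fix only, not the author's code (which is not shown in the article and, given the use of jstack, presumably runs on the JVM); `count_by_enumeration` and `count_by_subtraction` are hypothetical names, and both take epoch seconds as arguments.

```shell
#!/bin/sh
# Sketch: two ways to obtain "number of seconds since midnight".
# Both function names are hypothetical; arguments are epoch seconds.

# Slow shape: format every second since midnight, de-duplicate, then count.
# Work grows linearly with the time of day.
# Usage: count_by_enumeration <midnight_epoch> <now_epoch>
count_by_enumeration() (
  elapsed=$(( $2 - $1 ))
  offset=0
  while [ "$offset" -lt "$elapsed" ]; do
    printf '%02d:%02d:%02d\n' \
      $(( offset / 3600 )) $(( offset % 3600 / 60 )) $(( offset % 60 ))
    offset=$(( offset + 1 ))
  done | sort -u | wc -l | tr -d ' '
)

# Fast shape: the size of that set is simply the number of elapsed seconds.
# Usage: count_by_subtraction <midnight_epoch> <now_epoch>
count_by_subtraction() {
  echo $(( $2 - $1 ))
}
```

Both functions return the same number for the same inputs, but the second does constant work. Assuming GNU date, the arguments could be obtained with `date -d 'today 00:00' +%s` and `date +%s`.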

5. Summary

  • When coding, we must pay attention to performance as well as to implementing the business logic. Being able to deliver a business requirement and being able to deliver it more efficiently and more elegantly reflect two very different levels of engineering ability, and the latter is an engineer's core competitive strength.
  • After the code is written, review it more often and keep asking whether it could be implemented in a better way.
  • Don't overlook any small detail in a production issue! The devil is in the details. Engineers need curiosity and a drive for excellence; only then can they keep growing and improving.

Statement:
This article is reproduced from lxlinux.net. In case of infringement, please contact admin@php.cn to have it removed.