5 things you need to know to build high-performance Java applications
This article is excerpted from *Java Performance*; readers who care about Java performance will likely know the book. Performance is something many of us rarely think about while writing Java code day to day, yet the code we write inevitably affects it, from choices as small as using bit operations in place of arithmetic to ones as large as the overall architecture of an application. Performance is closer to us than it seems. This article raises several points of common concern in the performance field; it is meant as a starting point, and each point rewards deeper study.
Performance tuning usually proceeds in three steps: 1. performance monitoring; 2. performance analysis; 3. performance tuning.
At the operating-system level, the main performance metrics to watch are: CPU utilization, the CPU scheduling (run) queue, memory utilization, network I/O, and disk I/O.
1. CPU utilization
For an application to achieve its best performance and scalability, it must not only make full use of the available CPU cycles but also use those cycles in a way that creates value rather than waste. Making full use of CPU cycles is challenging for multi-threaded applications running on multi-processor, multi-core systems, and a saturated CPU does not mean performance and scalability have reached their optimum. To understand how an application uses CPU resources, we must measure at the operating-system level. On most operating systems, CPU utilization is reported as user time and system (kernel) time. User CPU time is the time spent executing application code; kernel or system CPU time is the time the application spends executing operating-system kernel code on its behalf. High kernel or system CPU usage can indicate contention on shared resources or heavy I/O device interaction. The ideal state for performance and scalability is 0% kernel or system CPU time, because every cycle spent executing kernel or system code could instead be spent executing application code. A correct direction for CPU optimization is therefore to reduce, as far as possible, the time the CPU spends in kernel or system code.
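From inside a running JVM, a rough view of how much CPU the process is consuming can be obtained through the platform MXBeans. The sketch below assumes a HotSpot-based JVM, where the `OperatingSystemMXBean` returned by `ManagementFactory` also implements the non-standard `com.sun.management.OperatingSystemMXBean` extension; the cast is guarded because the extension is not part of the standard API.

```java
import java.lang.management.ManagementFactory;

public class CpuLoadProbe {
    public static void main(String[] args) {
        // Standard interface: always available.
        java.lang.management.OperatingSystemMXBean base =
                ManagementFactory.getOperatingSystemMXBean();
        System.out.println("virtual processors: " + base.getAvailableProcessors());

        // HotSpot extension: gives the recent CPU load of this JVM process
        // as a fraction between 0.0 and 1.0 (or -1.0 if unavailable).
        if (base instanceof com.sun.management.OperatingSystemMXBean) {
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) base;
            double processLoad = os.getProcessCpuLoad();
            System.out.printf("process CPU load: %.2f%%%n", processLoad * 100);
        }
    }
}
```

Note that this only reports utilization, not how the cycles are split between user and kernel code; for that split you still need OS tools such as `vmstat` or `mpstat`.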
For compute-intensive applications, performance monitoring goes deeper than user and kernel/system CPU usage. For these applications we also want to monitor instructions per clock (IPC), the number of instructions retired per CPU clock cycle, or its reciprocal, cycles per instruction (CPI). These two dimensions matter because the CPU performance tools bundled with modern operating systems typically report only CPU utilization, not the time within those cycles spent actually executing instructions. This means the operating system will consider the CPU "in use" even while it is waiting for data to arrive from memory, a scenario known as a "stall". Stalls occur frequently: one happens any time the CPU executes an instruction whose data is not yet ready, that is, not in a register or the CPU cache.
When a stall occurs, the CPU wastes clock cycles because it must wait for the instruction's data to arrive in a register or cache, and it is normal for several hundred clock cycles to be wasted on a single stall. The strategy for improving the performance of compute-intensive applications is therefore to reduce the occurrence of stalls, or to make better use of the CPU cache so that fewer cycles are wasted waiting for data. Performance monitoring at this level is beyond the scope of this book and may require the help of a performance expert; however, the profiling tool Oracle Solaris Studio Performance Analyzer, mentioned later, can capture such data.
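The relationship between the two metrics is simple arithmetic: CPI is cycles divided by instructions, and IPC is its reciprocal. The sketch below uses hypothetical counter readings, such as might come from a tool like `perf stat`, purely to illustrate the calculation; a high CPI suggests the CPU is spending many of its cycles stalled rather than retiring instructions.

```java
public class CpiDemo {
    // CPI = cycles / instructions executed in the same interval.
    static double cpi(long cycles, long instructions) {
        return (double) cycles / instructions;
    }

    // IPC = instructions / cycles, the reciprocal of CPI.
    static double ipc(long cycles, long instructions) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Hypothetical hardware-counter readings for one sampling interval.
        long cycles = 8_000_000_000L;
        long instructions = 2_000_000_000L;
        System.out.println("CPI = " + cpi(cycles, instructions)); // 4.0: many stalled cycles
        System.out.println("IPC = " + ipc(cycles, instructions)); // 0.25
    }
}
```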
2. CPU scheduling queue
In addition to monitoring CPU usage, we can check whether the system is fully loaded by monitoring the CPU run (scheduling) queue. The run queue holds lightweight processes that are ready to execute but are waiting to be scheduled onto a CPU; the queue builds up when more lightweight processes are runnable than the processors can handle. A deep run queue indicates that the system is saturated. The run-queue depth equals the number of runnable threads that cannot be placed on a virtual processor, where the number of virtual processors equals the number of hardware threads in the system; in Java it can be obtained with Runtime.getRuntime().availableProcessors(). When the run-queue depth reaches four times the number of virtual processors or more, the system will become sluggish or unresponsive.
A general guideline for run-queue depth: start paying attention when it exceeds twice the number of virtual processors, though immediate action is not yet required; at three to four times that number or higher, the problem should be addressed without delay.
There are usually two ways to reduce run-queue depth. The first is to add CPUs to share the load, or to reduce the load placed on the existing CPUs. This approach lowers the number of runnable threads per execution unit and thereby reduces the depth of the run queue.
The other way is to profile the applications running on the system and reduce their CPU consumption: for example, cutting the CPU cycles spent on garbage collection, or finding better algorithms that accomplish the same work with fewer instructions. Performance experts usually focus on this latter approach, shortening code execution path length and improving CPU instruction selection; Java programmers can improve execution efficiency through better algorithms and data structures.
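As a small illustration of "better algorithms and data structures", consider a membership test. Scanning an `ArrayList` is O(n) per lookup, while a `HashSet` gives expected O(1); the result is identical, but the hash-based structure burns far fewer CPU cycles per query. The timing code below is only a rough sketch, not a rigorous benchmark (a real measurement would use a harness such as JMH to account for JIT warm-up).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupComparison {
    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);
        Set<Integer> set = new HashSet<>(list);

        // Same membership question, two data structures:
        // the list scans every element; the set hashes straight to it.
        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1);  // O(n) scan
        long t1 = System.nanoTime();
        boolean inSet = set.contains(n - 1);    // expected O(1)
        long t2 = System.nanoTime();

        System.out.printf("list: %b in %d ns; set: %b in %d ns%n",
                inList, t1 - t0, inSet, t2 - t1);
    }
}
```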
3. Memory utilization
In addition to CPU usage, the system's memory attributes also need to be monitored. These include, for example: paging, swapping, locking, and the context switching caused by multi-threading.
Swapping usually occurs when an application needs more memory than is physically available. To handle this situation, the operating system configures a corresponding area called the swap space, usually located on a physical disk. When physical memory is exhausted, the operating system temporarily swaps part of the memory contents out to the swap space, choosing the least recently accessed regions so as not to disturb the "busier" ones. When a swapped-out region is later accessed by the application, it must be read back in from the swap area in units of pages. Swapping noticeably hurts application performance.
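Swap activity itself is an OS-level metric (visible with `vmstat` on Linux and Solaris, not through the Java standard library), but from inside the JVM you can at least watch heap headroom with the standard `Runtime` API. The sketch below just reads those figures; the practical point is to size `-Xmx` so the whole heap fits comfortably in physical RAM, which is the simplest way to keep the garbage-collected heap from ever being swapped out.

```java
public class HeapHeadroom {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max   = rt.maxMemory();    // the -Xmx ceiling for this JVM
        long total = rt.totalMemory();  // heap currently reserved from the OS
        long free  = rt.freeMemory();   // unused portion of the reserved heap
        long used  = total - free;

        System.out.printf("heap used: %d MB of %d MB max%n",
                used / (1024 * 1024), max / (1024 * 1024));
        // Keep max well below physical RAM (leaving room for the OS and
        // other processes) so the heap never spills into swap space.
    }
}
```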
The virtual machine's garbage collector performs particularly badly while swapping is occurring, because the collector visits large parts of the heap during a collection and therefore triggers swap activity itself. If part of the garbage-collected heap has been swapped out to disk, it must be paged back in, page by page, before the collector can scan it, which dramatically lengthens collection times. And if the collection is a "stop the world" operation (one that pauses the application's response), that pause is lengthened as well.
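Lengthening collection times can be observed from inside the JVM via the standard `GarbageCollectorMXBean`s. This sketch simply dumps the cumulative counters; a sudden jump in total collection time without a matching jump in collection count is one symptom consistent with the heap being paged back in from swap while the collector scans it (the bean names vary by collector, e.g. "G1 Young Generation" under G1).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseWatch {
    public static void main(String[] args) {
        // One bean per collector; counts and times are cumulative since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```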
4. Network I/O
The performance and scalability of distributed Java applications are limited by network bandwidth and network performance. For example, if we send more packets to a network interface than it can handle, they accumulate in the operating system's buffers, which delays the application; other network problems cause similar delays.
Tools that can distinguish and monitor network usage well are hard to find among the utilities bundled with operating systems. Linux provides the netstat command, and both Linux and Solaris report per-second network statistics including packets sent, packets received, errors, collisions, and so on. On Ethernet a small number of packet collisions is normal; a large number of packet errors may indicate a faulty network card. At the same time, although netstat can count the data a network interface sends and receives, it is hard to tell from those counts whether the card is fully utilized. For example, if netstat -i shows 2,500 packets sent per second, we still cannot determine whether current utilization is 1% or 100%; without knowing the packet sizes, we can only conclude that there is traffic. In short, the netstat provided by Linux and Solaris cannot by itself tell us whether the network is limiting performance; we need additional tools to monitor the network while our Java application is running.
5. Disk I/O
If an application performs disk operations, we need to monitor the disk for potential performance problems. Some applications are inherently I/O-intensive, databases for example, but disk usage also shows up in nearly every application through its logging system, since logs are usually used to record important information while the system runs.