Weird precision diff tracing
1. Problems found in Query-diff test
Query-diff is a commonly used testing method on the retrieval side. The idea is to send the same set of retrieval requests to both the baseline version and the version under test of a system or module. Typically, there are only minor differences (program functionality, configuration, etc.) between the two versions. After sending the requests, the search results returned by the two versions are compared to verify whether the differences affect the final calculation result.
The tested module A in this case is written in C, and its core output is a single-precision floating point number, recorded as Q.
During the query-diff test after an upgrade of module A, a precision diff was found in the Q value: about 1% of the results differed, and the largest difference was in the decimal places, even though this upgrade was expected to be diff-free.
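The comparison step of a query-diff run can be sketched roughly as follows. This is a minimal illustration, not the actual test tool: the `diff_stats` struct and the `query_diff` function are hypothetical names, and it assumes the Q values from both versions have already been collected into two arrays in the same request order.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of a query-diff run over n paired results (hypothetical type). */
typedef struct {
    size_t total;      /* number of compared Q values */
    size_t differing;  /* number of pairs whose values differ */
    float  max_abs;    /* largest absolute difference seen */
} diff_stats;

/* Compare baseline and test Q values pairwise and collect statistics. */
diff_stats query_diff(const float *baseline, const float *test, size_t n)
{
    assert(baseline != NULL && test != NULL);
    diff_stats s = { n, 0, 0.0f };
    for (size_t i = 0; i < n; ++i) {
        float d = baseline[i] - test[i];
        if (d < 0.0f)
            d = -d;                 /* absolute value without <math.h> */
        if (d != 0.0f) {
            ++s.differing;
            if (d > s.max_abs)
                s.max_abs = d;
        }
    }
    return s;
}
```

The proportion of diffs is then `differing / total`, and `max_abs` shows in which decimal place the largest diff appears.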
2. In-depth investigation
When a diff occurs, the first step is to clarify the direction of the investigation. If the cause is not obvious at a glance, use elimination to check the suspects one by one, narrow the scope, and avoid wasting effort. Two major directions were listed: the environment or the program.
Look at the environment first:
- The configuration and vocabulary of the old and new environments were carefully checked on site and matched expectations, which ruled out the environment construction tools.
- Since this upgrade is forward compatible, the configuration and vocabulary of the old and new environments were unified and the test was rerun. The diff was reproduced, which ruled out configuration differences.
The environment seemed fine, so the next step was the program and the verification process:
- Multiple rounds of tests gave the same verification results, which ruled out a diff caused by a random strategy.
- Debug logs were printed and the intermediate results of each processing step were checked. There were no problems; the diff only appeared in the last step, the calculation of the Q value. Risk points such as dirty data across threads, dirty data in the process-level cache, and variable type conversion were ruled out one by one.
- To be completely sure, the programs in both the old and new environments were replaced with the new version and retested. If the program were really the cause, there should be no diff. However, the diff reappeared, even though random diffs had already been ruled out!
At this point the investigation hit a bottleneck: neither the environment nor the program seemed to be the cause.
Calm down and think again. The investigation so far treated the environment as just the configuration and vocabulary, assuming that if those two are the same, the environment is the same. This is one-sided. The environment also includes the compilation environment and the runtime environment, that is, the system and the hardware. This led to a new verification idea:
- Both the old and new versions of the program were built on the company's cloud compilation cluster, so there should be no problem. To avoid taking things for granted, however, the compilation parameters were carefully checked and both versions were recompiled on the same local machine. The diff was reproduced, which ruled out compilation factors.
- The old and new environments were copied to the same machine and the requests were replayed. The diff disappeared! This confirmed that the runtime environment was the cause.
The operating environment includes the operating system and hardware levels. Strike while the iron is hot and continue to investigate:
- The operating systems of the two machines where the diff appeared were confirmed to be the same, both CentOS 4.3, which ruled out the operating system.
- Differences in hard disk and memory models are unlikely to cause a diff, so they were not verified for now.
- The CPU of the machine hosting the new environment is a Xeon E5645, and the CPU of the machine hosting the old environment is a Xeon E5-2620. Suspecting the CPU model, I found another machine with the same CPU as the old environment, deployed the new environment on it, and retested. The diff disappeared, and the CPU became the target.
3. Revealing the truth
While analyzing the CPU, and after quickly ruling out the number of cores, the maximum number of threads, and the L1, L2 and L3 caches, the instruction set differences in the CPU feature list caught my attention.
Supplementary knowledge 1: The role of the cpu instruction set
An instruction set is the collection of operations that a CPU implements in hardware. Extended instruction sets add specialized instructions that let the CPU run certain workloads more efficiently. To explain how they help, two technologies have to be mentioned: SISD (Single Instruction Single Data) and SIMD (Single Instruction Multiple Data).
Take the addition instruction as an example. After an SISD CPU decodes the addition instruction, the execution unit first accesses memory to obtain the first operand, then accesses memory again to obtain the second operand, and only then performs the addition. In a CPU using SIMD, after the instruction is decoded, several execution units access memory at the same time and obtain all the operands at once. This makes SIMD particularly suitable for data-intensive operations.
The SSE series and AVX in the CPU instruction set are used for floating point operations, and AVX is one of the differences between the two CPUs, which is highly suspicious. Now we need to find evidence that the program is optimized using AVX.
However, there is no explicitly optimized code in the ASQ module. The code that calculates the Q value calls the interface of the static library libA, and the libA code does not use these instruction sets. But libA is statically linked against libB, so we traced the compilation dependencies all the way down and found that the fourth layer was libX, provided by IDL. Its code was confidential and could not be viewed.
I had to ask the responsible RD for advice. The RD said that libX does use SSE instruction optimization and the math function library MKL provided by Intel, but does not use AVX.
Was this another dead end? With the last bit of hope, I checked MKL's official introduction on Intel's website and made an unexpected discovery: AVX optimization was introduced in MKL! [1]
The last step was to confirm that AVX is the source of the diff. Further evidence was soon found in Intel's product documentation [2]:
The FMA instructions in AVX2 apply to floating point operations such as matrix multiplication, dot products and polynomial evaluation. Their efficiency and accuracy are improved compared to previous instruction sets, because FMA completes the multiplication and the accumulation in one operation. Posts from Intel technical staff in the official forum support this as well [3].
Supplementary knowledge 2: floating point number storage methods in computers
Both float and double follow the IEEE 754 specification for storage. float is the 32-bit format with a 24-bit significand, and double is the 64-bit format with a 53-bit significand.
Whether it is single precision or double precision, storage is divided into three parts:
1. Sign bit (Sign): 0 represents positive, 1 represents negative
2. Exponent bits (Exponent): store the exponent of the scientific notation, using a biased (offset) representation
3. Mantissa bits (Mantissa): store the fractional part of the significand
The bit positions used by each format are shown in the following table:
| Type | Total length | Mantissa part | Exponent part | Sign bit |
| Single precision | 32 bits | 0-22 | 23-30 | 31 |
| Double precision | 64 bits | 0-51 | 52-62 | 63 |
| Extended double | 80 bits | 0-63 | 64-78 | 79 |
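The single-precision row of the table can be checked with a small sketch like the one below. The `float_fields` struct and `split_float` function are hypothetical names introduced for illustration, and the code assumes that `float` is a 32-bit IEEE 754 value on the target platform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three stored parts of a single-precision float (hypothetical type). */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 23-30, stored with a bias of 127 */
    uint32_t mantissa;  /* bits 0-22 */
} float_fields;

/* Split a float into its sign, exponent and mantissa fields. */
float_fields split_float(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);  /* assumption: float is 32 bits wide */
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */

    float_fields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, `split_float(1.0f)` yields sign 0, stored exponent 127 (a true exponent of 0 plus the bias), and mantissa 0.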
At the hardware level, floating point logic is implemented in the CPU's floating point units. The traditional x87 FPU (Floating Point Unit) calculates with 80-bit precision by default, while the single-precision results produced by SSE and AVX are 32 bits wide. If two code paths carry different precision in their intermediate calculations, the results can differ once they are rounded to 32 bits and stored in memory.
Since Intel's underlying algorithm is confidential, we can only guess that the intermediate precision differs between the AVX and SSE implementations of the optimized functions, but the existence of an accuracy difference is certain.
The truth has emerged: an AVX FMA rounds once where a separate multiplication and addition round twice, so its result can differ from the SSE result in the last bit. With iterative calculations the difference accumulates. The Q value is produced by complex matrix operations, and this tiny 1-bit difference is magnified to the ten-thousandths place. At the same time, to stay compatible with different machines, the MKL code falls back to SSE when running on a CPU that does not support AVX.
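Why a fused multiply-add can produce a different float than a separate multiply and add can be illustrated with the sketch below. It is not Intel's implementation. Both function names are hypothetical, and the fused version is only emulated in double precision to keep the example free of library dependencies (C99 also provides a real `fmaf` in `<math.h>`).

```c
#include <assert.h>

/* Multiply, round to float, then add and round again (two roundings). */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;  /* volatile forces the product to be rounded to float */
    return product + c;
}

/* Emulated fused multiply-add. The product of two floats is exact in
 * double, so only the final sum is rounded. The sum is rounded to double
 * and then to float, which matches a true FMA for the values used here
 * but not necessarily for every input. */
float fused_mul_add(float a, float b, float c)
{
    assert(sizeof(double) > sizeof(float));  /* assumption: double is wider than float */
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact value of a*b + c is 2^-24. The two-step version rounds the product to 1 + 2^-11 first and returns 0, while the fused version returns 2^-24. Repeating such operations inside matrix calculations is how a last-bit difference can grow.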
Supplementary knowledge 3: Methods of using SSE and AVX to optimize programs
The addition instruction is again used as the example. The required header files and compiler options are not covered in detail here; please refer to the related documentation.
Basic version:
Simple loop to accumulate and sum.
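A minimal sketch of such a loop is shown below. The function name `sum_basic` is an assumption made for illustration; the original code is not available.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate n floats one at a time, in order. */
float sum_basic(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```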
SSE optimized version
An SSE register is 128 bits (16 bytes) wide and holds 4 single-precision floating point numbers at a time. The input is loaded into the register in groups of 4 and summed with the built-in addition function. The 4 partial sums are then added together, and finally the items left over after the last full group are added to get the final result.
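The SSE approach can be sketched as follows, using the `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` intrinsics from `<xmmintrin.h>`. The function name `sum_sse` is hypothetical, and the `__SSE__` guard is an assumption added here so that the sketch falls back to a plain loop on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Sum n floats, 4 at a time where SSE is available. */
float sum_sse(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    float total = 0.0f;
    size_t i = 0;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();                 /* four partial sums */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                     /* add the four partial sums */
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i)                             /* leftover items */
        total += data[i];
    return total;
}
```

Note that the additions happen in a different order than in a simple loop, so for general inputs the result may differ slightly from the sequential sum.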
AVX optimized version
The AVX optimization is similar to SSE, but an AVX register is 256 bits (32 bytes) wide and holds 8 single-precision floating point numbers, so the input is loaded into the register in groups of 8.
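A sketch of the AVX approach is given below. It is an illustration under several assumptions: a GCC or Clang compiler on x86, the hypothetical names `sum_avx` and `sum_avx_kernel`, and a runtime check with `__builtin_cpu_supports` so that the code falls back to a plain loop on CPUs without AVX, similar in spirit to the fallback described for MKL.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* Compiled with AVX enabled for this function only. */
__attribute__((target("avx")))
static float sum_avx_kernel(const float *data, size_t n)
{
    __m256 acc = _mm256_setzero_ps();              /* eight partial sums */
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)                    /* add the partial sums */
        total += lanes[k];
    for (; i < n; ++i)                             /* leftover items */
        total += data[i];
    return total;
}
#endif

/* Use the AVX kernel when the CPU supports it, otherwise a plain loop. */
float sum_avx(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx"))
        return sum_avx_kernel(data, n);
#endif
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

Because the code path is chosen at run time, the same binary can take different paths on different CPUs, which is exactly the kind of situation in which a precision diff between machines can appear.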
To verify the effect of the optimization, randomly generate an input array and write a simple test case. In a performance comparison of the three algorithms, measured in the number of floats accumulated per second, the SSE version reaches 4 times the throughput of the basic version, while the AVX version reaches 8 times! [4]
4. Summary and lessons learned
Problem Summary:
- During the query-diff compatibility test, a diff was found between the Q values calculated by the new and old versions of module A.
- The investigation determined that the precision diff comes from differences in the floating point instruction sets (AVX/SSE) supported by the CPUs of the runtime environments.
- In this case both the proportion and the absolute value of the diff are small. Although it does not currently affect online services, if the algorithm becomes more complex and the diff accumulates to the percent level, it will cause the strategy to fail.
- If the floating point operations of other modules use instruction set optimization, they also need to be checked for the same problem.
Solution:
- When allocating test resources, ensure that the machines hosting the new and old environments have the same CPU.
- Add an environment check before executing query-diff to confirm once more that there are no hardware differences.
- When deploying services online, also make sure that the machines support the AVX instruction set to achieve the best performance and accuracy.
- Check whether other modules use instruction set optimization in a similar way, to avoid risks in advance.
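The environment check mentioned in the list above could be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that the `flags` line of `/proc/cpuinfo` has already been read from each machine into a string (on Linux x86 this line lists the supported instruction sets as space-separated tokens).

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole token in `flags_line`, else 0. */
int has_cpu_flag(const char *flags_line, const char *flag)
{
    assert(flags_line != NULL && flag != NULL);
    size_t len = strlen(flag);
    if (len == 0)
        return 0;

    const char *p = flags_line;
    while ((p = strstr(p, flag)) != NULL) {
        /* A match only counts if it is delimited on both sides. */
        int starts_token = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        char after = p[len];
        int ends_token = after == '\0' || after == ' ' ||
                         after == '\t' || after == '\n';
        if (starts_token && ends_token)
            return 1;
        p += 1;  /* keep searching after this partial match, e.g. "avx2" */
    }
    return 0;
}

/* Two machines agree on a flag if both or neither support it. */
int same_flag_support(const char *flags_a, const char *flags_b,
                      const char *flag)
{
    return has_cpu_flag(flags_a, flag) == has_cpu_flag(flags_b, flag);
}
```

A query-diff run could refuse to start when `same_flag_support` returns 0 for `"avx"` on the two test machines.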
Inspiration and suggestions:
- Programs that are intensive in floating point operations can consider using instruction set functions such as SSE/AVX to optimize performance, which can usually improve efficiency significantly (SSE: 4 times, AVX: 8 times).
- When using these instruction sets, pay attention to the number of iterations (that is, how often the output of an instruction set function is fed back in as the input of another one), so that precision diffs do not accumulate to a level that cannot be ignored.
- Query-diff testing can be applied to more compatibility testing scenarios, such as comparing the impact of differences in the underlying system and hardware (CPU, operating system, basic libraries) on applications.
Software engineering is inseparable from hardware support. Differences in compilation and running environments may cause differences in service performance and final calculation results. Such issues require special attention at all stages of development, testing, and launch. It is important to be a programmer who combines software and hardware!
Reference materials:
[1] https://software.intel.com/zh-cn/articles/whats-new-in-intel-mkl
[2] https://software.intel.com/zh-cn/articles/intel-xeon-processor-e7-88004800-v3-product-family-technical-overview
[3] https://software.intel.com/en-us/forums/topic/507004
[4] http://www.cnblogs.com/zyl910/archive/2012/10/22/simdsumfloat.html