Is there a problem with floating point arithmetic?

Question

Consider the following code: 0.1+0.2==0.3-&gt;false0.1+0.2-&gt;0.30000000000000004 Why do these inaccuracies occur?

P粉041856955 · Answer

Hardware Designer’s Perspective

I thought I'd add some perspective from a hardware designer's perspective, since I design and build floating point hardware. Knowing the origin of the error may help understand what's going on in the software, and ultimately, I hope this helps explain why floating point errors occur and seem to accumulate over time.

1 Overview

From an engineering perspective, most floating point operations will have some error since the hardware performing floating point calculations only needs to have a one unit error below half the last bit. Therefore, most hardware will stop at a precision that only needs to produce less than half the error of the last bit in a single operation, which is especially tricky in floating point division. What constitutes a single operation depends on the number of operands accepted by the unit. For most units, it's two, but some units accept 3 or more operands. Therefore, there is no guarantee that repeated operations will produce ideal errors, as these errors will accumulate over time.

2. Standard

Most processors follow the IEEE-754 standard, but some use denormalized or different standards. For example, there is a denormalization mode in IEEE-754 that allows very small floating point numbers to be represented at the expense of precision. However, the following describes the normalized mode of IEEE-754, which is the typical operating mode.

In the IEEE-754 standard, hardware designers can choose any error/epsilon value as long as it is less than half the unit of the last bit, and the result only needs to be less than half the unit of the last bit in one operation. This explains why errors accumulate over repeated operations. For IEEE-754 double precision, this is bit 54 because there are 53 bits used to represent the numeric part (normalized part) of the floating point number, also called the mantissa (e.g. 5.3 in 5.3e5). The next few sections will look in more detail at the causes of hardware errors in various floating-point operations.

3. Causes of division and rounding errors

The main cause of error in floating point division is the division algorithm used to calculate the quotient. Most computer systems use the inverse of multiplication to compute division, primarily in Z=X/Y and Z = X * (1/Y). The division is calculated iteratively, i.e. some number of digits of the quotient are calculated each cycle until the required accuracy is reached, which for IEEE-754 is anything less than half the error of the last digit. The reciprocal table of Y (1/Y) is called the quotient selection table (QST) in slow division. The number of digits in the quotient selection table is usually the width of the base, or the number of quotients calculated for each iteration plus a few guard bits. . For double precision (64 bits) of the IEEE-754 standard, it will be the base size of the divider plus a few guard bits k, where k>=2. So, for example, a typical quotient selection table that computes quotients 2 bits at a time (base 4) would be 2 2 = 4 bits (plus a few optional bits).

3.1 Division rounding error: Approximation of reciprocal

The reciprocal in the quotient selection table depends on the division method: slow division (such as SRT division) or fast division (such as Goldschmidt division); each entry is modified according to the division algorithm to minimize the error . Regardless, all reciprocals are approximations of the actual reciprocals, and certain errors are introduced. Both the slow division and fast division methods calculate the quotient iteratively, that is, each step calculates a certain number of quotient digits, then subtracts the result from the dividend, and the divisor repeats these steps until the error is less than half of the last digit. Slow division methods compute a fixed number of quotient digits in each step and are generally cheaper, whereas fast division methods compute a variable number of quotient digits in each step and are generally more expensive. The most important part about division methods is that most of them rely on repeated multiplication by approximating the reciprocal of , so they are error-prone.

4. Rounding errors in other operations: truncation

Another cause of rounding errors in all operations is the different truncation modes allowed by IEEE-754. There are truncate, round toward zero, round to nearest (default) , round down, and round up. All methods introduce an error of less than half a unit of the last bit for a single operation. Over time and repeated operations, truncation also accumulates into the resulting error. This truncation error is particularly troublesome in exponential operations involving some form of repeated multiplication.

5. Repeat operation

Because the hardware performing floating point calculations only needs to produce results with less than half the error of the last bit of a unit in a single operation, if not observed, the error will increase as the operation is repeated. This is why in calculations that require bounded errors, mathematicians use methods such as using IEEE-754's Nearest Rounding to Even Numbers, because the errors are more likely to cancel each other out over time, and Interval arithmetic combined with a variant of the IEEE 754 rounding mode to predict rounding errors and correct them. Rounding to the nearest even digit (at the last bit) is the default rounding mode for IEEE-754 due to the lower relative error compared to other rounding modes.

Please note that the default rounding mode, rounding to the nearest even digit, guarantees an operation with an error less than half a unit of the last digit. Using only truncation, round up, and round down can result in an error greater than half a unit of the last digit but less than one unit of the last digit, so unless you are calculating on an interval

P粉041881924 · Answer

BinaryFloating point numberThe operation is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the matter is that numbers are represented in this format as an integer times a power of two; a rational number whose denominator is not a power of two (such as 0.1, which is 1/10) cannot be accurately represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

0.1000000000000000055511151231257827021181583404541015625 (decimal), or
0x1.999999999999ap-4 (C99 hexadecimal floating point notation).

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

0.1 (decimal), or
0x1.99999999999999...p-4 (similar to C99 hexadecimal floating point notation, where ... represents an endless sequence of 9s).

The constants 0.2 and 0.3 in your program will also be approximations of their true values. Exactly 0.2 The closest double is greater than the rational number 0.2, but the closest double is smaller than the rational number 0.3. The sum of 0.1 and 0.2 ends up being greater than the rational number 0.3, so it is inconsistent with the constants in the code.

A fairly comprehensive treatment of floating-point arithmetic is "What Computer Scientists Should Know About Floating-Point Arithmetic". For a more understandable explanation, see floating-point-gui.de.

The same problem exists with ordinary decimal (base 10) numbers, which is why a number like 1/3 ends up as 0.333333333...

You just encountered a number (3/10) that is easy to represent in the decimal system, but cannot be represented in the binary system. The situation is also reversed (in a way): 1/16 is an ugly number in decimal (0.0625), but looks neat in binary like 1/10000 in decimal (0.0001)** - If we were used to using the binary number system in our daily lives, you would even look at that number and instinctively understand that you could get it by constantly folding in half.

Of course, this is not exactly how floating point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the problem that binary floating point precision errors tend to arise because the "real world" numbers we are usually interested in tend to be powers of ten - but only because we use a decimal number system on a daily basis. This is why we say 71% instead of "5 out of every 7" (71% is an approximation, since no decimal number can represent 5/7 exactly).

So no, binary floating point numbers are not broken, they are just imperfect like any other base N number system :)

In practice, this precision issue means that you need to use a rounding function to round the floating point number to the number of decimal places you are interested in before displaying it.

You also need to replace the equality test with a comparison that allows a certain tolerance, which means:

Don’t use if (x == y) { ... }

Instead use if (abs(x - y) < myToleranceValue) { ... }.

Where abs is the absolute value function. The myToleranceValue needs to be chosen based on your specific application - this has a lot to do with how much "wiggle room" you are prepared to allow, and the maximum number you are comparing to (due to precision loss issues). Please note the "epsilon" style constants in your chosen language. These constants can be used as tolerance values, but their effectiveness depends on the size of the numbers you are working with (since large number calculations may exceed the epsilon threshold).