MATLAB double-precision arithmetic accuracy

I am having trouble understanding how the precision of doubles affects the result of arithmetic operations in MATLAB. I thought that since both a and b are doubles, operations on them would be carried out at that precision. I understand that there may be rounding error, but since both numbers are well within what a 64-bit representation can hold, I did not expect this to be a problem.

    a = 1.22e-45
    b = 1
    a == 0
    ans = 0      % a is not equal to zero
    (a + b) == 1
    ans = 1

How can MATLAB carry enough precision to recognize that a ~= 0, yet show no change at all when a is added to 1?

+5
3 answers

64-bit IEEE-754 floating-point numbers have enough precision (a 53-bit mantissa) to represent about 16 significant decimal digits. But distinguishing (1 + a) = 1.00...000122 from 1.000 in your example would require more than 45 significant decimal digits.
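As a rough sketch of that digit count, the snippet below compares the number of decimal digits a 53-bit mantissa can hold with the number of orders of magnitude separating a and b. The variable names digits_available and orders_apart are only illustrative.

```matlab
a = 1.22e-45;
b = 1;

% Decimal digits carried by a 53-bit mantissa (a little under 16).
digits_available = 53 * log10(2);

% Orders of magnitude between b and a (a little under 45).
orders_apart = log10(b / a);

fprintf('digits available: %.2f\n', digits_available);
fprintf('orders of magnitude between b and a: %.2f\n', orders_apart);

% a sits far below the last digit that b + a can keep.
too_small = orders_apart > digits_available;
```

Whenever orders_apart exceeds digits_available by this much, the smaller operand cannot influence the sum.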

+6

Floating point means exactly that: the precision is relative to the scale of the number itself.

In the specific example you gave, 1.22e-45 can be represented on its own, since the exponent can be adjusted to represent 10^-45, or approximately 2^-150.

On the other hand, 1.0 is represented in binary format with a scale of 2 ^ 0 (i.e. 1).

To add these two values, you need to align their decimal points (er... binary points), which means that all of the precision of 1.22e-45 is shifted about 150 bits to the right.

Of course, IEEE double-precision floating-point values have only 53 bits of mantissa (precision), which means that at the scale of 1.0, 1.22e-45 is effectively zero.
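The alignment argument can be checked with the two-output form of log2, which splits a number into a fraction and a power of two. This is only a sketch; the names shift, mantissa_bits and lost are illustrative.

```matlab
a = 1.22e-45;
b = 1;

% [f, e] = log2(x) splits x into f * 2^e with 0.5 <= f < 1.
[fa, ea] = log2(a);
[fb, eb] = log2(b);

shift = eb - ea;       % bit positions a must move right to line up with b
mantissa_bits = 53;    % mantissa width of an IEEE-754 double

fprintf('a = %.4f * 2^%d\n', fa, ea);
fprintf('b = %.4f * 2^%d\n', fb, eb);
fprintf('shift = %d bits, mantissa = %d bits\n', shift, mantissa_bits);

% true: every bit of a falls off the end once it is aligned with b
lost = shift > mantissa_bits;
```

The required shift is roughly three times the mantissa width, so nothing of a survives in the sum.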

+6

To add to the other answers, you can use the MATLAB EPS function to visualize the precision problem you are running into. For a given double-precision floating-point number, EPS tells you the distance from it to the next larger representable floating-point number:

    >> a = 1.22e-45;
    >> b = 1;
    >> eps(b)
    ans = 2.2204e-016

Note that the next floating-point number greater than 1 is 1.00000000000000022204..., and the value of a does not come anywhere near half the distance between the two numbers. Consequently, a+b rounds back to exactly 1.
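A small sketch of where the cut-off lies, assuming the default IEEE round-to-nearest mode: anything up to half of eps(b) is rounded away, while anything above half reaches the next double. The variable names are illustrative.

```matlab
a = 1.22e-45;
b = 1;
gap = eps(b);                       % distance from 1 to the next double

lost = (b + a) == b;                % a is far below gap/2, so it vanishes
tie  = (b + gap/2) == b;            % exact halfway case rounds back to 1
kept = (b + 0.75*gap) == b + gap;   % above halfway rounds up to the next double

fprintf('lost = %d, tie = %d, kept = %d\n', lost, tie, kept);
```

All three comparisons should come out true.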

By the way, you can also see why a is considered non-zero, even though it is so small, by looking at the smallest positive normalized double-precision floating-point value using the REALMIN function:

    >> realmin
    ans = 2.2251e-308   % MUCH smaller than a!
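For completeness, here is a brief sketch of how much room there is below a. Note that realmin is the smallest normalized double; subnormal values extend even further down, and eps(0) gives the smallest of them. The variable names are illustrative.

```matlab
a = 1.22e-45;

nonzero         = a ~= 0;           % a is stored as an ordinary nonzero double
above_realmin   = a > realmin;      % a is hundreds of orders of magnitude above realmin
smallest_double = eps(0);           % smallest positive subnormal, 2^-1074
below_realmin   = realmin/2 > 0;    % values below realmin still exist as subnormals

fprintf('realmin = %g, eps(0) = %g\n', realmin, smallest_double);
```

So a is nowhere near the bottom of the representable range. It is simply too small relative to 1 to change the sum.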
+3

Source: https://habr.com/ru/post/917522/

