Example in GotW #67

At http://www.gotw.ca/gotw/067.htm

 int main()
 {
     double x = 1e8;  // float x = 1e8;
     while( x > 0 )
     {
         --x;
     }
 }

When you change double to float, this becomes an infinite loop in VS2008. According to GotW's explanation:

What if float cannot accurately represent all integer values from 0 to 1e8? Then the modified program will start counting down, but will eventually reach a value N that it cannot decrement, for which N - 1 == N (due to insufficient floating point precision)... and then the loop will stay stuck at that value for as long as the machine running the program has power.

From what I understand, an IEEE 754 float is single precision (32 bits), its range should be about ±3.4 × 10^±38, and it should have about 7 significant decimal digits.

But I still don't understand exactly how this happens: "it will eventually reach a value N that it cannot decrement, for which N - 1 == N (due to insufficient floating point precision)". Can someone try to explain this bit?

A bit of additional information: with double x = 1e8 the program finishes after about 1 second; when I change it to float x = 1e8 it runs much longer (still running after 5 minutes); and if I change it to float x = 1e7, it finishes in about 1 second.

My test environment is VS2008.

BTW I am NOT asking for a basic explanation of the IEEE 754 format, as I already understand this.

Thanks.

+6
4 answers

Well, for the sake of argument, let's assume we have a processor that represents floating-point numbers with 7 significant decimal digits and, say, a two-digit decimal exponent. The number 1e8 would then be stored as

 1.000 000 e 08 

(where "." and "e" do not really need to be stored.)

So now you want to calculate "1e8 - 1". 1 is represented as

 1.000 000 e 00 

Now, to perform the subtraction, we first compute the result with infinite precision, then normalize it so that the first digit before the "." is between 1 and 9, and finally round to the nearest representable value (ties to even, say). The infinitely precise result of "1e8 - 1" is

 0.99 999 999 e 08 

or normalized

 9.9 999 999 e 07 

As you can see, the infinitely precise result requires one more significant digit than our architecture provides; therefore we must round (and shift to renormalize) it to 7 significant digits, which yields

 1.000 000 e 08 

Therefore you get "1e8 - 1 == 1e8", and your loop never terminates.

Now, in reality you are using IEEE 754 binary floats, which are slightly different, but the principle is exactly the same.

+8

The operation x-- is (in this case) equivalent to x = x - 1. That means the original value of x is taken, 1 is subtracted from it as if with infinite precision (as specified in IEEE 754-1985), and the result is then rounded to the nearest value in the float value space.

The rounded results for the numbers 1.0e8f + i are given below for i in [-10; 10]:

 -10: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -9: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -8: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -7: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -6: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -5: 9.9999992E7   (binary +|10011001|01111101011110000011111)
  -4: 1.0E8         (binary +|10011001|01111101011110000100000)
  -3: 1.0E8         (binary +|10011001|01111101011110000100000)
  -2: 1.0E8         (binary +|10011001|01111101011110000100000)
  -1: 1.0E8         (binary +|10011001|01111101011110000100000)
   0: 1.0E8         (binary +|10011001|01111101011110000100000)
   1: 1.0E8         (binary +|10011001|01111101011110000100000)
   2: 1.0E8         (binary +|10011001|01111101011110000100000)
   3: 1.0E8         (binary +|10011001|01111101011110000100000)
   4: 1.0E8         (binary +|10011001|01111101011110000100000)
   5: 1.00000008E8  (binary +|10011001|01111101011110000100001)
   6: 1.00000008E8  (binary +|10011001|01111101011110000100001)
   7: 1.00000008E8  (binary +|10011001|01111101011110000100001)
   8: 1.00000008E8  (binary +|10011001|01111101011110000100001)
   9: 1.00000008E8  (binary +|10011001|01111101011110000100001)
  10: 1.00000008E8  (binary +|10011001|01111101011110000100001)

So you can see that 1.0e8f, 1.0e8f + 4, and several other numbers all have the same representation. Since you already know the details of the IEEE 754-1985 floating-point format, you also know that the excess digits have to be rounded away.

+3

What is the result of n - 1 if n - 1 and n have the same representation, due to the approximate nature of floating-point numbers?

+1

Regarding "reaching" a value for which N - 1 == N, I think Herb was allowing for the possibility of completely esoteric floating-point representations.

With any ordinary floating-point representation, you will either get stuck on the very first value, or you will be somewhere within the contiguous range of integers centered around zero that can be represented exactly, in which case the countdown will succeed.

For IEEE 754, the 32-bit representation, usually float in C++, has 23 mantissa bits, and the 64-bit representation, usually double in C++, has 52 mantissa bits. This means that with double you can represent at least all integers in the range -(2^52 - 1) ... 2^52 - 1. I'm not quite sure whether the range extends by another factor of 2; I get a little dizzy thinking about it. :-)

Cheers and hth.,

+1

Source: https://habr.com/ru/post/894628/

