Floating point accuracy calculation (K & R 2-1)

I found Stevens Computing Services - K and R Exercise 2-1 to be a very thorough answer to K & R 2-1. This piece of the complete code computes the maximum value of type float in the C programming language.

Unfortunately, my theoretical understanding of float values is quite limited. I know that they are composed of a significand (mantissa) and a magnitude that is a power of 2.

    #include <stdio.h>
    #include <limits.h>
    #include <float.h>

    int main(void)
    {
        float flt_a, flt_b, flt_c, flt_m;

        /* FLOAT */
        printf("\nFLOAT MAX\n");
        printf("<float.h> %E ", FLT_MAX);

        flt_a = 2.0;
        flt_b = 1.0;
        while (flt_a != flt_b) {
            flt_m = flt_b;              /* MAX POWER OF 2 IN MANTISSA */
            flt_a = flt_b = flt_b * 2.0;
            flt_a = flt_a + 1.0;
        }
        flt_m = flt_m + (flt_m - 1);    /* MAX VALUE OF MANTISSA */

        flt_a = flt_b = flt_c = flt_m;
        while (flt_b == flt_c) {
            flt_c = flt_a;
            flt_a = flt_a * 2.0;
            flt_b = flt_a / 2.0;
        }
        printf("COMPUTED %E\n", flt_c);
        return 0;
    }

I understand that the latter part basically checks to what power of 2 the value can be raised, using a three-variable algorithm. What about the first part?

I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried tracing a few small numbers to check how it should work, and it failed to find the right values...

==========================================================================

What are the concepts this program is based on, and does this program become more accurate as you need to find longer and non-integer numbers?

1 answer

The first loop determines the number of bits contributing to the significand by finding the least power of 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that is the n-th power of two, then the significand uses n bits, because with n bits you can express all integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
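
As a rough illustration of that search, here is a minimal sketch that counts n directly and can be compared with FLT_MANT_DIG from <float.h>. The function name significand_bits is hypothetical, the volatile qualifiers are an addition of this sketch (they force each result to be rounded to float before it is compared), and the code assumes base 2 and the default round-to-nearest mode.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: returns n such that 2^n is the least power of two
   for which adding 1 in float arithmetic leaves the value unchanged. */
int significand_bits(void)
{
    volatile float power = 1.0f;   /* current power of two, starts at 2^0 */
    volatile float bumped;         /* power + 1, rounded to float */
    int n = 0;

    assert(FLT_RADIX == 2);        /* assumption: binary representation */

    for (;;) {
        bumped = power + 1.0f;
        if (bumped == power)       /* the added 1 was rounded away */
            break;
        power = power * 2.0f;
        n++;
    }
    return n;
}
```

Calling significand_bits() from main and printing the result next to FLT_MANT_DIG should show the same number; on an IEEE 754 implementation both are 24.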

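By the same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less than that power. When the loop ends, flt_m holds the previous power of two, 2^(n-1), so the statement flt_m = flt_m + (flt_m - 1) yields 2^n - 1, and that value is recorded in the variable flt_m.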
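
To make the "one less" step concrete, the sketch below builds 2^n - 1 the same way and then checks that unit precision really is lost just above it. Both function names are hypothetical, n is taken from FLT_MANT_DIG rather than searched for, and the volatile qualifiers are an addition that forces rounding to float; base 2 and round-to-nearest are assumed.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: builds 2^FLT_MANT_DIG - 1 as m + (m - 1),
   where m = 2^(FLT_MANT_DIG - 1). */
float max_unit_precision(void)
{
    volatile float m = 1.0f;
    int i;

    assert(FLT_RADIX == 2);            /* assumption: binary representation */

    /* Double up to 2^(n-1), the last power of two where adding 1 still registers. */
    for (i = 1; i < FLT_MANT_DIG; i++)
        m = m * 2.0f;

    m = m + (m - 1.0f);                /* 2^(n-1) + (2^(n-1) - 1) = 2^n - 1 */
    return m;
}

/* Hypothetical helper: returns 1 if x + 1 and x + 2 round to the same float,
   meaning consecutive integers just above x are no longer distinguishable. */
int unit_precision_lost_above(float x)
{
    volatile float plus_one = x + 1.0f;
    volatile float plus_two = x + 2.0f;
    return plus_one == plus_two;
}
```

With a 24-bit significand, max_unit_precision() returns 16777215, and unit_precision_lost_above() reports 1 for that value but 0 for the value one below it.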

The second loop then tests for the maximum exponent by starting with the maximum unit-precision value and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float value is the value before that final doubling.
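
The doubling test can be sketched on its own as follows. The function name largest_by_doubling and its doublings out-parameter are hypothetical, and the sketch assumes an implementation (such as IEEE 754) where overflow produces infinity instead of trapping, so that halving the overflowed result no longer gives back the original value.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: doubles start until doubling can no longer be undone
   by halving, and returns the last value for which it could.
   If doublings is non-null, it receives the number of successful doublings. */
float largest_by_doubling(float start, int *doublings)
{
    volatile float value = start;
    volatile float doubled, halved;
    int count = 0;

    assert(FLT_RADIX == 2);            /* assumption: binary representation */

    for (;;) {
        doubled = value * 2.0f;        /* becomes infinity past the largest exponent */
        halved = doubled / 2.0f;
        if (halved != value)           /* the round trip failed: overflow */
            break;
        value = doubled;
        count++;
    }
    if (doublings)
        *doublings = count;
    return value;
}
```

Starting from 16777215.0f (2^24 - 1, assuming a 24-bit significand), the returned value should compare equal to FLT_MAX, and the count of doublings should equal FLT_MAX_EXP - FLT_MANT_DIG.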

Note, by the way, that all of the above assumes a base-2 floating point representation. You are unlikely to encounter anything else, but C does not actually require any specific representation.
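
If you want to check that assumption on your own implementation, <float.h> reports the radix as FLT_RADIX. A small sketch, with hypothetical function names:

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: 1 if the floating types use base 2, 0 otherwise. */
int float_is_binary(void)
{
    return FLT_RADIX == 2;
}

/* Prints the model parameters that <float.h> reports for type float. */
void print_float_model(void)
{
    printf("radix %d, %d significand digits, max exponent %d\n",
           FLT_RADIX, FLT_MANT_DIG, FLT_MAX_EXP);
}
```

If float_is_binary() returns 0, the program in the question should not be expected to reproduce FLT_MAX.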

Regarding the second part of your question,

Does this program become more accurate as you need to find longer and non-integer numbers?

the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision and with numbers larger than can be represented by type int.
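
One way to see that the bit counts do not matter is to run the same two loops on double, which has a wider significand and exponent. This is only a sketch: the function name computed_dbl_max is hypothetical, the volatile qualifiers are an addition that forces rounding to double, and it assumes base 2, round-to-nearest, and overflow to infinity.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: the algorithm from the question, applied to double. */
double computed_dbl_max(void)
{
    volatile double a = 2.0, b = 1.0, c, m = 1.0;

    assert(FLT_RADIX == 2);        /* assumption: binary representation */

    /* First loop: find the last power of two where adding 1 still registers. */
    while (a != b) {
        m = b;
        b = b * 2.0;
        a = b + 1.0;
    }
    m = m + (m - 1.0);             /* largest value with unit precision */

    /* Second loop: double until halving no longer undoes the doubling. */
    a = m;
    b = m;
    c = m;
    while (b == c) {
        c = a;
        a = a * 2.0;
        b = a / 2.0;
    }
    return c;                      /* value before the final doubling */
}
```

On an IEEE 754 implementation the result should compare equal to DBL_MAX, even though nothing in the code mentions 53 significand bits or the larger exponent range.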


Source: https://habr.com/ru/post/982230/

