Floating-point limits (double) defined with the long double suffix L

Question 1:

I have a question about the definitions of DBL_MAX and DBL_MIN on Linux with gcc v4.8.5.
They are defined in float.h as:

 #define DBL_MAX __DBL_MAX__
 #define DBL_MIN __DBL_MIN__

where __DBL_MIN__ and __DBL_MAX__ are compiler specific and can be obtained with:

 $ gcc -dM -E - < /dev/null
 ...
 #define __DBL_MAX__ ((double)1.79769313486231570815e+308L)
 #define __DBL_MIN__ ((double)2.22507385850720138309e-308L)
 ...

My question is:
Why are the values defined as long double with the suffix L and then cast back to double?

Question 2:

Why is __DBL_MIN_10_EXP__ defined as -307 when the minimum exponent is -308, as used above in the DBL_MIN macro? The maximum exponent is defined as 308, which I can understand, since that is what the DBL_MAX macro uses.

 #define __DBL_MAX_10_EXP__ 308
 #define __DBL_MIN_10_EXP__ (-307)

Not part of the question, just some observations I made:

On Windows with Visual Studio 2015, only the DBL_MAX and DBL_MIN macros are defined, without redirecting to compiler-specific versions prefixed with underscores. Furthermore, the minimum positive double value DBL_MIN and the maximum double value DBL_MAX are written slightly larger than the values from my Linux gcc compiler (compared only against the macros from gcc v4.8.5 shown above):

 #define DBL_MAX 1.7976931348623158e+308
 #define DBL_MIN 2.2250738585072014e-308

In addition, the Microsoft compiler sets the long double limits to the limits of double, so it seems that it does not support a real long double implementation.

2 answers

Specifying binary floating-point numbers in decimal has subtle issues.

Why are the values defined as long double with the suffix L, and then cast back to double?

With typical binary64, the maximum finite value is about 1.798e+308, or exactly:

 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368 

The number of digits needed to convert to a unique double may be as many as DBL_DECIMAL_DIG (typically 17 and at least 10). In any case, using exponential notation is certainly clear without being overly precise.

 /*
 1 2345678901234567 */
 // Sorted
 1.79769313486231550856124... // DBL_MAX next smallest for reference
 1.79769313486231570814527... // Exact
 1.79769313486231570815e+308L // gcc
 1.7976931348623158e+308      // VS (just a hair closer to exact than "next largest")
 1.7976931348623159077293.... // DBL_MAX next largest if not limited by range

Various compilers may not convert this string exactly as hoped. Some ignore a few of the least significant digits, although this is controlled by the compiler.

Another source of subtle conversion differences, and I expect this is why the "L" is added: double calculations are affected by the floating-point unit, which might not adhere exactly to the IEEE standard. In the worst case, the constant 1.797...e+308 could convert to infinity because of minute errors in a "text to double" conversion carried out with double math. When converting to long double instead, the conversion errors are very small. Converting the long double result to double then rounds to the hoped-for number.

In short, forcing L math ensures that the constant is not inadvertently turned into an infinity.

I would expect the following, which matches neither gcc nor VS, to be sufficient with an IEEE 754 compliant FPU.

 #define __DBL_MAX__ 1.7976931348623157e+308 

The cast back to double is there to make DBL_MAX a double. This meets the expectation of much code that DBL_MAX is a double, not a long double. I do not see a specification that requires this, however.

Why is DBL_MIN_10_EXP defined as -307 when the minimum exponent is -308?

This is consistent with the definition of DBL_MIN_10_EXP : "... the minimum negative integer such that 10 raised to this power is in the range of normalized floating point numbers." The non-integer answer lies between -307 and -308 (log10 of DBL_MIN is about -307.65), so the minimum integer in range is -307.

Observations part

Although VS treats long double as a distinct type, it uses the same encoding as double, so there is no numerical advantage in using L.


I do not know why the L suffix is used.

This site has an IEEE 754 floating point overview.

The exponent is 11 bits with a bias of 1023. However, the exponent fields 0 and 2047 are reserved for special numbers. This means the exponent can range from 2046 - 1023 = 1023 down to 1 - 1023 = -1022.

So, for the maximum normalized value, the exponent factor is 2^1023. The maximum mantissa is slightly below 2 (1.111..., with 52 ones after the point, in binary), which gives ~ 2 * 2^1023 = ~ 1.79e308.

For the minimum normalized value, the exponent factor is 2^-1022. The minimum mantissa is exactly 1, giving the value 1 * 2^-1022 = ~ 2.22e-308. So far so good.

DBL_MIN_10_EXP and DBL_MAX_10_EXP are the min / max powers of 10 that fall in the normalized range. For the max, 1e308 is less than ~ 1.79e308, so the value is 308. For the min, 1e-308 is too small: it is less than ~ 2.22e-308. 1e-307 is greater than ~ 2.22e-308, so the value is -307.


Source: https://habr.com/ru/post/1269290/

