Floating point (double) limits defined by the long double suffix L

Question


1. Question:

I have a question about the definitions of DBL_MAX and DBL_MIN on Linux with gcc v4.8.5.
They are defined in float.h as:

 #define DBL_MAX __DBL_MAX__
 #define DBL_MIN __DBL_MIN__

where __DBL_MIN__ and __DBL_MAX__ are compiler specific and can be obtained with:

 $ gcc -dM -E - < /dev/null
 ...
 #define __DBL_MAX__ ((double)1.79769313486231570815e+308L)
 #define __DBL_MIN__ ((double)2.22507385850720138309e-308L)
 ...

My question is:
Why are the values defined as long double with the suffix L and then cast back to double?

2. Question:

Why is __DBL_MIN_10_EXP__ defined as -307 when the minimum exponent is -308, as used above in the DBL_MIN macro? For the maximum exponent, it is defined as 308, which I can understand, since that is the exponent used by the DBL_MAX macro.

 #define __DBL_MAX_10_EXP__ 308
 #define __DBL_MIN_10_EXP__ (-307)

Not part of the question, just some observations I made:

By the way, on Windows with Visual Studio 2015, only the DBL_MAX and DBL_MIN macros are defined, without redirecting to compiler-specific underscore-prefixed versions. Furthermore, the minimum positive double value DBL_MIN and the maximum double value DBL_MAX are slightly larger than the values from my Linux gcc compiler (compared only against the gcc v4.8.5 macros shown above):

 #define DBL_MAX 1.7976931348623158e+308
 #define DBL_MIN 2.2250738585072014e-308

In addition, the Microsoft compiler sets the long double limits to the same values as the double limits, so it seems that it does not support a real long double implementation.

+5

gcc floating-point

Andre Kampling Jun 28 '17 at 12:22


2 answers

I do not know why the L suffix is used.

This site has an IEEE 754 floating point overview.

The exponent is 11 bits with a bias of 1023. However, the exponent fields 0 and 2047 are reserved for special numbers. This means that the exponent can range from 2046-1023 = 1023 down to 1-1023 = -1022.

So, for the maximum normalized value, the exponent part is 2 ^ 1023. The maximum value for the mantissa is slightly below 2 (1.111... with 52 ones after the point, in binary), which gives ~ 2 * 2 ^ 1023 = ~ 1.79e308.

For the minimum normalized value, the exponent part is 2 ^ -1022. The minimum mantissa is exactly 1, giving the value 1 * 2 ^ -1022 = ~ 2.22e-308. So far so good.

DBL_MIN_10_EXP and DBL_MAX_10_EXP are the min / max exponents of powers of 10 that are still normalized values. For the max, 1e308 is less than ~ 1.79e308, therefore the value is 308. For the min, 1e-308 is too small: it is less than ~ 2.22e-308. 1e-307 is greater than ~ 2.22e-308, so the value is -307.

+1

Paul floyd Jun 28 '17 at 14:15


chux · Accepted Answer · 2017-06-28T15:52:39+0000

Defining binary floating-point numbers in decimal format has subtle issues.

Why are the values defined as long double with the suffix L, and then cast back to double?

With typical binary64, the maximum finite value is about 1.798e+308, or exactly:

 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368

The number of digits needed to convert to a unique double may be as many as DBL_DECIMAL_DIG (typically 17 and at least 10). In any case, using exponential notation is certainly clear without being overly precise.

 /* 1 2345678901234567 */
 // Sorted
    1.79769313486231550856124...  // DBL_MAX next smallest for reference
    1.79769313486231570814527...  // Exact
    1.79769313486231570815e+308L  // gcc
    1.7976931348623158e+308       // VS (just a hair closer to exact than "next largest")
    1.7976931348623159077293....  // DBL_MAX next largest if not limited by range

Various compilers may not convert this string exactly as hoped, sometimes ignoring some of the least significant digits, although this is under the compiler's control.

Another source of subtle conversion differences, and I expect that this is why the "L" is added: the double calculation is affected by the floating point processor, which might not conform exactly to the IEEE standard. A worse result may be that the constant 1.797...e+308 converts to infinity due to minute errors in the "code to double" conversion done with double math. When converting to long double instead, those conversion errors are very small. Converting the long double result to double then rounds to the hoped-for number.

In short, forcing L math ensures that the constant is not inadvertently turned into infinity.

I would expect the following, which matches neither gcc nor VS, to be sufficient with an IEEE 754 standard compliant FPU.

 #define __DBL_MAX__ 1.7976931348623157e+308

The cast back to double is there to make DBL_MAX a double. This meets the expectation of much code that DBL_MAX is a double and not a long double. I do not see a specification that requires this, however.

Why is DBL_MIN_10_EXP defined with -307, but the minimum is -308?

This is to comply with the definition of DBL_MIN_10_EXP: "... the minimum negative integer such that 10 raised to that power is in the range of normalized floating-point numbers." The non-integer answer is between -307 and -308, so the minimum integer in range is -307.

Observation part

Although VS treats long double as a distinct type, it uses the same encoding as double, so there is no numerical advantage to using L.
