Do float, double and long double have a guaranteed minimum precision?

Question

From my previous question, "Is floating point precision variable or invariant?", I received an answer that said:

C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.

I looked up these macros. They are in the <cfloat> header. The cplusplus page lists the macros for float, double and long double.

Here are the macros for the minimum precision values:

FLT_DIG 6 or greater
DBL_DIG 10 or greater
LDBL_DIG 10 or greater

If I took these macros at face value, I would assume that float has a minimum decimal precision of 6, and that double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things can be too good to be true.

Therefore, I would like to know: do float, double and long double have a guaranteed minimum decimal precision, and is this minimum decimal precision given by the macro values above?

If not, why?

Note: Assume we are using the C++ programming language.

c++ floating-point language-lawyer floating-point-precision minimum

Bryan Jun 2 '15 at 5:10

4 answers

Do float, double and long double have a guaranteed minimum decimal precision, and is this minimum decimal precision given by the macro values above?

I can't find any place in the C++ standard that guarantees minimum values for decimal precision.

The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 may be helpful:

Example
An 8-bit binary type can represent any two-digit decimal number exactly, but the three-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41).
The standard 32-bit IEEE 754 floating-point type has a 24-bit fractional part (23 bits written, one implied), which may suggest that it can represent seven-digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform, and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the round trip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.

However, the C standard specifies the minimum values that must be supported. From the C standard:

5.2.4.2.2 Characteristics of floating types
...
9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign:
...
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
...
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
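If you want to rely on these minimums, they can be checked at compile time. This is a sketch rather than anything the standard prescribes; the struct DecimalDigits and the function reported_decimal_digits are hypothetical names, and only the three macros come from <cfloat>.

```cpp
#include <cfloat>

// The minimums from the C standard quote: an implementation may report
// larger values for these macros, but not smaller ones.
static_assert(FLT_DIG >= 6, "float must preserve at least 6 decimal digits");
static_assert(DBL_DIG >= 10, "double must preserve at least 10 decimal digits");
static_assert(LDBL_DIG >= 10, "long double must preserve at least 10 decimal digits");

// Hypothetical convenience type collecting what this implementation reports.
struct DecimalDigits {
    int flt;
    int dbl;
    int ldbl;
};

constexpr DecimalDigits reported_decimal_digits() {
    return {FLT_DIG, DBL_DIG, LDBL_DIG};
}
```

The static_assert lines cost nothing at run time; the function is only there so the actual values can be logged or inspected.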

R sahu Jun 2 '15 at 5:32

The C++ standard says nothing specific about limits on floating-point types. You may interpret the incorporation of the C standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2, paragraph 15:

EXAMPLE 1 The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a <float.h> header for type float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38

By this section, float, double and long double have at least these properties.
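The FLT_DIG and FLT_DECIMAL_DIG values in that example follow from the radix b and the number of mantissa digits p, using the formulas the C standard gives for a radix that is not a power of 10: floor((p - 1) * log10(b)) and ceil(1 + p * log10(b)). Here is a rough sketch of that arithmetic; the function names are made up, and the power-of-10 radix case is deliberately left out.

```cpp
#include <cmath>

// Hypothetical helper: decimal digits that survive decimal -> floating -> decimal
// (the *_DIG macros), for a radix that is not a power of 10.
int decimal_digits_preserved(int radix, int mantissa_digits) {
    const double log_radix = std::log10(static_cast<double>(radix));
    return static_cast<int>(std::floor((mantissa_digits - 1) * log_radix));
}

// Hypothetical helper: decimal digits needed for floating -> decimal -> floating
// to be lossless (the *_DECIMAL_DIG macros), same restriction on the radix.
int decimal_digits_to_roundtrip(int radix, int mantissa_digits) {
    const double log_radix = std::log10(static_cast<double>(radix));
    return static_cast<int>(std::ceil(1 + mantissa_digits * log_radix));
}
```

For the artificial format above (radix 16, 6 digits) this gives 6 and 9. For IEEE 754 binary32 (radix 2, 24 digits) it also gives 6 and 9, and for binary64 (radix 2, 53 digits) it gives 15 and 17.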

Jared mulconry Jun 2 '15 at 6:04

To be more specific: since my compiler uses the IEEE 754 standard, the decimal precision is 6 to 9 significant decimal digits for float and 15 to 17 significant decimal digits for double. Also, since long double on my compiler is the same size as double, it too has 15 to 17 significant decimal digits.

These ranges can be verified from "IEEE 754 single-precision binary floating-point format: binary32" and "IEEE 754 double-precision binary floating-point format: binary64", respectively.
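The same ranges can be read from std::numeric_limits instead of looking them up. A small sketch, where DigitRange and decimal_digit_range are hypothetical names; digits10 and max_digits10 are the standard members (max_digits10 requires C++11).

```cpp
#include <limits>

// Hypothetical pair: the lower and upper end of the "6 to 9" style ranges.
struct DigitRange {
    int guaranteed;  // digits10: decimal digits that always survive a round trip
    int needed;      // max_digits10: decimal digits needed to print the value losslessly
};

template <typename F>
constexpr DigitRange decimal_digit_range() {
    return {std::numeric_limits<F>::digits10, std::numeric_limits<F>::max_digits10};
}
```

On an IEEE 754 implementation, decimal_digit_range<float>() should report 6 and 9, and decimal_digit_range<double>() should report 15 and 17. The long double result depends on the compiler and platform.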

Bryan Jun 2 '15 at 16:31

Cheers and hth. · Accepted Answer · 2015-06-02T05:48:35+0000

If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to the floating-point type F.

Otherwise (and in any case), the minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, indisputably for the library, "is incorporated into [the C++] International Standard by reference", as quoted from C++11 §17.5.1.5/1.

Edit: As TC noted in a comment here,

"<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard."

Unfortunately for the formal view, first of all, that quote from C++11 is from section 17.5, which is only informative, not normative. And second, the wording in the C standard that the values shown there are minimums is also in a section (Annex E of the C99 standard) that is informative, not normative. So while this can be regarded as a guarantee in practice, it is not a formal guarantee.

~~One strong indication that the in-practice minimum precision for float is 6 decimal digits, and that no implementation will give less:~~

output operations default to a precision of 6, and this is normative text.

~~Disclaimer: There may be additional language that provides guarantees that I have not noticed. Not very likely, but possible.~~
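The default output precision of 6 is easy to observe with a freshly constructed stream. A minimal sketch; the helper names format_with_defaults and default_stream_precision are invented for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: format a value with a freshly constructed stream, so
// that the default precision and default float formatting apply.
std::string format_with_defaults(double value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Hypothetical helper: the precision a new stream starts with.
std::streamsize default_stream_precision() {
    return std::ostringstream().precision();
}
```

With these defaults, format_with_defaults(3.14159265358979) yields 3.14159, that is, six significant digits.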
