What is the accuracy of floating point arithmetic?

Consider two very simple multiplications below:

 double result1;
 long double result2;

 float var1 = 3.1;
 float var2 = 6.789;
 double var3 = 87.45;
 double var4 = 234.987;

 result1 = var1 * var2;
 result2 = var3 * var4;

Are these multiplications performed by default with greater precision than that of the operands? I mean, in the case of the first multiplication, is it done in double precision, and in the case of the second, on the x86 architecture, in 80-bit extended precision? Or should we cast the operands in the expressions to the higher precision ourselves, as shown below?

 result1 = (double)var1 * (double)var2;
 result2 = (long double)var3 * (long double)var4;

What about the other operations (addition, division, and remainder)? For example, when adding more than two positive single-precision values, can the extra significand bits of double precision reduce rounding errors if double is used to store the intermediate results of the expression?
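
For instance, here is a minimal sketch of the kind of accumulation meant in the last paragraph (the helper name and setup are illustrative, not part of the question):

 #include <cstddef>

 // Sum single-precision values in a double-precision accumulator,
 // rounding back to float only once at the end.
 float sum_floats(const float* v, std::size_t n) {
     double acc = 0.0;                  // extra significand bits for intermediates
     for (std::size_t i = 0; i < n; ++i)
         acc += v[i];                   // each float addend is converted to double
     return static_cast<float>(acc);    // single final rounding to float
 }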

+6
4 answers

Floating point precision

C++11 incorporates the definition of FLT_EVAL_METHOD from C99 into <cfloat>.

 FLT_EVAL_METHOD

 Possible values:
 -1 indeterminable
  0 evaluate just to the range and precision of the type
  1 evaluate float and double as double, and long double as long double
  2 evaluate all as long double
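
To check what your own compiler does, you can simply print the macro; a minimal, self-contained check:

 #include <cfloat>
 #include <cstdio>

 int main() {
 #ifdef FLT_EVAL_METHOD
     std::printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
 #else
     std::printf("FLT_EVAL_METHOD is not defined by this compiler\n");
 #endif
 }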

If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below, are equivalent:

 double var3 = …;
 double var4 = …;

 double r1 = var3 * var4;
 double r2 = (long double)var3 * (long double)var4;

 long double s1 = var3 * var4;
 long double s2 = (long double)var3 * (long double)var4;

If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above the multiplication is performed with the precision of the type long double.

However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, then r1 and r2, and respectively s1 and s2, are not always the same. The multiplications in the computations of r1 and s1 are performed at double precision; the multiplications in the computations of r2 and s2 are performed at the precision of long double.

Getting wide results from narrow arguments

If you are computing results that are intended to be stored in a wider result type than the type of the operands, like result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:

 result2 = (long double)var3 * (long double)var4;

Without this conversion (if you write var3 * var4), then, if the compiler defines FLT_EVAL_METHOD as 0 or 1, the product is computed at double precision, which is a shame, because it is destined to be stored in a long double.

If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3 * (long double)var4 are not needed, but they do no harm either: the expression means exactly the same thing with or without them.

Digression: when the destination format is as narrow as the arguments, is extended precision ever better for intermediate results?

Paradoxically, for a single operation, it is best to round only once, directly to the target precision. The only effect of computing a single multiplication in extended precision is that the result is rounded to extended precision first and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double rounding, and if the compiler uses IEEE 754 floating point, it is never better.
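
To make this concrete, here is a minimal sketch of double rounding on a single operation; an addition is used because exact inputs are easy to construct. The expected outputs assume an x86 target where long double is the 80-bit extended format and FLT_EVAL_METHOD is 0:

 #include <cstdio>

 int main() {
     double a = 9007199254740992.0;  // 2^53, exact in binary64
     double b = 1.00048828125;       // 1 + 2^-11, exact in binary64

     double once  = a + b;           // rounded once, to double
     double twice = static_cast<double>(
         static_cast<long double>(a) + static_cast<long double>(b));

     // The exact sum is 2^53 + 1.000488...; the correctly rounded double
     // is 2^53 + 2, but rounding through 80-bit extended first lands on
     // the tie 2^53 + 1, which then rounds to even, giving 2^53.
     std::printf("rounded once:  %.17g\n", once);   // expect 9007199254740994
     std::printf("rounded twice: %.17g\n", twice);  // expect 9007199254740992
 }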

The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that, when 80-bit extended precision is used for the intermediate computations over binary64 (double) IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold when the intermediate computations are done at binary64 precision, because of the larger rounding errors there.
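
A sketch of that interpolation with forced extended-precision intermediates might look like this (the function name is illustrative; under FLT_EVAL_METHOD == 2 the casts change nothing):

 // Interpolate between u2 and u3. The intermediates are computed in
 // long double, so on x86 the two products and the sum carry 64
 // significand bits before the single final rounding to double.
 double interpolate(double u1, double u2, double u3) {
     return static_cast<double>(
         static_cast<long double>(u2) * (1.0L - static_cast<long double>(u1))
       + static_cast<long double>(u1) * static_cast<long double>(u3));
 }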

+8

Before multiplication, division, and remainder, the usual arithmetic conversions are applied to the floating point operands:

The usual arithmetic conversions are performed on the operands and determine the type of the result.

§5.6 [expr.mul]

Similarly for addition and subtraction:

The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

§5.7 [expr.add]

The usual arithmetic conversions for floating point types are specified by the standard as follows:

Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:

[...]

- If either operand is of type long double, the other shall be converted to long double.

- Otherwise, if either operand is double, the other shall be converted to double.

- Otherwise, if either operand is float, the other shall be converted to float.

§5 [expr]

The actual representation and precision of these floating point types is implementation-defined:

The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

§3.9.1 [basic.fundamental]
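
A small sketch of these conversion rules in action (C++11; the result types are verified at compile time and are independent of FLT_EVAL_METHOD):

 #include <type_traits>

 int main() {
     float f = 3.1f;
     double d = 87.45;
     long double ld = 234.987L;

     // The static type of each expression follows the usual
     // arithmetic conversions quoted above.
     static_assert(std::is_same<decltype(f * d), double>::value,
                   "float operand is converted to double");
     static_assert(std::is_same<decltype(d + ld), long double>::value,
                   "double operand is converted to long double");
     static_assert(std::is_same<decltype(f / f), float>::value,
                   "two float operands yield a float result");
     (void)f; (void)d; (void)ld;  // silence unused-variable warnings
 }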

+1
  • For floating point multiplication: FP multipliers internally use twice the width of the operands to generate an intermediate result equal to the real result at infinite precision, and then round it to the target precision. Thus you do not have to worry about multiplication. The result is correctly rounded.
  • For floating point addition, the result is also correctly rounded, since standard FP adders use the three extra guard/round/sticky bits that suffice to compute a correctly rounded result.
  • For division, remainder, and other complex functions, such as the transcendentals sin, log, exp, etc., it depends mainly on the architecture and the libraries used. I recommend the MPFR library if you are looking for correctly rounded results for division or any other complex function (see the sketch after this list).
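
A minimal sketch of a correctly rounded division with MPFR (assuming the library is installed and linked with -lmpfr -lgmp; the 113-bit precision is just an example choice):

 #include <mpfr.h>

 int main() {
     mpfr_t x, y, q;
     mpfr_inits2(113, x, y, q, (mpfr_ptr)0);  // 113-bit significands
     mpfr_set_d(x, 87.45, MPFR_RNDN);   // note: 87.45 is first rounded to double
     mpfr_set_d(y, 234.987, MPFR_RNDN);
     mpfr_div(q, x, y, MPFR_RNDN);      // quotient correctly rounded to 113 bits
     mpfr_printf("x / y = %.30Rg\n", q);
     mpfr_clears(x, y, q, (mpfr_ptr)0);
     return 0;
 }
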
+1

Not a direct answer to your question, but for constant floating point values (such as those given in your question), the approach that loses the least precision is to use a rational representation of each value, an integer numerator divided by an integer denominator, and to perform as many integer multiplications as possible before the final floating point division.

For the floating point values given in your question:

 int var1_num = 31;      int var1_den = 10;
 int var2_num = 6789;    int var2_den = 1000;
 int var3_num = 8745;    int var3_den = 100;
 int var4_num = 234987;  int var4_den = 1000;

 double result1 = (double)(var1_num * var2_num) / (var1_den * var2_den);
 long double result2 = (long double)(var3_num * var4_num) / (var3_den * var4_den);

If any of the integer products is too large to fit in an int, you can use larger integer types:

 unsigned int
 signed long
 unsigned long
 signed long long
 unsigned long long
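
For instance, a sketch of an alternative computation of result2 that widens the products from the code above to 64-bit integers before the division:

 long long num = (long long)var3_num * var4_num;  // 64-bit product, no overflow
 long long den = (long long)var3_den * var4_den;
 long double result2 = (long double)num / den;    // single rounding at the end
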
0

Source: https://habr.com/ru/post/973803/

