C++ (double)0.700 * int(1000) => 699 (not a double precision issue)

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I tried various ways of casting scaledvalue2, but only when I first stored the multiplication result in a double variable and then converted that to int did I get the desired result .. but I cannot explain why???

I know that double precision (0.7 is stored as 0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not.

I would expect both to fail if accuracy is a problem.

I do NOT need a solution to fix it .. I just want to know WHY??? (Is the issue fixed?)

 #include <sstream>
 #include <cstdio>

 int main() {
     double value = 0.7;
     int scaleFactor = 1000;
     double doubleScaled = (double)scaleFactor * value;
     int scaledvalue1 = doubleScaled;                            // = 700
     int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
     int scaledvalue3 = (double)(1000.0 * 0.7);                  // = 700
     std::ostringstream oss;
     oss << scaledvalue2;
     printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
            value, scaleFactor, doubleScaled, scaledvalue1, scaledvalue2,
            scaledvalue3, oss.str().c_str());
 }

or briefly:

 double value = 0.6999999999999999555910790149937383830547332763671875;
 int scaledvalue_a = (double)(1000 * value); // = 699 ??
 int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700

I cannot understand what is wrong here.

Output:

 convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699] 

vendor_id: GenuineIntel

cpu family: 6

model: 54

Model Name: Intel(R) Atom(TM) CPU N2600 @ 1.60GHz

+5
5 answers

This will be a bit of hand-waving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.

The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically handle floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the cost of making the results slightly less predictable. Speed matters for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions, and the numerics community screamed in pain. Java had to give in to the real world and relax those requirements.

 double f();
 double g();
 double d = f() + g();           // 1
 double dd1 = 1.6 * d;           // 2
 double dd2 = 1.6 * (f() + g()); // 3

On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are actually done with 80 bits of precision (unless you set some mode bits that kill performance, as Java originally required), even though double and float are 64 and 32 bits respectively. So for arithmetic operators the operands are converted up to 80 bits, and the results are converted back down to 64 or 32 bits. That conversion is slow, so the generated code typically delays it as long as possible, doing all of the calculation with 80-bit precision.

But C and C++ both require that when a value is stored into a floating-point variable, the conversion must be performed. So, formally, in line // 1 the compiler must convert the sum back to 64 bits in order to store it in the variable d. Then the value of dd1, computed in line // 2, must be computed using the value that was stored in d, i.e., a 64-bit value, while the value of dd2, computed in line // 3, can be computed using f() + g(), i.e., the full 80-bit value. Those extra bits can make a difference, and the value of dd1 might differ from the value of dd2.

And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it computes the value of dd1. That is a non-conforming optimization, but as far as I know, every compiler does this sort of thing by default. They all have command-line switches to enforce the strictly required behavior, so if you want slower code you can get it.

For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.

+1

Since the x86 floating-point unit performs its calculations in extended precision (80-bit) format, the result can easily depend on whether intermediate values were forcibly converted to double (the 64-bit floating-point type). In that respect, in non-optimized code it is not unusual for compilers to literally store double variables to memory, yet ignore "unnecessary" casts to double applied to temporary intermediate values.

In your example, the first variant involves storing the intermediate result in a double variable:

 double doubleScaled = (double)scaleFactor * value;
 int scaledvalue1 = doubleScaled; // = 700

The compiler takes this literally and actually stores the product in the doubleScaled variable, which inevitably requires converting the 80-bit product to double. Later, that double value is read from memory again and then converted to int.

The second variant

 int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ?? 

involves conversions the compiler may consider redundant (and they are indeed redundant from the point of view of the abstract C++ machine). The compiler ignores them, which means the final int value is generated directly from the 80-bit product.

The presence of that intermediate conversion to double in the first variant (and its absence in the second) is the reason for the difference.

+1

I converted the assembly code of mindriot's example to Intel syntax for testing with Visual Studio. I could only reproduce the error by setting the floating-point control word to use extended precision.

The issue is that rounding is performed when converting from extended precision to double precision when storing a double, while truncation is performed when converting from extended precision to integer when storing an integer.

The extended-precision multiplication produces the product 699.999..., but the product is rounded up to 700.000... during the conversion from extended to double precision when it is stored in doubleScaled:

 double doubleScaled = (double)scaleFactor * value; 

Since doubleScaled == 700.000..., truncating it to an integer still yields 700:

 int scaledvalue1 = doubleScaled; // = 700 

The product 699.999... is truncated when converted directly to an integer:

 int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ?? 

My guess is that the compiler generated a compile-time constant of 700.000... instead of doing the multiplication at run time:

 int scaledvalue3 = (double)(1000.0 * 0.7); // = 700 

This truncation issue can be avoided by using the round() function from the C standard library:

 int scaledvalue2 = (int)round(scaleFactor * value); // should == 700 
+1

Depending on the compiler and optimization flags, scaledvalue_a, which involves a variable, may be evaluated at run time using the processor's floating-point instructions, while scaledvalue_b, which involves only constants, may be evaluated at compile time using a math library (for example, gcc uses GMP, the GNU Multiple Precision arithmetic library, for this). The difference you are seeing is the difference in precision and rounding between run-time and compile-time evaluation of that expression.

0

Due to rounding errors, most floating-point numbers end up being slightly inaccurate. For the double-to-int conversion below, use the std::ceil() API:

 int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??

-3

Source: https://habr.com/ru/post/1259212/
