How unreliable are floating point values, operators, and functions?

I don't want to introduce floating point where an inaccurate value would be a disaster, so I have a couple of questions about when you can really use it safely.

Are they accurate for integers as long as you don't overflow the number of significant digits? Are these two tests always true?

    double d = 2.0;
    if (d + 3.0 == 5.0) ...
    if (d * 3.0 == 6.0) ...

What math functions can you rely on? Are these tests always true?

    #include <math.h>
    double d = 100.0;
    if (log10(d) == 2.0) ...
    if (pow(d, 2.0) == 10000.0) ...
    if (sqrt(d) == 10.0) ...

How about this:

    int v = ...;
    if (log2((double) v) > 16.0) ...            /* going to need more than 16 bits to store v */
    if (log((double) v) / log(2.0) > 16.0) ...  /* C89 */

I think you can summarize this question as: 1) Can floating-point types hold the exact value of all integers up to the number of their significant digits, as given in float.h? 2) Do all operators and floating-point functions guarantee that the result is close to the actual mathematical result?

2 answers

I, too, find incorrect results unpleasant.

On common hardware, you can rely on + , - , * , / and sqrt working and delivering a correctly rounded result. That is, they deliver the floating-point number closest to the exact sum, difference, product, quotient, or square root of their argument or arguments.

Some library functions, notably log2 and log10 and exp2 and exp10 , traditionally have terrible implementations that are not even faithfully rounded. Faithfully rounded means that the function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar problems. Many of these functions will even blow exact cases such as log10(10000) and pow(7, 2) . Thus, equality comparisons involving these functions, even in exact cases, are asking for trouble.

sin , cos , tan , atan , exp and log have faithfully rounded implementations on every platform I have encountered recently. In the bad old days, on processors that used the x87 FPU to evaluate sin , cos and tan , you would get terribly wrong outputs for largish inputs, and you would get the input back unchanged for still larger inputs. CRlibm has correctly rounded implementations; they are not mainstream because, I am told, they have nastier worst cases than the traditional faithfully rounded implementations.

Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. So do fmod and friends. frexp and friends work. fmin and fmax work.

Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double , then adding z and rounding the result to a double . You can find this behavior on modern platforms. It is stupid and I hate it.

I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.

I should also mention that popular compilers for 32-bit x86 play by a different, broken set of rules. Since the x87 is the only supported floating-point instruction set there, and all x87 arithmetic is done with an extended exponent range, computations that would overflow or underflow in double precision may fail to overflow or underflow. Also, since the x87 uses an extended significand by default, you may not get the results you are looking for. Worse, compilers will sometimes spill intermediate results to variables of lower precision, so you cannot even rely on your computations with double being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)

I would recommend sticking to arithmetic in long double if you are targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know whether this is done universally.

  • Can floating-point types hold the exact value of all integers up to the number of their significant digits in float.h?

Well, they can store integers that fit in their mantissa (significand): [-2^53, 2^53] for double . See this question for more: Which is the first integer that an IEEE 754 float is incapable of representing exactly?

  • Do all operators and floating-point functions guarantee that the result is close to the actual mathematical result?

They at least guarantee that the result will be immediately on one side or the other of the actual mathematical result. That is, you will not get a result that has another representable floating-point value between itself and the "actual" result. But beware: repeated operations can accumulate an error that seems to contradict this, while it does not (because all the intermediate values obey the same constraint, not just the inputs and outputs of the compound expression).


Source: https://habr.com/ru/post/975917/

