What print precision is required for a __float128 so that no information is lost?

I am trying to print a __float128 using libquadmath, for example:

quadmath_snprintf(s, sizeof(s), "%.30Qg", f); 

With the following three constraints:

  • The result should conform to the following grammar:

      number = [ minus ] int [ frac ] [ exp ]
      decimal-point = %x2E       ; .
      digit1-9 = %x31-39         ; 1-9
      e = %x65 / %x45            ; e E
      exp = e [ minus / plus ] 1*DIGIT
      frac = decimal-point 1*DIGIT
      int = zero / ( digit1-9 *DIGIT )
      minus = %x2D               ; -
      plus = %x2B                ; +
      zero = %x30                ; 0
  • For any input __float128 "i" that is printed as a string "s" matching the above production, and "s" is then scanned back into a __float128 "j", "i" must be bitwise identical to "j", i.e. no information may be lost. For at least some values (NaN, Infinity) this is impossible; what is the complete list of such values?

  • There should be no other string that satisfies the two criteria above and is shorter than the candidate.

Is there a quadmath_snprintf format string that satisfies the above (1, 3, and 2 when possible)? If so, what is it?

What are the __float128 values that cannot be represented accurately enough to satisfy point 2 above (e.g. NaN, +/- Infinity, etc.)? How do I determine whether a __float128 holds one of these values?

2 answers

If you are on x86, then GCC's __float128 type is a software implementation of the IEEE 754-2008 binary128 format. The IEEE 754 standard requires that a binary -> char -> binary round trip recover the original value if the character representation carries 36 significant (decimal) digits. So the format string "%.36Qg" should do the trick.
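For illustration, a minimal round-trip check, assuming GCC linked with -lquadmath; quadmath_snprintf and strtoflt128 are libquadmath's actual printing and scanning functions, while the helper name round_trips_exactly is made up:

    #include <quadmath.h>
    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if printing with %.36Qg and scanning back
       reproduces the original bits exactly. */
    static int round_trips_exactly(__float128 i)
    {
        char s[64];  /* ample room for sign, 36 digits, point, exponent */
        quadmath_snprintf(s, sizeof(s), "%.36Qg", i);
        __float128 j = strtoflt128(s, NULL);
        /* compare bit patterns, not values, so signed zeros are distinguished */
        return memcmp(&i, &j, sizeof i) == 0;
    }

    int main(void)
    {
        __float128 f = 1.0Q / 3.0Q;  /* a value with an infinite decimal expansion */
        printf("%s\n", round_trips_exactly(f) ? "lossless" : "lossy");
        return 0;
    }

Compile with: gcc roundtrip.c -lquadmath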

A NaN is not required to round-trip back to the original bit pattern.
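To detect the values affected by this caveat, a sketch using the classification functions quadmath.h provides (isnanq, isinfq, signbitq, nanq, HUGE_VALQ); the classify helper is made up:

    #include <quadmath.h>
    #include <stdio.h>

    /* Classify a value before relying on the %.36Qg round trip. */
    static const char *classify(__float128 v)
    {
        if (isnanq(v)) return "NaN (payload bits need not survive)";
        if (isinfq(v)) return signbitq(v) ? "-Inf" : "+Inf";
        return "finite (round-trippable)";
    }

    int main(void)
    {
        printf("%s\n", classify(nanq("")));     /* quiet NaN */
        printf("%s\n", classify(HUGE_VALQ));    /* +infinity */
        printf("%s\n", classify(1.0Q / 3.0Q));  /* ordinary finite value */
        return 0;
    }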

As for your requirement #3, libquadmath does not contain code for that kind of “shortest representation” formatting, e.g. in the spirit of the Steele & White paper or David Gay's code.


My intuition tells me that the binary fraction 0.1111...1 (128 ones), which also equals 1 - 1/2**128, will produce the largest number of decimal digits when converted to decimal. Convert this value to decimal (I don't have a bignum package handy right now), count the digits, add 2 or 3 on top, and you should be safe. I have no mathematical proof that this is enough.

If I/O accuracy is important, I would prefer to output the float as a hexadecimal string. Exact floating-point I/O is hard to get right, and the library may well have bugs in this regard.
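A sketch of that approach, assuming quadmath_snprintf's "%Qa" conversion (hexadecimal floating point, which is exact by construction) and assuming strtoflt128 accepts C99 hex-float syntax the way strtod does:

    #include <quadmath.h>
    #include <stdio.h>

    int main(void)
    {
        __float128 f = 1.0Q / 3.0Q;
        char s[64];

        /* %Qa prints the exact significand in hex, e.g. 0x1.5555...p-2 */
        quadmath_snprintf(s, sizeof(s), "%Qa", f);
        printf("hex form: %s\n", s);

        __float128 g = strtoflt128(s, NULL);  /* scan it back */
        printf("round trip exact: %s\n", f == g ? "yes" : "no");
        return 0;
    }

Note that f == g is false for a NaN even when the bits match, so a bitwise memcmp is the right test there.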


Source: https://habr.com/ru/post/907099/

