What print precision is required for a __float128 so that no information is lost?

I am trying to print a __float128 using libquadmath, for example:

quadmath_snprintf(s, sizeof(s), "%.30Qg", f); 

With the following three constraints:

  • The result should conform to the following grammar:

      number = [ minus ] int [ frac ] [ exp ]
      decimal-point = %x2E       ; .
      digit1-9 = %x31-39         ; 1-9
      e = %x65 / %x45            ; e E
      exp = e [ minus / plus ] 1*DIGIT
      frac = decimal-point 1*DIGIT
      int = zero / ( digit1-9 *DIGIT )
      minus = %x2D               ; -
      plus = %x2B                ; +
      zero = %x30                ; 0
  • For any input __float128 "i" that is printed as a string "s" matching the above production, and "s" is then scanned back into a __float128 "j", "i" must be bitwise identical to "j", i.e. no information may be lost. For at least some values (NaN, Infinity) this is impossible; what is the complete list of such values?

  • There should be no other string that satisfies the two criteria above and is shorter than the candidate.

Is there a quadmath_snprintf format string that satisfies the above (1, 3, and 2 when possible)? If so, what is it?

What are the __float128 values that cannot be represented accurately enough to satisfy point 2 above (e.g. NaN, +/- Infinity, etc.)? How do I determine whether a __float128 holds one of these values?

2 answers

If you are on x86, then GCC's __float128 type is a software implementation of the IEEE 754-2008 binary128 format. The IEEE 754 standard requires that a binary -> char -> binary round trip recover the original value if the character representation carries 36 significant (decimal) digits. So the format string "%.36Qg" should do the trick.
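For illustration, a minimal round-trip check, assuming GCC linked with -lquadmath; quadmath_snprintf and strtoflt128 are libquadmath's actual printing and scanning functions, while the helper name round_trips_exactly is made up:

    #include <quadmath.h>
    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if printing with %.36Qg and scanning back
       reproduces the original bits exactly. */
    static int round_trips_exactly(__float128 i)
    {
        char s[64];  /* ample room for sign, 36 digits, point, exponent */
        quadmath_snprintf(s, sizeof(s), "%.36Qg", i);
        __float128 j = strtoflt128(s, NULL);
        /* compare bit patterns, not values, so signed zeros are distinguished */
        return memcmp(&i, &j, sizeof i) == 0;
    }

    int main(void)
    {
        __float128 f = 1.0Q / 3.0Q;  /* a value with an infinite decimal expansion */
        printf("%s\n", round_trips_exactly(f) ? "lossless" : "lossy");
        return 0;
    }

Compile with: gcc roundtrip.c -lquadmath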

A NaN is not required to round-trip back to the original bit pattern.
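To detect the values affected by this caveat, a sketch using the classification functions quadmath.h provides (isnanq, isinfq, signbitq, nanq, HUGE_VALQ); the classify helper is made up:

    #include <quadmath.h>
    #include <stdio.h>

    /* Classify a value before relying on the %.36Qg round trip. */
    static const char *classify(__float128 v)
    {
        if (isnanq(v)) return "NaN (payload bits need not survive)";
        if (isinfq(v)) return signbitq(v) ? "-Inf" : "+Inf";
        return "finite (round-trippable)";
    }

    int main(void)
    {
        printf("%s\n", classify(nanq("")));     /* quiet NaN */
        printf("%s\n", classify(HUGE_VALQ));    /* +infinity */
        printf("%s\n", classify(1.0Q / 3.0Q));  /* ordinary finite value */
        return 0;
    }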

As for your requirement #3, libquadmath does not contain code for that kind of “shortest representation” formatting, e.g. in the spirit of the Steele & White paper or David Gay's code.


My intuition tells me that the binary fraction 0.1111...1 (128 ones), which also equals 1 - 1/2**128, will produce the largest number of decimal digits when converted to decimal. Convert this value to decimal (I don't have a bignum package handy right now), count the digits, add 2 or 3 on top, and you should be safe. I have no mathematical proof that this is enough.

If I/O accuracy is important, I would prefer to output the float as a hexadecimal string. Exact floating-point I/O is hard to get right, and the library may well have bugs in this regard.
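A sketch of that approach, assuming quadmath_snprintf's "%Qa" conversion (hexadecimal floating point, which is exact by construction) and assuming strtoflt128 accepts C99 hex-float syntax the way strtod does:

    #include <quadmath.h>
    #include <stdio.h>

    int main(void)
    {
        __float128 f = 1.0Q / 3.0Q;
        char s[64];

        /* %Qa prints the exact significand in hex, e.g. 0x1.5555...p-2 */
        quadmath_snprintf(s, sizeof(s), "%Qa", f);
        printf("hex form: %s\n", s);

        __float128 g = strtoflt128(s, NULL);  /* scan it back */
        printf("round trip exact: %s\n", f == g ? "yes" : "no");
        return 0;
    }

Note that f == g is false for a NaN even when the bits match, so a bitwise memcmp is the right test there.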


Source: https://habr.com/ru/post/907099/

