Convert a rational arbitrary precision number (OCaml, zarith) to an approximate floating point number

I use the Zarith library to do rational arithmetic of arbitrary precision. Suppose I have a rational number q type Qt , that is, the ratio of two large integers ( q is the module of the rational number of numbers of arbitrary accuracy Zarith). Sometimes I would like to print this number as a floating point number for readability, and sometimes I need to convert this floating point number for subsequent calculations without arbitrary precision. Is there a way to convert q to floating point to a certain level of accuracy?

The method of converting q to a floating point now has no guarantees and can create undefined floating point numbers ( Z is an integer module of arbitrary precision):

 let to_float q = let n, d = num q, den q in (* check if d is zero and raise an error if it is *) let nf, df = Z.to_float n, Z.to_float d in nf /. df 

Is there a better way to handle this, where can I get a floating point that most closely approximates any q ?

Edit

I quickly wrote Mark Dickinson's answer in OCaml if anyone is interested. It can (definitely) be improved and refined. I will edit if I do this, or if anyone has any improvement tips. But for now, this solved my problem!

 let to_float q = let n, d = num q, den q in let n_sign = Z.sign n in let d_sign = Z.sign d in (* always >= 0 *) if d_sign = 0 then raise Division_by_zero; let n = Z.abs n in if n_sign = 0 then 0. else let shift = (Z.numbits n) - (Z.numbits d) - 55 in let is_subnormal = shift < -1076 in let shift = if is_subnormal then -1076 else shift in let d = if shift >= 0 then Z.shift_left d shift else d in let n = if shift < 0 then Z.shift_left n (-shift) else n in let quotient, remainder = Z.div_rem nd in let quotient = if (Z.compare remainder (Z.zero)) = 0 && Z.is_even quotient then Z.add Z.one quotient else quotient in let quotient = if not is_subnormal then quotient else let round_select = Z.to_int @@ Z.rem quotient @@ Z.of_int 8 in Z.add quotient [|Z.zero;Z.minus_one;Z.of_int (-2);Z.one;Z.zero ;Z.minus_one;Z.of_int 2;Z.one|].(round_select) in let unsigned_res = ldexp (Z.to_float quotient) shift in if n_sign = 1 then unsigned_res else -.unsigned_res 

Now I will consider creating an interface for the GMP mpq_get_d function, but I'm not quite sure how to do this. The only way I can see is to convert q : Qt to a string and pass to:

 int mpq_set_str (mpq_t rop, const char *str, int base) 

Does anyone know how to pass rop to mpq_get_d in OCaml or get a link that describes how to do this? I looked through chapter 19 of the RWO and did not see such a situation.

+5
source share
2 answers

If you have access to

  • integer log2 operation and
  • the ability to left shift an integer by a given number of bits

then it is relatively easy to collapse your correct rounded transform. In a nutshell, the method is as follows:

  • Reduce to the case n > 0 , d > 0 ; filter out obvious underflow / overflow
  • Choose the integer shift , so 2^-shift*n/d is between 2^54 and 2^56 .
  • Use integer arithmetic to compute x = 2^-shift*n/d rounded to the nearest integer using the round-to-odd rounding method.
  • Convert x to the closest IEEE 754 value with double precision dx , with the usual rounding mode to even rounding.
  • Return ldexp(dx, shift) .

I'm afraid I don't own OCaml, but the following Python code illustrates the idea of ​​positive inputs. I leave this for you to make obvious changes for negative inputs and for division by zero. You can also make an early return for cases of extreme overflow and underflow: they are easy to spot by looking for larger or smaller shift values ​​below.

 from math import ldexp def to_float(numerator, denominator): """ Convert numerator / denominator to float, correctly rounded. For simplicity, assume both inputs are positive. """ # Shift satisfies 2**54 < (numerator / denominator) / 2**shift < 2**56 shift = numerator.bit_length() - denominator.bit_length() - 55 # Divide the fraction by 2**shift. if shift >= 0: denominator <<= shift else: numerator <<= -shift # Convert to the nearest integer, using round-to-odd. q, r = divmod(numerator, denominator) if r != 0 and q % 2 == 0: q += 1 # Now convert to the nearest float and shift back. return ldexp(float(q), shift) 

Some notes:

  • The bit_length method for a positive integer n gives the number of bits needed to represent n , or, in other words, 1 + floor(log2(n)) .
  • divmod is a Python function that simultaneously calculates the factor and remainder of integer division.
  • q value fits (easily) into a 64-bit integer
  • we round two times: once when converting the shifted numerator / denominator to the nearest integer and again when rounding this integer to a float. In the first round, the round-to-odd method is used; this ensures that the second round (implicit in the conversion from int to float) gives the same result as if we rounded the fraction directly in the float.
  • the above algorithm does not correctly handle fractions whose converted float value is subnormal: in this case, the ldexp operation represents a potentially third rounding. This can be taken care of with some caution. See below code.

The above is actually a simplified version of the algorithm that Python uses when dividing one (large) integer by another to get a floating point result. Here you can see the source here . The comment at the beginning of the long_true_divide function gives an overview of the method.

For completeness, here is an option that also applies correctly to subnormal results.

 def to_float(numerator, denominator): """ Convert numerator / denominator to float, correctly rounded. For simplicity, assume both inputs are positive. """ # Choose shift so that 2**54 < numerator / denominator / 2**shift < 2**56 shift = numerator.bit_length() - denominator.bit_length() - 55 # The 'treat_as_subnormal' flag catches all cases of subnormal results, # along with some cases where the result is not subnormal but *is* still # smaller than 2**-1021. In all these cases, it sufficient to find the # closest integer multiple of 2**-1074. We first round to the nearest # multiple of 2**-1076 using round-to-odd. treat_as_subnormal = shift < -1076 if treat_as_subnormal: shift = -1076 # Divide the fraction by 2**shift. if shift >= 0: denominator <<= shift else: numerator <<= -shift # Convert to the nearest integer, using round-to-odd. q, r = divmod(numerator, denominator) if r != 0 and q % 2 == 0: q += 1 # Now convert to the nearest float and shift back. if treat_as_subnormal: # Round to the nearest multiple of 4, rounding ties to # the nearest multiple of 8. This avoids double rounding # from the ldexp call below. q += [0, -1, -2, 1, 0, -1, 2, 1][q%8] return ldexp(float(q), shift) 
+4
source

This is not a complete answer, but looking around, I discovered that Zarit uses GMP internally. There is a GMP function called mpq_get_d that converts the rational to double. If it is not available directly in Zarith, it should be possible (given some time) to add an interface for it.

+2
source

Source: https://habr.com/ru/post/1235595/


All Articles