Efficient floating-point division with constant integer divisors

A recent question about whether compilers are allowed to replace floating-point division with floating-point multiplication inspired me to ask this question.

Under the strict requirement that the results after the code transformation be bit-identical to the actual division operation, it is trivial to see that for IEEE-754 binary arithmetic this is possible for divisors that are a power of two: as long as the reciprocal of the divisor is representable, multiplying by the reciprocal delivers results identical to the division. For example, multiplication by 0.5 can replace division by 2.0.

The question then arises for what other divisors such replacements work, assuming we allow any short instruction sequence that replaces the division but runs significantly faster, while delivering bit-identical results. In particular, fused multiply-add operations are allowed in addition to plain multiplication. In comments, I was pointed to the following relevant paper:

Nicolas Brisebarre, Jean-Michel Muller, and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.

The method proposed by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair z_h:z_l as follows: z_h = 1 / y, z_l = fma (-y, z_h, 1) / y. Later, a division q = x / y is then computed as q = fma (z_h, x, z_l * x). The paper derives various conditions that the divisor y must satisfy for this algorithm to be correct. As one readily observes, this algorithm has problems with infinities and zeros when the signs of head and tail differ. More importantly, it will deliver incorrect results for dividends x that are very small in magnitude, because the computation of the quotient tail, z_l * x, suffers from underflow.
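For concreteness, here is a small C sketch of that scheme, transcribed directly from the formulas above (this is my own transcription, not code from the paper; div_bmr is a made-up name, and fmaf() is assumed to compile to a true fused multiply-add):

 #include <math.h>

 /* Sketch of the head-tail scheme described above, for binary32 operands.
    In a compiler, zh and zl would be precomputed once per known divisor y. */
 static float div_bmr (float x, float y)
 {
     float zh = 1.0f / y;                 /* head of the reciprocal */
     float zl = fmaf (-y, zh, 1.0f) / y;  /* tail of the reciprocal */
     return fmaf (zh, x, zl * x);         /* q = fma (zh, x, zl * x) */
 }

As noted above, this form misbehaves for zero and infinite x when the head and tail differ in sign, and zl * x can underflow for tiny x.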

The paper also makes reference to an alternative FMA-based division algorithm, first proposed by Peter Markstein when he was at IBM. The relevant reference is:

P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 111-119.

In the Markstein algorithm, one first computes the reciprocal rc, from which an initial quotient q = x * rc is formed. The residual of the division is then computed exactly with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).

This algorithm likewise has problems for x that are zeros or infinities (easily worked around with appropriate conditional execution), but exhaustive testing with IEEE-754 single-precision float data shows that it delivers the correct quotient across all possible dividends x for many divisors y, among them many small integers. This C code implements it:

 /* precompute reciprocal */
 rc = 1.0f / y;

 /* compute quotient q = x/y */
 q = x * rc;
 if ((x != 0) && (!isinf(x))) {
     r = fmaf (-y, q, x);
     q = fmaf (r, rc, q);
 }

On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. To give a concrete example: for division by 3.0f, the nvcc compiler of CUDA 7.5 generates the following machine code for a Kepler-class GPU:

         LDG.E R5, [R2];                          // load x
         FSETP.NEU.AND P0, PT, |R5|, +INF, PT;    // pred0 = fabsf(x) != INF
         FMUL32I R2, R5, 0.3333333432674408;      // q = x * (1.0f/3.0f)
         FSETP.NEU.AND P0, PT, R5, RZ, P0;        // pred0 = (x != 0.0f) && (fabsf(x) != INF)
         FFMA R5, R2, -3, R5;                     // r = fmaf (q, -3.0f, x)
         MOV R4, R2;                              // q
    @P0  FFMA R4, R5, c[0x2][0x0], R2;            // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
         ST.E [R6], R4;                           // store q

For my experiments, I wrote the tiny C test program shown below that steps through integer divisors in ascending order and, for each of them, exhaustively checks the above code sequence against the proper division. It prints a list of the divisors that pass this exhaustive test. Partial output looks as follows:

 PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69, 

Incorporating the replacement algorithm into a compiler as an optimization requires more than a whitelist of divisors to which the above code transformation can safely be applied; that would be impractical. The program output so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or powers of two. Anecdotal evidence, not a proof, of course.

What set of mathematical conditions can determine a priori whether transforming a division into the above code sequence is safe? Answers may assume that all floating-point operations are performed in the default rounding mode of "round to nearest or even".

 #include <stdlib.h>
 #include <stdio.h>
 #include <math.h>

 int main (void)
 {
     float r, q, x, y, rc;
     volatile union {
         float f;
         unsigned int i;
     } arg, res, ref;
     int err;

     y = 1.0f;
     printf ("PASS: ");
     while (1) {
         /* precompute reciprocal */
         rc = 1.0f / y;

         arg.i = 0x80000000;
         err = 0;
         do {
             /* do the division, fast */
             x = arg.f;
             q = x * rc;
             if ((x != 0) && (!isinf(x))) {
                 r = fmaf (-y, q, x);
                 q = fmaf (r, rc, q);
             }
             res.f = q;
             /* compute the reference, slowly */
             ref.f = x / y;
             if (res.i != ref.i) {
                 err = 1;
                 break;
             }
             arg.i--;
         } while (arg.i != 0x80000000);
         if (!err) printf ("%g, ", y);
         y += 1.0f;
     }
     return EXIT_SUCCESS;
 }
+45
c math floating-point algorithm division
Feb 20 '16 at 19:01
4 answers

Let me restart for the third time. We are trying to speed up

  q = x / y 

where y is an integer constant, and q , x , and y are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c) denotes a fused multiply-add a * b + c on binary32 values.

The naive algorithm uses a precomputed reciprocal,

  C = 1.0f / y 

so that at run time a (much faster) multiplication suffices:

  q = x * C 

The Brisebarre-Muller-Raina acceleration uses two precomputed constants,

 zh = 1.0f / y
 zl = -fmaf(zh, y, -1.0f) / y

so that at run time one multiplication and one fused multiply-add suffice:

  q = fmaf(x, zh, x * zl) 

The Markstein algorithm combines the naive approach with two fused multiply-adds that yield the correct result if the naive approach yields a result within 1 unit in the last place, by precomputing

 C1 = 1.0f / y
 C2 = -y

so that the division can be computed with one multiplication and two fused multiply-adds:

 t1 = x * C1
 t2 = fmaf(C2, t1, x)
 q  = fmaf(C1, t2, t1)



The naive approach works for all powers of two y , but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x .
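To make such failure counts reproducible, a counting harness along these lines can be used (my own sketch, modeled on the test program from the question; it only scans nonnegative dividends, which by symmetry gives the same proportion):

 #include <stdint.h>
 #include <string.h>

 /* Count the finite nonnegative dividends x for which the naive x * (1/y)
    differs bitwise from the correctly rounded x / y. */
 static uint32_t count_naive_failures (float y)
 {
     const float c = 1.0f / y;
     uint32_t fails = 0;
     for (uint32_t i = 0; i < 0x7f800000u; i++) {  /* +0.0f up to, not including, +INF */
         float x, q, ref;
         memcpy (&x, &i, sizeof x);
         q   = x * c;
         ref = x / y;
         if (memcmp (&q, &ref, sizeof q) != 0)
             fails++;
     }
     return fails;
 }

Replacing the naive line with the Brisebarre-Muller-Raina or Markstein sequence gives the corresponding counts for those variants.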

The Brisebarre-Muller-Raina approach likewise fails for almost all non-power-of-two y , but far fewer x yield an incorrect result (less than half a percent of all possible x , varying depending on y ).

The Brisebarre-Muller-Raina paper shows that the maximum error of the naive approach is ±1.5 ULP.

The Markstein approach yields correct results for powers of two y , and also for odd integer y . (I have not found a failing odd integer divisor for the Markstein approach.)




For the Markstein approach, I analyzed divisors 1 - 19700 ( raw data here ).

Plotting the number of failure cases (divisor on the horizontal axis, the number of x values for which the Markstein approach fails for that divisor), we can see a simple pattern emerge:

Markstein Failure Cases http://www.nominal-animal.net/answers/markstein.png

Note that these plots have both horizontal and vertical axes logarithmic. There are no points for odd divisors, since the approach yields correct results for all odd divisors I tested.

If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern: Markstein failure cases, bit-reversed divisor http://www.nominal-animal.net/answers/markstein-failures.png

If we draw a straight line through the center of the point sets, we get the curve 4194304/x. (Note that the plot considers only half of the possible floats, so when considering all possible floats, double it.) 8388608/x and 2097152/x bracket the entire error pattern completely.

Thus, if we use rev(y) to compute the bit reverse of divisor y , then 8388608/rev(y) is a good first-order approximation of the number of cases (out of all possible floats) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y . (Or, 16777216/rev(y) for an upper limit.)
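For reference, here is a small helper implementing such a bit reverse for integer divisors (my own sketch, on the assumption that rev() reverses the significant binary digits, matching the example 0b11101101 → 0b10110111 above):

 /* Reverse the significant binary digits of a positive integer,
    e.g. 0b11101101 -> 0b10110111. */
 static unsigned int rev (unsigned int v)
 {
     unsigned int r = 0;
     while (v) {
         r = (r << 1) | (v & 1u);
         v >>= 1;
     }
     return r;
 }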

Added 2016-02-28: I found an approximation for the number of error cases with the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:

 function markstein_failure_estimate(divisor):
     if (divisor is zero)
         return no estimate
     if (divisor is not an integer)
         return no estimate

     if (divisor is negative)
         negate divisor

     # Consider, for avoiding underflow cases:
     #     if (divisor is very large, say 1e+30 or larger)
     #         return no estimate - do as division

     while (divisor > 16777216)
         divisor = divisor / 2

     if (divisor is a power of two)
         return 0

     if (divisor is odd)
         return 0

     while (divisor is not odd)
         divisor = divisor / 2

     # Use return (1 + 8388608 / divisor) / 2
     # if only nonnegative finite float divisors are counted!
     return 1 + 8388608 / divisor
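A direct C transcription of this pseudocode could look as follows (my own translation, not part of the original answer; the divisor is assumed to be an integer value exactly representable in binary32, and -1 stands for "no estimate"):

 #include <math.h>

 static double markstein_failure_estimate (double divisor)
 {
     if (divisor == 0.0 || isnan (divisor) || isinf (divisor) ||
         divisor != floor (divisor))
         return -1.0;                         /* no estimate */
     if (divisor < 0.0)
         divisor = -divisor;
     if (divisor >= 1e30)
         return -1.0;                         /* very large: underflow issues, keep the division */
     while (divisor > 16777216.0)
         divisor *= 0.5;                      /* exact halving */
     if (fmod (divisor, 2.0) != 0.0)          /* odd divisor: no failures */
         return 0.0;
     double d = divisor;
     while (d > 1.0 && fmod (d, 2.0) == 0.0)  /* strip factors of two */
         d *= 0.5;
     if (d == 1.0)                            /* divisor was a power of two: no failures */
         return 0.0;
     return 1.0 + 8388608.0 / d;              /* counting all finite dividends */
 }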

This yields an error estimate correct to within ±1 for the Markstein failure cases I have tested (but I have not yet adequately checked divisors larger than 8388608). The final division should round so that it does not report false zeros, but I cannot guarantee that (yet). It does not take into account very large divisors (say 0x1p+100, or 1e+30 or larger in magnitude) which have underflow issues; I would definitely exclude such divisors from the acceleration anyway.

In preliminary testing, the estimate seems uncannily accurate. I did not plot a graph comparing the estimates and the actual errors for divisors 1 to 20000, because all the points coincide exactly in the plots. (Within this range, the estimate is either exact or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.




The pattern of failures for the Markstein approach is regular and very interesting. The approach works for all power-of-two divisors and all odd integer divisors.

For divisors larger than 16777216, I consistently see the same errors as for the divisor divided by the smallest power of two that yields a value below 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two changes.)

Assuming my test bench is not at fault, this means that the Markstein approach also works for divisors larger than 16777216 (but smaller than, say, 1e+30), if the divisor, when divided by the smallest power of two that yields a quotient smaller than 16777216 in magnitude, yields an odd quotient.

+6
Feb 27 '16 at 2:15

This question asks for a way to identify the values of the constant Y that make it safe to transform x / Y into a cheaper computation using FMA for all possible values of x . Another approach is to use static analysis to determine an over-approximation of the values x can take, so that the otherwise unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not occur.

Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a simple forwards analysis starting from the beginning of the function can provide useful information. For example:

 float f(float z) {
     float x = 1.0f + z;
     float r = x / Y;
     return r;
 }

Assuming the default round-to-nearest mode(*), in the above function x can only be NaN (if the input is NaN), +0.0f, or a number larger than 2^-24 in magnitude, but not -0.0f or anything closer to zero than 2^-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y .

(*) an assumption without which many optimizations are impossible anyway, and which C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON




A forwards static analysis that predicts the information above for x can be based on a representation of the sets of floating-point values that an expression can take, as a tuple of the following (one possible C encoding is sketched after the list):

  • a representation for the set of possible NaN values (one simple choice is to use a single boolean, with true meaning some NaNs may be present, and false indicating the absence of NaN);
  • four boolean flags indicating respectively the possible presence of +inf, -inf, +0.0, -0.0;
  • an inclusive range of negative finite floating-point values, and
  • an inclusive range of positive finite floating-point values.
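As a sketch, such a tuple could be encoded in C like this (all field names are mine, purely illustrative):

 #include <stdbool.h>

 /* Abstract value for a binary32 expression, per the description above. */
 typedef struct {
     bool  maybe_nan;              /* some NaN may be present                   */
     bool  has_pinf, has_minf;     /* +inf / -inf may be present                */
     bool  has_pzero, has_mzero;   /* +0.0 / -0.0 may be present                */
     float neg_lo, neg_hi;         /* inclusive range of negative finite values */
     float pos_lo, pos_hi;         /* inclusive range of positive finite values */
 } float_set;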

For this approach to work, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition between sets of values U and V, used to handle + in the analyzed code, can be implemented as:

  • If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
  • If 0 cannot be the result of an addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained by the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
  • If 0 can be the result of an addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
    • if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
    • if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
    • if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
    • if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
  • Do the same work if 0 can be the result of an addition of a negative value of U and a positive value of V.

Acknowledgement: the above borrows ideas from "Improving the Floating Point Addition and Subtraction Constraints" by Bruno Marre and Claude Michel.




Example: compiling the function f below:

 float f(float z, float t) {
     float x = 1.0f + z;
     if (x + t == 0.0f) {
         float r = x / 6.0f;
         return r;
     }
     return 0.0f;
 }

The approach in the question refuses to transform the division in function f into an alternate form, because 6 is not one of the values for which the division can be unconditionally transformed. Instead, what I am proposing is to apply a simple value analysis starting from the beginning of the function, which in this case determines that x is a finite float that is either +0.0f or at least 2^-24 in magnitude, and to use this information to apply the Brisebarre et al transformation, confident in the knowledge that x * C2 does not underflow.

To be explicit, I suggest using an algorithm such as the one below to decide whether or not to transform the division into something simpler:

  • Is Y one of the values that can be transformed with the Brisebarre et al method according to their algorithm?
  • Do C1 and C2 of their method have the same sign, or can the possibility that the dividend is infinite be excluded?
  • Do C1 and C2 of their method have the same sign, or can the dividend only take one of the two representations of 0? If, in the case where C1 and C2 have different signs, the dividend can only be one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation so that it produces the correct zero when x is zero.
  • Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that x * C2 underflows?

If the answer to all four questions is yes, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2, 3 and 4.
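As a sketch, the decision could be wired up like this (all parameter names are mine; questions 2 to 4 would be answered by the static analysis described above):

 #include <stdbool.h>

 /* Sketch of the proposed decision procedure for transforming x / Y. */
 static bool can_transform_division (bool y_transformable_per_brisebarre, /* question 1 */
                                     bool c1_c2_same_sign,
                                     bool dividend_never_infinite,        /* for question 2 */
                                     bool dividend_single_zero_only,      /* for question 3 */
                                     bool no_underflow_in_x_times_c2)     /* question 4 */
 {
     if (!y_transformable_per_brisebarre)
         return false;
     if (!c1_c2_same_sign && !dividend_never_infinite)
         return false;
     /* When the signs differ and only one zero representation can occur,
        the FMA signs must additionally be fiddled with, per footnote (**) below. */
     if (!c1_c2_same_sign && !dividend_single_zero_only)
         return false;
     if (!no_underflow_in_x_times_c2)
         return false;
     return true;
 }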

(**) "messing around with signs" means using -FMA (-C1, x, (-C2) * x) instead of FMA (C1, x, C2 * x), when it is necessary to create the result, it turns out correctly when x can only be one of two signed zeros

+7
Feb 21 '16 at 11:58

The result of a floating-point division consists of:

  • a sign flag
  • a significand
  • an exponent
  • a set of flags (overflow, underflow, inexact, etc.; see fenv() )

Getting the first 3 pieces right but the set of flags wrong is not enough. Without further knowledge (e.g. which parts of the result actually matter, the possible values of the dividend, etc.), I would assume that replacing division by a constant with multiplication by a constant (and/or a convoluted FMA mess) is almost never safe.

In addition, for modern CPUs I would not assume that replacing a division with 2 FMAs is always an improvement. For example, if the bottleneck is instruction fetch/decode, then this "optimization" would make performance worse. For another example, if subsequent instructions do not depend on the result (the CPU can execute many other instructions in parallel while waiting for the result), the FMA version may introduce additional dependency stalls and make performance worse. For a third example, if all registers are in use, the FMA version (which requires additional live variables) may increase register spilling and make performance worse.

Note that (in many, but not all, cases) division or multiplication by a power of 2 can be done with addition alone (specifically, adding a shift count to the exponent).
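A minimal illustration of that exponent-addition idea (my own sketch; it assumes a normal, finite operand and result, and ignores zeros, denormals, infinities, NaNs, and exponent overflow):

 #include <stdint.h>
 #include <string.h>

 /* Multiply a normal binary32 value by 2^n by adding n to its exponent field. */
 static float scale_by_pow2 (float x, int n)
 {
     uint32_t bits;
     memcpy (&bits, &x, sizeof bits);   /* reinterpret the float as an integer */
     bits += (uint32_t) n << 23;        /* the exponent field starts at bit 23 */
     memcpy (&x, &bits, sizeof bits);
     return x;
 }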

+1
Feb 21 '16 at 22:45

@Pascal: not a full answer, just a related observation.

The short version: what matters is which divisors are exactly representable.

In binary floating point, the only divisions that can be replaced exactly by a multiplication with the reciprocal are of the form:

 x / 2^n

A divisor written as a decimal (base-10) fraction corresponds to something of the form:

 x / (2^n * 5^m)

and only when m == 0 is the FP reciprocal exact, so only those cases can be replaced without error.

Of all the (2-digit) decimal fractions from .01 to 0.99, only:

 .25 .50 .75

qualify. (Not many, lol.)

+1
Feb 22 '16 19:25


