A recent question about whether compilers are allowed to replace floating-point division with floating-point multiplication inspired me to ask this question.
Under the strict requirement that the results after the code transformation be bit-identical to the actual division operation, it is trivial to see that for IEEE-754 binary arithmetic this is possible for divisors that are a power of two: as long as the reciprocal of the divisor is representable, multiplying by that reciprocal delivers results identical to division. For example, multiplication by 0.5 can replace division by 2.0.
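As a quick illustration of the power-of-two case, here is a minimal sketch (the operand value and variable names are arbitrary choices of mine) that compares the two bit patterns for a single operand:

#include <stdio.h>

int main (void)
{
    volatile union { float f; unsigned int i; } quo, prod;
    float x = 1.3f;        /* arbitrary test operand */
    quo.f  = x / 2.0f;     /* actual division */
    prod.f = x * 0.5f;     /* multiplication by the exactly representable reciprocal */
    printf ("x/2.0f = %08x   x*0.5f = %08x   %s\n", quo.i, prod.i,
            (quo.i == prod.i) ? "bit-identical" : "different");
    return 0;
}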
The question then arises which other divisors admit such a replacement, assuming we allow any short instruction sequence that replaces the division and runs significantly faster, while delivering bit-identical results. In particular, fused multiply-add operations are allowed in addition to plain multiplication. In the comments, I pointed to the following relevant paper:
Nicolas Brisebarre, Jean-Michel Muller and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.
The method proposed by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair zh:zl as follows: zh = 1 / y, zl = fma (-y, zh, 1) / y. The division q = x / y is later computed as q = fma (zh, x, zl * x). The paper derives various conditions that the divisor y must satisfy for this algorithm to be valid. As one can readily see, this algorithm has problems with infinities and zeros when the signs of the head and the tail differ. More importantly, it will not deliver correct results for dividends x of very small magnitude, because the computation of the quotient tail, zl * x, suffers from underflow.
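For concreteness, here is a minimal sketch of that precomputation and division step as just described (function and variable names are mine; the infinity/zero and underflow issues noted above are not handled):

#include <math.h>

static float zh, zl;   /* head and tail of the precomputed reciprocal of y */

/* precompute the normalized head-tail reciprocal zh:zl of the divisor y */
static void precompute_reciprocal (float y)
{
    zh = 1.0f / y;                   /* head: rounded reciprocal */
    zl = fmaf (-y, zh, 1.0f) / y;    /* tail: residual error of the head */
}

/* compute q = x / y as fma (zh, x, zl * x); no special-case handling */
static float fast_div (float x)
{
    return fmaf (zh, x, zl * x);
}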
The paper also references an alternative FMA-based division algorithm, first proposed by Peter Markstein when he was at IBM. The relevant reference is:
P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 111-119.
In the Markstein algorithm, one first computes a reciprocal rc, from which an initial quotient q = x * rc is formed. The remainder of the division is then computed exactly with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).
This algorithm also has issues for dividends x that are zeros or infinities (easily worked around with appropriate conditional execution), but exhaustive testing with IEEE-754 single-precision data shows that it delivers the correct quotient for all possible dividends x for many divisors y, among them many small integers. This C code implements it:
rc = 1.0f / y;               /* precompute reciprocal of the divisor */
q = x * rc;                  /* initial quotient */
if ((x != 0) && (!isinf(x))) {
    r = fmaf (-y, q, x);     /* exact remainder via FMA */
    q = fmaf (r, rc, q);     /* refined quotient */
}
On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. As a concrete example, for division by 3.0f, CUDA 7.5's nvcc compiler generates the following machine code for a Kepler-class GPU:
LDG.E R5, [R2];                        // load x
FSETP.NEU.AND P0, PT, |R5|, +INF, PT;  // pred0 = fabsf(x) != INF
FMUL32I R2, R5, 0.3333333432674408;    // q = x * (1.0f/3.0f)
FSETP.NEU.AND P0, PT, R5, RZ, P0;      // pred0 = (x != 0.0f) && (fabsf(x) != INF)
FFMA R5, R2, -3, R5;                   // r = fmaf (q, -3.0f, x)
MOV R4, R2;                            // q
@P0 FFMA R4, R5, c[0x2][0x0], R2;      // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
ST.E [R6], R4;                         // store q
For my experiments, I wrote the tiny C test program shown below, which steps through integer divisors in ascending order and, for each of them, exhaustively checks the above code sequence against proper division. It prints a list of the divisors that pass this exhaustive test. Partial output looks as follows:
PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69,
To incorporate the replacement algorithm into a compiler as an optimization, a whitelist of divisors to which the above code transformation can safely be applied is impractical. The output of the program so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or powers of two. Anecdotal evidence, not proof, of course.
What set of mathematical conditions can determine a priori whether the transformation of division into the above code sequence is safe? Answers may assume that all floating-point operations are performed in the default rounding mode of round-to-nearest-or-even.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main (void)
{
    float r, q, x, y, rc;
    volatile union {
        float f;
        unsigned int i;
    } arg, res, ref;
    int err;

    y = 1.0f;
    printf ("PASS: ");
    while (1) {
        /* precompute reciprocal */
        rc = 1.0f / y;

        arg.i = 0x80000000;
        err = 0;
        do {
            /* do the division, fast */
            x = arg.f;
            q = x * rc;
            if ((x != 0) && (!isinf(x))) {
                r = fmaf (-y, q, x);
                q = fmaf (r, rc, q);
            }
            res.f = q;
            /* compute the reference, slowly */
            ref.f = x / y;
            if (res.i != ref.i) {
                err = 1;
                break;
            }
            arg.i--;
        } while (arg.i != 0x80000000);
        if (!err) printf ("%g, ", y);
        y += 1.0f;
    }
    return EXIT_SUCCESS;
}