Why is pow(int, int) so slow?

I am working on several Project Euler exercises to improve my knowledge of C++.

I wrote the following function:

 int a = 0, b = 0, c = 0;
 for (a = 1; a <= SUMTOTAL; a++) {
     for (b = a+1; b <= SUMTOTAL-a; b++) {
         c = SUMTOTAL-(a+b);
         if (c == sqrt(pow(a,2)+pow(b,2)) && b < c) {
             std::cout << "a: " << a << " b: " << b << " c: " << c << std::endl;
             std::cout << a * b * c << std::endl;
         }
     }
 }

This runs in 17 milliseconds.

However, if I change the line

 if (c == sqrt(pow(a,2)+pow(b,2)) && b < c) 

to

 if (c == sqrt((a*a)+(b*b)) && b < c) 

the calculation takes 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I am missing which makes the first expression compute so much slower?

+45
c++ performance cmath pow
Dec 10 '16 at 6:19
2 answers

pow() works on real floating-point numbers and under the hood uses the formula

 pow(x,y) = e^(y log(x)) 

to compute x^y . The int arguments are converted to double before pow is called. ( log is the natural logarithm, base e .)

Computing x^2 with pow() therefore goes through this floating-point machinery and is slower than x*x .

Edit based on relevant comments

  • Using pow even with integer exponents can give incorrect results (PaulMcKenzie)
  • On top of being double-based math, pow is a function call (while x*x is not) (jtbandes)
  • Many modern compilers do in fact optimize away pow with constant integer arguments, but this should not be relied upon.
+67
Dec 10 '16 at 6:23

You've chosen one of the slowest possible ways to check

 c*c == a*a + b*b // assuming c is non-negative 

This compiles to three integer multiplications (one of which can be hoisted out of the loop). Even without pow() , you're still converting to double and taking a square root, which is terrible for throughput. (And also latency, but branch prediction + speculative execution on modern CPUs means that latency isn't a factor here.)

The SQRTSD instruction on Intel Haswell has a throughput of one per 8-14 cycles ( source: Agner Fog's instruction tables ), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4x slower than what I got gcc to emit (below).




You can also optimize the loop condition to exit the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.

 void foo_optimized()
 {
   for (int a = 1; a <= SUMTOTAL; a++) {
     for (int b = a+1; b < SUMTOTAL-a-b; b++) {
       // int c = SUMTOTAL-(a+b);   // gcc won't always transform signed-integer math,
       int c = (SUMTOTAL-a) - b;    // so this prevents hoisting (SUMTOTAL-a) :(

       // if (b >= c) break;        // just changed the loop condition instead

       // the compiler can hoist a*a out of the loop for us
       if (/* b < c && */ c*c == a*a + b*b) {
         // Just print a newline.  std::endl also flushes, which bloats the asm
         std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
         std::cout << a * b * c << '\n';
       }
     }
   }
 }

This compiles (with gcc6.2 -O3 -mtune=haswell ) to code with the following inner loop. Full code on the Godbolt compiler explorer.

  # a*a is hoisted out of the loop.  It's in r15d
 .L6:
     add     ebp, 1       # b++
     sub     ebx, 1       # c--
     add     r12d, r14d   # ivtmp.36, ivtmp.43
                          # not sure what this is or why it's in the loop; would
                          # have to look again at the asm outside the loop
     cmp     ebp, ebx     # b, _39
     jg      .L13         ## This is the loop-exit branch, not taken until the end
                          ## .L13 is the rest of the outer loop.
                          ##  It sets up for the next entry to this inner loop.
 .L8:
     mov     eax, ebp     # multiply a copy of the counters
     mov     edx, ebx
     imul    eax, ebp     # b*b
     imul    edx, ebx     # c*c
     add     eax, r15d    # a*a + b*b
     cmp     edx, eax     # tmp137, tmp139
     jne     .L6
  ## Fall-through into the cout print code when we find a match
  ## extremely rare, so it should predict near-perfectly

On Intel Haswell, each of these instructions is 1 uop. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, which can issue at one iteration per 2.5 cycles .
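The 2.5-cycle figure follows from Haswell's 4-wide issue:

```latex
\text{cycles per iteration} \;=\; \frac{10 \ \text{fused-domain uops}}{4 \ \text{uops issued per cycle}} \;=\; 2.5
```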

Haswell runs imul r32, r32 with one-per-clock throughput, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5 cycles. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.

We're not close to any other execution-port bottlenecks at all, so the frontend bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.

Loop unrolling would help amortize the loop overhead here. For example, use lea ecx, [rbx+1] to compute b + 1 for the next iteration, so we can imul ebx, ebx without needing a MOV to make it non-destructive.




Strength reduction is also possible here. Given b*b , we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1 , so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b . (There are no addressing modes that subtract instead of add. Hmm, maybe we could keep -b in a register and count up toward zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX.)

Unless you actually bottleneck on IMUL throughput, this could end up taking more uops and be a loss rather than a win. It might be fun to see how well a compiler would do with this strength reduction written in the C++ source.




You could probably also vectorize this with SSE or AVX , checking 4 or 8 consecutive values of b in parallel. Since hits are really rare, you just check whether any of the 8 had a hit, and then work out which one it was in the rare case that there was a match.

See also the x86 tag wiki for more optimization information.

+37
Dec 10 '16 at 17:50
