You have picked one of the slowest possible ways to check that
c*c == a*a + b*b // assuming c is non-negative
Done with integer math, that condition compiles to three integer multiplications (one of which can be hoisted out of the inner loop). Even without pow(), your version still converts to double and takes a square root, which is terrible for throughput. (And for latency too, but branch prediction + speculative execution on modern CPUs means latency isn't a factor here.)
Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version kept the FP sqrt execution unit saturated, it would still be about 4x slower than what I got gcc to emit (below).
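For concreteness, here is a sketch of the two checks side by side (with c = SUMTOTAL - a - b; the sqrt version is my guess at the shape of the code being criticized, not a quote of it):

#include <cmath>

// Slow: int -> double conversion plus a square root on every iteration.
bool is_triple_sqrt(int a, int b, int c) {
    return std::sqrt(static_cast<double>(a*a + b*b)) == c;
}

// Fast: three integer multiplies, and a*a can be hoisted out of the loop over b.
bool is_triple_int(int a, int b, int c) {
    return c*c == a*a + b*b;    // assuming c is non-negative
}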
You can also optimize the loop condition to exit the loop when the b < c part of the condition becomes false, so the compiler only has to emit one version of that check.
void foo_optimized()
{
    for (int a = 1; a <= SUMTOTAL; a++) {
        for (int b = a+1; b < SUMTOTAL-a-b; b++) {
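            // body reconstructed -- the original snippet is truncated at this point
            const int c = SUMTOTAL - a - b;
            if (a*a + b*b == c*c)                   // matches are very rare
                std::cout << a << ' ' << b << ' ' << c << '\n';
        }
    }
}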
This compiles (with gcc6.2 -O3 -mtune=haswell) to code with an inner loop like the one below. (Full source + asm on the Godbolt compiler explorer.)
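The exact output depends on the compiler version; this hand-written sketch (register assignments are my assumptions, not gcc's verbatim output) shows the shape of that loop:

.L3:                               # do {
    add     ebx, 1                 #   b++
    mov     ecx, r8d               #   r8d holds SUMTOTAL - a
    sub     ecx, ebx               #   c = SUMTOTAL - a - b
    mov     edx, ebx
    imul    edx, ebx               #   b*b
    add     edx, esi               #   a*a + b*b   (a*a hoisted into esi)
    mov     eax, ecx
    imul    eax, ecx               #   c*c
    cmp     edx, eax
    je      .Lmatch                #   rarely-taken branch: found a triple
    cmp     ebx, ecx
    jl      .L3                    # } while (b < c)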
On Intel Haswell, these instructions are all 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, and since the front-end can issue 4 fused-domain uops per clock, the loop can issue at one iteration per 2.5 cycles.
Haswell runs imul r32, r32 with a throughput of one per clock, so the two multiplies inside the inner loop don't come close to saturating port 1 at two multiplies per 2.5 cycles. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.
We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only problem, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.
Loop unrolling could help cut the loop overhead per check here. For example, use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without needing a MOV to make it non-destructive.
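In asm that could look something like this (a sketch, with the same caveat about register names):

    lea     ecx, [rbx+1]           # b+1 for the next iteration, computed up front...
    imul    ebx, ebx               # ...so b*b can clobber ebx with no extra mov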
A strength reduction is also possible. Given b*b, we could try to compute (b-1)*(b-1) without an IMUL: (b-1)*(b-1) = b*b - 2*b + 1, so maybe we could do lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There's no addressing mode that subtracts instead of adding. Hmm, maybe we could keep -b in a register and count it up toward zero, so we could use lea ecx, [rcx + rbx*2 + 1] to update b*b in ECX, given -b in EBX.)
Unless you're actually bottlenecked on IMUL throughput, this risks taking more instructions and being a loss rather than a win. It could be fun to see how well a compiler would do with this strength reduction written into the C++ source.
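A sketch of what that might look like (my illustration, not the original code; SUMTOTAL's value and the output format are assumptions):

#include <iostream>

const int SUMTOTAL = 1000;                   // assumed value for the sum a+b+c

void foo_strength_reduced() {
    for (int a = 1; a <= SUMTOTAL; a++) {
        const int a2 = a * a;                // hoisted: constant over the inner loop
        int b2 = (a+1) * (a+1);              // running value of b*b
        for (int b = a+1; b < SUMTOTAL-a-b; b++) {
            const int c = SUMTOTAL - a - b;
            if (a2 + b2 == c*c)              // c*c still costs an IMUL per iteration
                std::cout << a << ' ' << b << ' ' << c << '\n';
            b2 += 2*b + 1;                   // (b+1)*(b+1) = b*b + 2*b + 1: no IMUL for b*b
        }
    }
}

Whether the compiler turns the 2*b + 1 update into a single LEA, and whether that actually beats a 1-per-clock IMUL, is exactly the kind of thing worth checking on Godbolt.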
You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check whether any of the 8 comparisons found one, and then sort out which b it was in the rare case of a match.
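As a sketch (mine, not from the original; the function name and returning a lane mask are assumptions), an AVX2 version of the check for 8 consecutive b values could look like:

#include <immintrin.h>

// Returns a nonzero lane mask if any of b0..b0+7 satisfies a*a + b*b == c*c,
// with c = sumtotal - a - b. Needs AVX2 for _mm256_mullo_epi32 / _mm256_cmpeq_epi32.
int check8(int a, int b0, int sumtotal) {
    __m256i b   = _mm256_add_epi32(_mm256_set1_epi32(b0),
                                   _mm256_setr_epi32(0,1,2,3,4,5,6,7));
    __m256i c   = _mm256_sub_epi32(_mm256_set1_epi32(sumtotal - a), b);
    __m256i lhs = _mm256_add_epi32(_mm256_set1_epi32(a*a),
                                   _mm256_mullo_epi32(b, b));       // a*a + b*b
    __m256i rhs = _mm256_mullo_epi32(c, c);                         // c*c
    __m256i eq  = _mm256_cmpeq_epi32(lhs, rhs);
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));             // one bit per lane
}

The scalar caller would advance b by 8, re-run the scalar check only when the mask is nonzero, and handle the tail where fewer than 8 values remain before b reaches c.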
See also the x86 tag wiki for more optimization information.