It also depends on the instruction mix. Your processor has several execution units available at any moment, and you get maximum throughput when all of them are kept busy all the time. So a loop of muls runs just as fast as a loop of adds, but that no longer holds once the expression becomes more complex.
For example, take this loop:
for (int j = 0; j < NUMITER; j++) {
    for (int i = 1; i < NUMEL; i++) {
        bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i];
    }
}
With NUMITER = 10^7 and NUMEL = 10^2, and all four arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with
bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;
it only takes 1.7 seconds. Since we had "overdone" the additions, the muls were essentially free, and reducing the number of additions helped. It gets more confusing:
bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;
This has the same mul/add distribution, but now the constant is added rather than multiplied in, and it takes 3.7 seconds. Your processor is probably optimized to perform typical numerical computations efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets; adding constants is not nearly as common, so that is slower. With the constant held in a variable instead:
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/
takes 1.7 seconds again.
bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/
(same as the original loop, but without the costly constant addition: 2.1 seconds)
bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/
(mostly muls, but one addition: 1.9 seconds)
So basically, it is hard to say which is faster, but if you want to avoid bottlenecks, it is more important to have a sane instruction mix, avoid NaN or INF, and avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make the difference.
A few more cases:
bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i]; // arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, floats, 2.2 seconds
bla += someval * arr1[i] * arr2[i]; // 0.9 in x64, 1.6 in x86
bla += someval * arr1[i]; // 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i]; // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86