Relative speed of floating-point addition versus floating-point multiplication

A decade or two ago, it was worthwhile to write numerical code to avoid multiplications and divisions and use additions and subtractions instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.

Is this still the case, or have modern computer architectures advanced to the point where *, / are no longer many times slower than +, -?

To be specific, I'm interested in compiled C/C++ code running on typical modern x86 chips with full on-chip floating-point hardware, not a small microcontroller doing FP in software. I realize that pipelining and other architectural enhancements preclude specific cycle counts, but I would still like a useful intuition.

+25
floating-point x86 mips flops numerical-computing
Jul 18 '09 at 1:49
6 answers

It also depends on the instruction mix. Your processor has several compute units available at any moment, and you get maximum throughput only if all of them are busy all the time. So a loop of muls runs as fast as a loop of adds - but the same does not hold once the expression becomes more complex.

For example, take this loop:

    for (int j = 0; j < NUMITER; j++) {
        for (int i = 1; i < NUMEL; i++) {
            bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i];
        }
    }

For NUMITER = 10^7, NUMEL = 10^2, with both arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles in a 64-bit process. If I replace the loop body with

 bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ; 

it takes only 1.7 seconds... so since we were bottlenecked on the adds, the muls were essentially free; and the reduction in adds helped. It gets more confusing:

 bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; 

- the same mul/add mix, but now the constant is added rather than multiplied - takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently; dot-product-like sums of muls, and scaled sums, are about as good as it gets; adding constants is not nearly as common, so that path is slower...

 bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/ 

takes 1.7 seconds again.

 bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/ 

(same as the first loop, but with the expensive literal-constant addition replaced by a variable: 2.1 seconds)

 bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/ 

(mostly muls, but one addition: 1.9 seconds)

So basically: it's hard to say which is faster, but if you want to avoid bottlenecks, it's more important to use a reasonable mix, to avoid NaN and INF, and to avoid adding literal constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make all the difference.

A few more cases:

    bla *= someval;                  // someval very near 1.0; takes 2.1 seconds
    bla *= arr1[i];                  // arr1[i] all very near 1.0; takes 66(!) seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 1.6 seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, 2.2 seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, floats, 2.2 seconds
    bla += someval * arr1[i] * arr2[i]; // 0.9 in x64, 1.6 in x86
    bla += someval * arr1[i];           // 0.55 in x64, 0.8 in x86
    bla += arr1[i] * arr2[i];           // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86
+20
Jul 18 '09 at 18:22

In theory, the information is here:

Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix C: INSTRUCTION LATENCY AND THROUGHPUT

For each processor they list, the latency of FMUL is very close to that of FADD. On some of the older processors, FDIV is 2-3 times slower than FMUL; on newer processors, it is the same as FMUL.

Cautions:

  • The document I'm referring to actually says you can't rely on these numbers in real life, since the processor will do whatever it wants to make things faster, as long as the result is correct.

  • There is a good chance your compiler will decide to use one of the many newer instruction sets that include a floating-point multiply / divide.

  • This is a complicated document really intended for compiler writers, and I may have gotten it wrong. I don't understand why the FDIV latency number is missing entirely for some of the CPUs.

+16
Jul 18 '09 at 17:22

The best way to answer this question is to actually write a benchmark of the processing you need to do. Empirical beats theoretical whenever it's available, especially when it's easy to obtain.

If you already know the various implementations of the math you need to do, you could write a few different code variants and see where your performance peaks. This will let the processor/compiler generate different execution streams to fill the processor pipelines, and give you a concrete answer to your question.

If you are interested in the performance of DIV/MUL/ADD/SUB-type instructions specifically, you could even drop in some inline assembly to control exactly which variants of those instructions are executed. However, you need to make sure you keep multiple execution units busy to get a good idea of the performance the system is capable of.

Also, doing something like this lets you compare performance across several processor variants by simply running the same program on them, and lets you factor in motherboard differences as well.

Edit:

The basic circuitry for + and - is essentially identical, so they logically take the same time to compute. *, on the other hand, requires several layers, typically built out of "full adders", to complete a single operation. This guarantees that, while * can accept a new operation into the pipeline every cycle, it will have higher latency than the add/subtract circuit. The fp / operation is typically implemented using an approximation method that iteratively converges on the correct answer over time. These kinds of approximations are usually implemented via multiplication. So for floating point you can generally assume division takes longer, because it is impractical to "unroll" the multiplications (which are already large circuits in and of themselves) into a pipeline of many multiplier circuits. Still, the performance of a given system is best measured by testing.

+7
Jul 18 '09 at 16:35
source share

I can't find a definitive reference, but extensive experimentation tells me that float multiplication these days is about the same speed as addition and subtraction, while division is not (though not "many times" slower either). You can get the intuition you want only by running your own experiments - remember to generate the random numbers (millions of them) in advance, read them in before you start timing, and use the CPU's performance counters (with no other processes running, as far as you can stop them) for accurate measurement!

+1
Jul 18 '09 at 2:15

The speed difference of * / versus + - depends on your processor architecture. In general, and on x86 in particular, the difference has become smaller on modern processors. * should be close to +; when in doubt, just experiment. If you have a really hard problem with lots of FP operations, also consider using your GPU (GeForce, ...), which works as a vector processor.

+1
Jul 18 '09 at 2:29

There is probably very little difference in time between multiplication and addition. Division, on the other hand, is still significantly slower than multiplication because of its iterative nature. On modern x86 architecture, SSE instructions should be considered for floating-point operations rather than the x87 FPU. A good C/C++ compiler should give you the option of using SSE instead of the FPU.
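As an illustration of the compiler option mentioned, with GCC the switch from x87 to SSE floating point looks something like this (flag spellings per GCC's x86 options; `foo.c` is a placeholder source file):

```shell
# GCC on 32-bit x86 defaults to the x87 FPU; these flags switch FP math to SSE.
# (On x86-64, SSE is already the default.)
gcc -O2 -msse2 -mfpmath=sse foo.c -o foo
```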

-1
Jul 18 '09 at 2:21


