How slow (how many cycles) is computing a square root?

This came up in the context of molecular dynamics, where efficiency is important and unnecessary square roots had a noticeable effect on the running time of the algorithms.

+26
performance
asked Oct 11 '11
3 answers

From Agner Fog's instruction tables:

On Core 2 65nm, FSQRT takes 9 to 69 clock cycles (with almost equal reciprocal throughput), depending on the value bits and the precision. For comparison, FDIV takes 9 to 38 clock cycles (with almost equal reciprocal throughput), FMUL takes 5 (reciprocal throughput = 2) and FADD takes 3 (reciprocal throughput = 1). SSE performance is about the same, but it looks faster because it cannot do 80-bit math. SSE does have super-fast approximate reciprocal and approximate reciprocal sqrt instructions, though.

On Core 2 45nm, division and square root are faster: FSQRT takes 6 to 20 clock cycles, FDIV takes 6 to 21 clock cycles, and FADD and FMUL have not changed. Once again, SSE performance is about the same.

You can get the documents with this information from his site.

+21
Oct 11 '11 at 12:48

The square root is about 4 times slower than addition when compiled with -O2, or about 13 times slower without -O2. Elsewhere on the net I found estimates of 50-100 cycles, which may be true, but an absolute number is not the relative measure that is actually useful, so I wrote the code below to make a relative measurement. Let me know if you see any problems with the test code.

The code below was run on an Intel Core i3 under Windows 7 and was compiled in Dev-C++ (which uses GCC). Your mileage may vary.

 #include <cstdlib>
 #include <iostream>
 #include <cmath>
 #include <ctime>

 /* Output using -O2:
      1 billion square roots running time: 14738ms
      1 billion additions running time : 3719ms
    Output without -O2:
      10 million square roots running time: 870ms
      10 million additions running time : 66ms
    Results: square root is about 4 times slower than addition using -O2,
    or about 13 times slower without -O2. */

 int main(int argc, char *argv[])
 {
     const int cycles = 100000;
     const int subcycles = 10000;
     double squares[cycles];
     for ( int i = 0; i < cycles; ++i ) {
         squares[i] = rand();
     }

     std::clock_t start = std::clock();
     for ( int i = 0; i < cycles; ++i ) {
         for ( int j = 0; j < subcycles; ++j ) {
             squares[i] = std::sqrt(squares[i]);
         }
     }
     double time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;
     std::cout << "1 billion square roots running time: " << time_ms << "ms" << std::endl;

     start = std::clock();
     for ( int i = 0; i < cycles; ++i ) {
         for ( int j = 0; j < subcycles; ++j ) {
             squares[i] = squares[i] + squares[i];
         }
     }
     time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;
     std::cout << "1 billion additions running time : " << time_ms << "ms" << std::endl;

     system("PAUSE");
     return EXIT_SUCCESS;
 }
+11
Oct 11 '11

A square root takes only a handful of cycles, whereas a memory access can cost orders of magnitude more if the data is not in cache. So trying to avoid the computation by fetching precomputed results from memory can actually hurt performance.

It's hard to say in the abstract whether you will come out ahead, so if you want to know for sure, benchmark both approaches.

Here's a great talk about this by Eric Brumer, an MSVC compiler developer: http://channel9.msdn.com/Events/Build/2013/4-329

+6
Mar 19 '14 at 0:51


