C++ eigen3 linear algebra library, odd results

I have been using the eigen3 linear algebra library in C++ for a while, and I have always tried to take advantage of vectorization for performance. Today I decided to check how much vectorization really speeds up my programs, so I wrote the following test program:

--- eigentest.cpp ---

#include <eigen3/Eigen/Dense>
#include <iostream>

using namespace Eigen;

int main()
{
    Matrix4d accumulator = Matrix4d::Zero();
    Matrix4d randMat  = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);

    for (int i = 0; i < 1000000; i++)
    {
        randMat     += constMat;
        accumulator += randMat * randMat;
    }

    std::cout << accumulator(0,0) << "\n"; // to avoid optimizing everything away
    return 0;
}

Then I compiled and ran this program with various compiler options. (The results are not a one-off; many runs give similar timings.)

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native
$ time ./eigentest
5.33334e+18

real    0m4.409s
user    0m4.404s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x
$ time ./eigentest
5.33334e+18

real    0m4.085s
user    0m4.040s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native -O3
$ time ./eigentest
5.33334e+18

real    0m0.147s
user    0m0.136s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -O3
$ time ./eigentest
5.33334e+18

real    0m0.025s
user    0m0.024s
sys     0m0.000s

And here is my relevant processor information:

model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dn

I know that there is no vectorization when I do not use -march=native, because without it I never get a segmentation fault or an incorrect result due to vectorization, unlike when I do use it (with -DNDEBUG).

These results make me believe that, at least on my processor, vectorization with eigen3 leads to slower execution. Whom should I blame? My CPU, eigen3, or gcc?

Edit: To remove any doubt, I have now also tried adding the -DEIGEN_DONT_ALIGN compiler flag in the cases where I measure the non-vectorized performance, and the results are the same. Furthermore, when I add -DEIGEN_DONT_ALIGN together with -march=native, the results become very close to those without -march=native.
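(For reference, the same thing can be done in the source instead of on the command line; a minimal sketch, my addition rather than part of the original test, assuming the macro is defined before the first Eigen header is included:)

// Equivalent to compiling with -DEIGEN_DONT_ALIGN: tells Eigen not to
// require aligned storage, which the vectorized fixed-size code paths
// rely on. Must appear before the first Eigen include.
#define EIGEN_DONT_ALIGN
#include <eigen3/Eigen/Dense>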

1 answer

It seems the compiler is smarter than you think, and it still optimizes away a lot of the work.

On my platform, I get about 9 ms without -march=native and about 39 ms with -march=native. However, if I replace the output line just before the return with

 std::cout<<accumulator<<"\n"; 

then the timings change to about 78 ms without -march=native and about 39 ms with -march=native.

So it seems that, without vectorization, the compiler realizes that you only use the (0,0) element of the matrix, and therefore it computes only that element. It cannot perform this optimization when vectorization is enabled.
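To make this concrete, here is roughly the scalar computation the optimizer is then free to reduce the loop to (an illustrative sketch, not what gcc literally emits): entry (0,0) of randMat*randMat only needs row 0 and column 0 of randMat.

#include <eigen3/Eigen/Dense>
#include <iostream>

using namespace Eigen;

int main()
{
    Matrix4d randMat  = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);

    double acc00 = 0.0;
    for (int i = 0; i < 1000000; i++)
    {
        randMat += constMat;
        // Entry (0,0) of randMat*randMat is the dot product of row 0
        // with column 0: 4 multiplies and 3 adds per iteration instead
        // of a full 4x4 matrix product.
        double entry00 = 0.0;
        for (int k = 0; k < 4; k++)
            entry00 += randMat(0, k) * randMat(k, 0);
        acc00 += entry00;
    }

    std::cout << acc00 << "\n";
    return 0;
}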

If you output the entire matrix, forcing the compiler to compute every entry, then vectorization speeds up the program by a factor of 2, as expected (though I am surprised to see it is exactly a factor of 2 in my timings).
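If you want to force the full computation but keep the output to a single number, one option (my suggestion, not from the question) is to print a reduction of the matrix instead, e.g. with Eigen's sum():

 std::cout << accumulator.sum() << "\n"; // every entry feeds the result, so none can be dropped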


Source: https://habr.com/ru/post/1403460/

