I have been using the Eigen3 linear algebra library in C++ for a while, and I have always tried to take advantage of the performance benefits of vectorization. Today I decided to check how much vectorization really speeds up my programs, so I wrote the following test program:
--- eigentest.cpp ---
#include <eigen3/Eigen/Dense>
using namespace Eigen;
#include <iostream>

int main()
{
    Matrix4d accumulator = Matrix4d::Zero();
    Matrix4d randMat = Matrix4d::Random();
    Matrix4d constMat = Matrix4d::Constant(2);
    for (int i = 0; i < 1000000; i++)
    {
        randMat += constMat;
        accumulator += randMat * randMat;
    }
    std::cout << accumulator(0,0) << "\n"; // To avoid optimizing everything away
    return 0;
}
Then I compiled this program with various compiler options and ran it. (These are not one-off results; many runs give similar timings.)
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native
$ time ./eigentest
5.33334e+18

real    0m4.409s
user    0m4.404s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x
$ time ./eigentest
5.33334e+18

real    0m4.085s
user    0m4.040s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native -O3
$ time ./eigentest
5.33334e+18

real    0m0.147s
user    0m0.136s
sys     0m0.000s

$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -O3
$ time ./eigentest
5.33334e+18

real    0m0.025s
user    0m0.024s
sys     0m0.000s
And here is my relevant processor information:
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dn
I know that no vectorization takes place when I do not use the -march=native compiler option, because without it I never get a segmentation fault or an incorrect result caused by vectorization, unlike when I do use it (with -DNDEBUG).
These results lead me to believe that, at least on my CPU, vectorization with Eigen3 results in slower execution. Who should I blame: my CPU, Eigen3, or gcc?
Edit: To rule out any doubt, I also tried adding the -DEIGEN_DONT_ALIGN compiler option when measuring the case without vectorization, and the results are the same. Also, when I add -DEIGEN_DONT_ALIGN together with -march=native, the results are very close to those without -march=native.