Performance regression with Eigen 3.3.0 vs 3.2.10?

We are in the process of porting our codebase over to Eigen 3.3 (quite an undertaking with all the 32-byte alignment issues). However, there are a few places where performance seems to have suffered badly, contrary to expectations (I was looking forward to some speedup, given the additional support for FMA and AVX...). These include eigenvalue decomposition and matrix * matrix.transpose() * vector products. I have written two minimal working examples to demonstrate.

All tests are performed on a modern Arch Linux system with an Intel Core i7-4930K CPU (3.40 GHz), compiled with g++ version 6.2.1.

1. Eigenvalue decomposition:

A simple self-adjoint eigenvalue decomposition takes more than twice as long with Eigen 3.3.0 as with 3.2.10.

File test_eigen_EVD.cpp:

    #define EIGEN_DONT_PARALLELIZE
    #include <Eigen/Dense>
    #include <Eigen/Eigenvalues>
    #define SIZE 200

    using namespace Eigen;

    int main (int argc, char* argv[]) {
      MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
      SelfAdjointEigenSolver<MatrixXf> eig;

      for (int n = 0; n < 1000; ++n)
        eig.compute (mat);

      return 0;
    }

Test results:

  • with Eigen 3.2.10:

     g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

     real    0m5.136s
     user    0m5.133s
     sys     0m0.000s
  • with Eigen 3.3.0:

     g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

     real    0m11.008s
     user    0m11.007s
     sys     0m0.000s

I am not sure what could be causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I would like to know about it!

2. matrix * matrix.transpose() * vector product:

This particular example takes over 200× longer with Eigen 3.3.0...

File test_eigen_products.cpp:

    #define EIGEN_DONT_PARALLELIZE
    #include <Eigen/Dense>
    #define SIZE 200

    using namespace Eigen;

    int main (int argc, char* argv[]) {
      MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
      VectorXf vec = VectorXf::Random(SIZE);

      for (int n = 0; n < 50; ++n)
        vec = mat * mat.transpose() * VectorXf::Random(SIZE);

      return vec[0] == 0.0;
    }

Test results:

  • with Eigen 3.2.10:

     g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

     real    0m0.040s
     user    0m0.037s
     sys     0m0.000s
  • with Eigen 3.3.0:

     g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

     real    0m8.112s
     user    0m7.700s
     sys     0m0.410s

Adding brackets to the line in the loop as follows:

  vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) ); 

makes a huge difference: both Eigen versions then perform equally well (in fact, 3.3.0 is slightly better), and both are faster than the unbracketed 3.2.10 case. So there is a workaround. But it is strange that 3.3.0 struggles so much here.
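
To spell out why the brackets help, here is a minimal sketch of the same loop written with a named temporary (the variable name tmp is mine); it is equivalent to the bracketed version:

     #define EIGEN_DONT_PARALLELIZE
     #include <Eigen/Dense>
     #define SIZE 200

     using namespace Eigen;

     int main (int argc, char* argv[]) {
       MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
       VectorXf vec = VectorXf::Random(SIZE);

       for (int n = 0; n < 50; ++n) {
         // Equivalent to vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) ):
         // two O(SIZE^2) matrix*vector products instead of an O(SIZE^3)
         // matrix*matrix product followed by a matrix*vector product.
         VectorXf tmp = mat.transpose() * VectorXf::Random(SIZE);
         vec = mat * tmp;
       }

       return vec[0] == 0.0;
     }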

I do not know whether this is a bug, but I thought it worth reporting in case it is something that needs fixing. Or maybe I am just doing it wrong...

Any thoughts appreciated. Cheers, Donald.


EDIT

As ggael noted, the EVD in Eigen 3.3 is faster when compiled with clang++, or with -O3 with g++. So problem 1 is solved.
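
For reference, this is the same build command as before with -O2 simply replaced by -O3:

     g++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD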

Problem 2 is not really a problem, since I can just add brackets to force the most efficient order of operations. But for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is such an incredible piece of software that I think this probably deserves fixing. Here is a modified version of the MWE, to show that it is unlikely to be related to the first temporary product being hoisted out of the loop (at least as far as I can tell):

    #define EIGEN_DONT_PARALLELIZE
    #include <Eigen/Dense>
    #include <iostream>
    #define SIZE 200

    using namespace Eigen;

    int main (int argc, char* argv[]) {
      VectorXf vec (SIZE);
      VectorXf vecsum = VectorXf::Zero(SIZE);  // accumulator must start at zero
      MatrixXf mat (SIZE,SIZE);

      for (int n = 0; n < 50; ++n) {
        mat = MatrixXf::Random(SIZE,SIZE);
        vec = VectorXf::Random(SIZE);
        vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
      }

      std::cout << vecsum.norm() << std::endl;
      return 0;
    }

In this example, all operands are initialized within the loop and the results are accumulated in vecsum, so the compiler cannot hoist anything out of the loop or optimize away any unnecessary computations. This shows the same behaviour (this time testing with clang++ -O3, version 3.9.0):

     $ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
     5467.82

     real    0m0.060s
     user    0m0.057s
     sys     0m0.000s

     $ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
     5467.82

     real    0m4.225s
     user    0m3.873s
     sys     0m0.350s

Same result, but vastly different execution times. Fortunately, this is easy to resolve by placing the brackets in the right places, but there does seem to be a regression somewhere in Eigen 3.3's evaluation of operations. Putting brackets around mat.transpose() * VectorXf::Random(SIZE) reduces the execution time to about 0.020 s for both Eigen versions (so Eigen 3.2.10 clearly benefits in this case too), as shown below. At least this means we can keep getting awesome performance out of Eigen!
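
For reference, the bracketed accumulation line in the modified MWE above looks like this:

     vecsum += mat * ( mat.transpose() * VectorXf::Random(SIZE) );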

In the meantime, ggael's answer told me everything I needed to know to move forward, so I have accepted it.

1 answer

For the EVD, I cannot reproduce the slowdown with clang. With gcc, you need -O3 to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.

EDIT: my previous answer regarding the matrix*matrix*vector product was wrong. This is a shortcoming in Eigen 3.3.0 that will be fixed in Eigen 3.3.1. For the record, I leave my previous analysis here, as it is still partly valid:

As you noticed, you should add brackets so that two matrix*vector products are performed instead of one big matrix*matrix product. Then the speed difference is easily explained by the fact that in 3.2 the nested matrix*matrix product is evaluated immediately (at nesting time), whereas in 3.3 it is evaluated lazily, only at evaluation time, i.e. inside operator=. This means that in 3.2 the loop is equivalent to:

     for (int n = 0; n < 50; ++n) {
       MatrixXf tmp = mat * mat.transpose();
       vec = tmp * VectorXf::Random(SIZE);
     }

and thus the compiler can move tmp out of the loop. Production code should not rely on the compiler for this kind of task, and should rather hoist constant expressions out of loops explicitly.

This is true, except that in practice the compiler was not smart enough to move the temporary out of the loop.
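
A minimal sketch of the explicit hoist recommended above (assuming mat does not change inside the loop):

     // The matrix*matrix product is loop-invariant, so compute it once
     // instead of once per iteration.
     MatrixXf tmp = mat * mat.transpose();
     for (int n = 0; n < 50; ++n)
       vec = tmp * VectorXf::Random(SIZE);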
