Why is Boost's matrix multiplication slower than mine?

I implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):

    Result result = read();

    boost::numeric::ublas::matrix<int> C;
    C = boost::numeric::ublas::prod(result.A, result.B);

and another with the standard algorithm (see my full standard code):

    vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
                                       vector< vector<int> > B) {
        int n = A.size();

        // initialise C with 0s
        vector<int> tmp(n, 0);
        vector< vector<int> > C(n, tmp);

        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
        return C;
    }

This is how I check the speed:

    time boostImplementation.out > boostResult.txt
    diff boostResult.txt correctResult.txt

    time simpleImplementation.out > simpleResult.txt
    diff simpleResult.txt correctResult.txt

Both programs read a hard-coded text file that contains two 2000 x 2000 matrices. Both programs were compiled with these flags:

 g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic 

I got 15 seconds for my implementation and more than 4 minutes for the Boost implementation!

edit: after compiling with

 g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out 

I got 28.19 seconds for the ikj algorithm and 60.99 seconds for Boost. Thus, Boost is still significantly slower.

Why is it so much slower than my implementation?

+45
c++ performance boost ublas boost-ublas
Jun 19 '12 at 22:50
3 answers

The lower performance of the uBLAS version can be partly explained by uBLAS's debug features, as pointed out by TJD.

Here is the time taken by the debug version of uBLAS:

    real    0m19.966s
    user    0m19.809s
    sys     0m0.112s

Here is the time taken by the release version of uBLAS (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

    real    0m7.061s
    user    0m6.936s
    sys     0m0.096s

So with debugging switched off, the uBLAS version is almost 3 times faster.

The remaining performance difference can be explained by the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS?":

An important design goal of ublas is to be as general as possible.

This generality almost always comes at a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods:

    ijkalgorithm   prod    axpy_prod   block_prod
           1.335   7.061       1.330        1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.
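
For illustration, here is a minimal sketch of how such a comparison can be timed without I/O. The 1000x1000 size and the random fill are placeholders, and the header locations (operation.hpp for axpy_prod, operation_blocked.hpp for block_prod) are taken from the Boost.uBLAS sources and worth verifying against your Boost version:

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>         // axpy_prod
    #include <boost/numeric/ublas/operation_blocked.hpp> // block_prod

    using namespace boost::numeric::ublas;

    int main() {
        const std::size_t n = 1000;            // placeholder size
        matrix<int> A(n, n), B(n, n), C(n, n);
        for (std::size_t i = 0; i < n; ++i)    // placeholder fill
            for (std::size_t j = 0; j < n; ++j) {
                A(i, j) = std::rand() % 10;
                B(i, j) = std::rand() % 10;
            }

        std::clock_t start = std::clock();     // time the computation only
        axpy_prod(A, B, C, true);              // C = A * B, dense-optimized
        double s1 = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::cout << "axpy_prod:  " << s1 << " s" << std::endl;

        start = std::clock();
        C = block_prod<matrix<int>, 64>(A, B); // block size 64, as above
        double s2 = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::cout << "block_prod: " << s2 << " s" << std::endl;
        return 0;
    }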

See also the uBLAS FAQ and Effective uBlas and general code optimization.

+46
Jun 19 '12

I believe your compiler isn't optimizing enough. uBLAS code makes heavy use of templates, and templates require heavy use of optimization. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices; it gives me

10.064s for uBLAS

7.851s for vector

The difference still exists, but it is by no means overwhelming. The core concept of uBLAS is lazy evaluation, so prod(A, B) evaluates the result only when needed. For example, prod(A, B)(10, 100) will execute in the blink of an eye, since only that one element will actually be computed. Accordingly, there is in fact no dedicated algorithm for whole-matrix multiplication that could be optimized (see below).
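
The lazy behaviour can be sketched like this (the matrix names and the element index are illustrative):

    // A and B are dense ublas matrices, as in the question's code.
    // prod(A, B) only builds an expression object here; asking for element
    // (10, 100) evaluates just that single inner product, not the whole matrix.
    int c = boost::numeric::ublas::prod(A, B)(10, 100);

But you can help the library a bit by declaring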

 matrix<int, column_major> B; 

will reduce the runtime to 4.426s, which beats your function with one hand tied behind its back. This declaration makes memory access more sequential during matrix multiplication, optimizing cache usage.
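
If B has already been read in with the default row_major layout (as in the question's code), one way to get the column-major version is a converting copy. A sketch, reusing the question's result.A and result.B names as an assumption:

    using namespace boost::numeric::ublas;
    // Copy-constructing a column_major matrix from a row_major one converts
    // the storage layout; the O(n^2) copy is cheap next to the O(n^3) product.
    matrix<int, column_major> Bc(result.B);
    matrix<int> C = prod(result.A, Bc);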

P.S. Having read the uBLAS documentation to the end ;), you should have found that there actually is a dedicated function to multiply whole matrices in one go. Two functions, in fact: axpy_prod and opb_prod. So

 opb_prod(A, B, C, true); 

even with an unoptimized row_major B runs in 8.091 sec and is on par with your vector algorithm.

P.P.S. There are even more optimizations:

 C = block_prod<matrix<int>, 1024>(A, B); 

runs in 4.4 s, regardless of whether B is column_major or row_major. Note the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!
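
A related uBLAS idiom worth knowing in this performance context, sketched here from the uBLAS documentation: a plain C = prod(A, B) assignment evaluates into a temporary to guard against aliasing, and noalias lets you skip that temporary when the target does not overlap the operands:

    // Safe default: uBLAS evaluates the product into a temporary, then assigns.
    C = prod(A, B);
    // If C is known not to alias A or B, assign directly and skip the temporary:
    noalias(C) = prod(A, B);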

+13
Jun 21 '12 at 11:23

I created a small website, Matrix-Matrix Product Experiments with uBLAS. It is about integrating a new implementation of the matrix-matrix product into uBLAS. If you already have the Boost library, it consists of only 4 additional files, so it is pretty much self-contained.

I would be interested if others could run these simple benchmarks on different machines.

+2
Jan 22 '16 at 10:45


