I suspect your build is not optimized enough. uBLAS makes heavy use of templates, and template-heavy code depends on aggressive compiler optimization to perform well. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices, and it gives me:
10.064 s for uBLAS
7.851 s for vector
The difference is still there, but it is by no means overwhelming. The core concept of uBLAS is lazy evaluation: prod(A, B) computes results only when they are needed. For example, prod(A, B)(10,100) executes in the blink of an eye, because only that one element is actually calculated. As a consequence, prod has no dedicated algorithm for a complete matrix multiplication that could be optimized (see below). But you can help the library a bit by declaring B as column-major:
matrix<int, column_major> B;
This reduces the runtime to 4.426 s, which beats your function hands down. Each element of the product walks along a row of A and down a column of B, so storing B column by column makes both walks contiguous in memory and improves cache usage.
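To make the two ideas above concrete, here is a minimal sketch in plain C++. It is not how uBLAS is implemented; DenseMatrix, Layout, LazyProduct and lazy_prod are hypothetical names made up for illustration. It only shows that a lazy product can compute a single element on demand, and how the storage order changes which accesses are adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not uBLAS types.
enum class Layout { RowMajor, ColumnMajor };

struct DenseMatrix {
    std::size_t rows, cols;
    Layout layout;
    std::vector<int> data;

    DenseMatrix(std::size_t r, std::size_t c, Layout l)
        : rows(r), cols(c), layout(l), data(r * c, 0) {}

    // Map (i, j) to a flat index according to the storage order.
    std::size_t index(std::size_t i, std::size_t j) const {
        return layout == Layout::RowMajor ? i * cols + j : j * rows + i;
    }
    int& operator()(std::size_t i, std::size_t j) { return data[index(i, j)]; }
    int operator()(std::size_t i, std::size_t j) const { return data[index(i, j)]; }
};

// Lazy product: holds references only, nothing is computed on construction.
// The operands must outlive the LazyProduct object.
struct LazyProduct {
    const DenseMatrix& a;
    const DenseMatrix& b;

    // Compute exactly one element of A*B with a single dot product.
    int operator()(std::size_t i, std::size_t j) const {
        assert(a.cols == b.rows);
        int sum = 0;
        for (std::size_t k = 0; k < a.cols; ++k)
            sum += a(i, k) * b(k, j);  // row i of A, column j of B
        return sum;
    }
};

LazyProduct lazy_prod(const DenseMatrix& a, const DenseMatrix& b) {
    return LazyProduct{a, b};
}
```

With A row-major and B column-major, the inner loop over k touches consecutive elements of both operands. With a row-major B, the same loop jumps b.cols elements on every step, which is the cache-unfriendly pattern the column_major declaration avoids.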
PS If you read the uBLAS documentation to the end ;), you will discover that there actually are dedicated functions for multiplying whole matrices at once. Two of them, in fact: axpy_prod and opb_prod. So
opb_prod(A, B, C, true);
even with an unoptimized row_major B runs in 8.091 s and is on par with your vector algorithm.
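As far as I understand it, opb_prod builds the result from outer products, and the last argument says whether C is cleared first. The following plain C++ sketch shows that formulation; it is my own illustration, not the uBLAS source, and Matrix and outer_product_multiply are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major stand-in for a dense matrix (not a uBLAS type).
using Matrix = std::vector<std::vector<int>>;

// Accumulate A*B into C as a sum of outer products:
// for each k, add (column k of A) times (row k of B) to C.
// If init is true, C is zeroed first; otherwise C must already be n x m.
void outer_product_multiply(const Matrix& a, const Matrix& b, Matrix& c, bool init) {
    const std::size_t n = a.size();
    const std::size_t inner = b.size();
    const std::size_t m = inner ? b[0].size() : 0;
    assert(c.size() == n);
    if (init)
        for (auto& row : c) row.assign(m, 0);
    for (std::size_t k = 0; k < inner; ++k)
        for (std::size_t i = 0; i < n; ++i) {
            const int aik = a[i][k];
            // Row k of B and row i of C are both walked left to right,
            // which suits row-major storage.
            for (std::size_t j = 0; j < m; ++j)
                c[i][j] += aik * b[k][j];
        }
}
```

Because the innermost loop runs along rows of both B and C, this ordering does not need B to be column-major, which fits the timing above for a row_major B.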
PPS There are even more optimizations:
C = block_prod<matrix<int>, 1024>(A, B);
runs in 4.4 s, regardless of whether B is column_major or row_major. Note the description in the documentation: "The block_prod function is for large dense matrices." Choose specific tools for specific tasks!
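For anyone wondering what a blocked product does in general, here is a minimal sketch of tiled matrix multiplication in plain C++. It illustrates the technique only and makes no claim about how block_prod is written internally; Matrix and blocked_multiply are hypothetical names, and the block parameter plays the role I assume the second template argument plays in block_prod.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major stand-in for a dense matrix (not a uBLAS type).
using Matrix = std::vector<std::vector<int>>;

// Multiply A (n x inner) by B (inner x m) tile by tile, so that each
// tile of A, B and C is reused while it is likely still in cache.
Matrix blocked_multiply(const Matrix& a, const Matrix& b, std::size_t block) {
    assert(block > 0);
    const std::size_t n = a.size();
    const std::size_t inner = b.size();
    const std::size_t m = inner ? b[0].size() : 0;
    Matrix c(n, std::vector<int>(m, 0));

    for (std::size_t i0 = 0; i0 < n; i0 += block)
        for (std::size_t k0 = 0; k0 < inner; k0 += block)
            for (std::size_t j0 = 0; j0 < m; j0 += block) {
                // Clamp tile bounds so sizes need not be multiples of block.
                const std::size_t i1 = std::min(i0 + block, n);
                const std::size_t k1 = std::min(k0 + block, inner);
                const std::size_t j1 = std::min(j0 + block, m);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const int aik = a[i][k];
                        for (std::size_t j = j0; j < j1; ++j)
                            c[i][j] += aik * b[k][j];
                    }
            }
    return c;
}
```

The result is the same for any positive block size; only the order in which memory is visited changes, which is why the layout of B matters much less here.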
panda-34 Jun 21 '12 at 11:23