Matlab matrix speed

Question

Matlab matrix speed

I was asked to make MATLAB code faster, and I came across something that seems strange to me.

One of the functions has a cycle in which we multiply the 3x1 vector (let's call it x ) - 3x3 matrix (let's call it A ) - and transpose x , giving a scalar. The code has the whole set of stepwise multiplications and additions and is rather cumbersome:

 val = x(1)*A(1,1)*x(1) + x(1)*A(1,2)*x(2) + x(1)*A(1,3)*x(3) + ... x(2)*A(2,1)*x(1) + x(2)*A(2,2)*x(2) + x(2)*A(2,3)*x(3) + ... x(3)*A(3,1)*x(1) + x(3)*A(3,2)*x(2) + x(3)*A(3,3)*x(3);

I decided that I would simply replace all of this:

 val = x*A*x';

To my surprise, it worked much slower (as 4-5 times slower). Is it just that the vector and matrix are so small that the MATLAB optimization is not applied?

+4

performance matrix matrix-multiplication matlab

user888379 Aug 30 '13 at 2:34

source share

2 answers

@Amro gave an extensive answer, and I agree that in general you should not worry about writing explicit calculations and just use matrix multiplication everywhere in your code.

However, if your matrix is small enough and you really need to calculate something a few billion times, the written form can be much faster (less overhead). However, the trick is not to put your code in a separate function, since the overhead of the call will be much longer than the calculation time.

Here is a smalle example:

 x = 1:3; A = rand(3); v=0; unroll = @(x) A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3); regular = @(x) x*A*x'; %Written out, no function call tic for t = 1:1e6 v = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);; end t1=toc; %Matrix form, no function call tic for t = 1:1e6 v = x*A*x'; end t2=toc; %Written out, function call tic for t = 1:1e6 v = unroll(x); end t3=toc; %Matrix form, function call tic for t = 1:1e6 v = regular(x); end t4=toc; [t1;t2;t3;t4]

What will give these results:

 0.0767 1.6988 6.1975 7.9353

So, if you call it through a (anonymous) function, you won’t be interested in using the written form, however, if you really want to get the maximum speed just by using the written form, you can get more acceleration for tiny matrices.

+2

Dennis jaheruddin Aug 30 '13 at 8:11

source share

Amro · Accepted Answer · 2013-08-30T05:36:10+0000

EDIT: I improved the tests to give a more accurate time. I also optimized the deployed version, which is now much better than what I originally had, yet matrix multiplication happens faster as the size increases.

EDIT2: To make sure the JIT compiler works with deployed functions, I modified the code to write the generated functions as M files. Also, the comparison can now be considered fair, since both methods are evaluated by passing the TIMEIT to the function descriptor: timeit(@myfunc)

I'm not sure your approach is faster than matrix multiplication for reasonable sizes. So let's compare the two methods.

I use the Symbolic Math Toolbox to help me get the "expanded" form of the equation x'*A*x (try manually multiplying the 20x20 matrix and the 20x1 vector!):

 function f = buildUnrolledFunction(N) % avoid regenerating files, CCODE below can be really slow! fname = sprintf('f%d',N); if exist([fname '.m'], 'file') f = str2func(fname); return end % construct symbolic vector/matrix of the specified size x = sym('x', [N 1]); A = sym('A', [NN]); % work out the expanded form of the matrix-multiplication % and convert it to a string s = ccode(expand(x.'*A*x)); % instead of char(.) to avoid x^2 % a bit of RegExp to fix the notation of the variable names % also convert indexing into linear indices: A(3,3) into A(9) s = regexprep(regexprep(s, '^.*=\s+', ''), ';$', ''); s = regexprep(regexprep(s, 'x(\d+)', 'x($1)'), 'A(\d+)_(\d+)', ... 'A(${ int2str(sub2ind([NN],str2num($1),str2num($2))) })'); % build an M-function from the string, and write it to file fid = fopen([fname '.m'], 'wt'); fprintf(fid, 'function v = %s(A,x)\nv = %s;\nend\n', fname, s); fclose(fid); % rehash path and return a function handle rehash clear(fname) f = str2func(fname); end

I tried to optimize the generated function by avoiding exponentiation (we prefer x*x to x^2 ). I also converted indexes to linear indexes ( A(9) instead of A(3,3) ). Therefore, for n=3 we get the same equation as yours:

 >> s s = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3)

Given the above-described method for constructing M-functions, we now evaluate it for different sizes and compare it with the matrix-multiplication form (I put it in a separate function to take into account function calls). I use TIMEIT instead of tic/toc to get more accurate timings. Also, in order to have a fair comparison, each method is implemented as a function of the M file, which passes all the necessary variables as input arguments.

 function results = testMatrixMultVsUnrolled() % vector/matrix size N_vec = 2:50; results = zeros(numel(N_vec),3); for ii = 1:numel(N_vec); % some random data N = N_vec(ii); x = rand(N,1); A = rand(N,N); % matrix multiplication f = @matMult; results(ii,1) = timeit(@() feval(f, A,x)); % unrolled equation f = buildUnrolledFunction(N); results(ii,2) = timeit(@() feval(f, A,x)); % check result results(ii,3) = norm(matMult(A,x) - f(A,x)); end % display results fprintf('N = %2d: mtimes = %.6f ms, unroll = %.6f ms [error = %g]\n', ... [N_vec(:) results(:,1:2)*1e3 results(:,3)]') plot(N_vec, results(:,1:2)*1e3, 'LineWidth',2) xlabel('size (N)'), ylabel('timing [msec]'), grid on legend({'mtimes','unrolled'}) title('Matrix multiplication: $$x^\mathsf{T}Ax$$', ... 'Interpreter','latex', 'FontSize',14) end function v = matMult(A,x) v = x.' * A * x; end

Results:

 N = 2: mtimes = 0.008816 ms, unroll = 0.006793 ms [error = 0] N = 3: mtimes = 0.008957 ms, unroll = 0.007554 ms [error = 0] N = 4: mtimes = 0.009025 ms, unroll = 0.008261 ms [error = 4.44089e-16] N = 5: mtimes = 0.009075 ms, unroll = 0.008658 ms [error = 0] N = 6: mtimes = 0.009003 ms, unroll = 0.008689 ms [error = 8.88178e-16] N = 7: mtimes = 0.009234 ms, unroll = 0.009087 ms [error = 1.77636e-15] N = 8: mtimes = 0.008575 ms, unroll = 0.009744 ms [error = 8.88178e-16] N = 9: mtimes = 0.008601 ms, unroll = 0.011948 ms [error = 0] N = 10: mtimes = 0.009077 ms, unroll = 0.014052 ms [error = 0] N = 11: mtimes = 0.009339 ms, unroll = 0.015358 ms [error = 3.55271e-15] N = 12: mtimes = 0.009271 ms, unroll = 0.018494 ms [error = 3.55271e-15] N = 13: mtimes = 0.009166 ms, unroll = 0.020238 ms [error = 0] N = 14: mtimes = 0.009204 ms, unroll = 0.023326 ms [error = 7.10543e-15] N = 15: mtimes = 0.009396 ms, unroll = 0.024767 ms [error = 3.55271e-15] N = 16: mtimes = 0.009193 ms, unroll = 0.027294 ms [error = 2.4869e-14] N = 17: mtimes = 0.009182 ms, unroll = 0.029698 ms [error = 2.13163e-14] N = 18: mtimes = 0.009330 ms, unroll = 0.033295 ms [error = 7.10543e-15] N = 19: mtimes = 0.009411 ms, unroll = 0.152308 ms [error = 7.10543e-15] N = 20: mtimes = 0.009366 ms, unroll = 0.167336 ms [error = 7.10543e-15] N = 21: mtimes = 0.009335 ms, unroll = 0.183371 ms [error = 0] N = 22: mtimes = 0.009349 ms, unroll = 0.200859 ms [error = 7.10543e-14] N = 23: mtimes = 0.009411 ms, unroll = 0.218477 ms [error = 8.52651e-14] N = 24: mtimes = 0.009307 ms, unroll = 0.235668 ms [error = 4.26326e-14] N = 25: mtimes = 0.009425 ms, unroll = 0.256491 ms [error = 1.13687e-13] N = 26: mtimes = 0.009392 ms, unroll = 0.274879 ms [error = 7.10543e-15] N = 27: mtimes = 0.009515 ms, unroll = 0.296795 ms [error = 2.84217e-14] N = 28: mtimes = 0.009567 ms, unroll = 0.319032 ms [error = 5.68434e-14] N = 29: mtimes = 0.009548 ms, unroll = 0.339517 ms [error = 3.12639e-13] N = 30: mtimes = 0.009617 ms, unroll = 0.361897 ms [error = 1.7053e-13] N = 31: mtimes = 0.009672 ms, unroll = 0.387270 ms [error = 0] N = 32: mtimes = 0.009629 ms, unroll = 0.410932 ms [error = 1.42109e-13] N = 33: mtimes = 0.009605 ms, unroll = 0.434452 ms [error = 1.42109e-13] N = 34: mtimes = 0.009534 ms, unroll = 0.462961 ms [error = 0] N = 35: mtimes = 0.009696 ms, unroll = 0.489474 ms [error = 5.68434e-14] N = 36: mtimes = 0.009691 ms, unroll = 0.512198 ms [error = 8.52651e-14] N = 37: mtimes = 0.009671 ms, unroll = 0.544485 ms [error = 5.68434e-14] N = 38: mtimes = 0.009710 ms, unroll = 0.573564 ms [error = 8.52651e-14] N = 39: mtimes = 0.009946 ms, unroll = 0.604567 ms [error = 3.41061e-13] N = 40: mtimes = 0.009735 ms, unroll = 0.636640 ms [error = 3.12639e-13] N = 41: mtimes = 0.009858 ms, unroll = 0.665719 ms [error = 5.40012e-13] N = 42: mtimes = 0.009876 ms, unroll = 0.697364 ms [error = 0] N = 43: mtimes = 0.009956 ms, unroll = 0.730506 ms [error = 2.55795e-13] N = 44: mtimes = 0.009897 ms, unroll = 0.765358 ms [error = 4.26326e-13] N = 45: mtimes = 0.009991 ms, unroll = 0.800424 ms [error = 0] N = 46: mtimes = 0.009956 ms, unroll = 0.829717 ms [error = 2.27374e-13] N = 47: mtimes = 0.010210 ms, unroll = 0.865424 ms [error = 2.84217e-13] N = 48: mtimes = 0.010022 ms, unroll = 0.907974 ms [error = 3.97904e-13] N = 49: mtimes = 0.010098 ms, unroll = 0.944536 ms [error = 5.68434e-13] N = 50: mtimes = 0.010153 ms, unroll = 0.984486 ms [error = 4.54747e-13]

With small sizes, the two methods perform somewhat similarly. Although for N<7 extended version is superior to mtimes , the difference is hardly significant. As soon as we pass by tiny sizes, matrix multiplication is several orders of magnitude faster.

No wonder; Only N=20 formula results :

 N=3: Elapsed time is 0.062295 seconds. % unroll Elapsed time is 1.117962 seconds. % mtimes N=12: Elapsed time is 1.024837 seconds. % unroll Elapsed time is 1.126147 seconds. % mtimes N=19: Elapsed time is 140.915138 seconds. % unroll Elapsed time is 1.305382 seconds. % mtimes

... and it only gets worse for the deployed version. As I said, MATLAB is interpreted so that the cost of parsing the code begins to appear in such huge files.

The way I see it, after doing a million iterations, we only got 1 second at best, which, I think, does not justify all the problems and hacks, using the more read and compressed v=x'*A*x . So maybe there are other places in the code that can be improved, instead of focusing on an already optimized operation, such as matrix multiplication.

Matrix multiplication in MATLAB is seriously fast (this is what MATLAB works best for!). It really shines as soon as you reach big enough data (like multithreading )

 >> N=5000; x=rand(N,1); A=rand(N,N); >> tic, for i=1e4, v=x.'*A*x; end, toc Elapsed time is 0.021959 seconds.

Matlab matrix speed

More articles: