Matrix Multiply> for loop> bsxfun - Odd Speed ​​Results

I have an n by n matrix A, a set of n coefficients k (n by 1) and a matrix called a row (1 by n). My goal is to subtract the row weighted by the ith coefficient in k from the ith row A. I have three questions: Why does my for-loop work better than the built-in matrix, in general, which explains the superiority of each method over in the next, and is there a better way than the three that I came with?

% Define row & coefficients used in each method k = (1:1000).'; row = 1:1000; % Method 1 (matrix multiply) ~15 seconds A = magic(1e3); tic; for z = 1:1e3 A = A - k*row; end toc; % Method 2 (for loop) ~11 seconds B = magic(1e3); tic; for z = 1:1e3 for cr = 1:1000 B(cr,:) = B(cr,:) - k(cr)*row; end end toc; % method 3 (bsxfun) ~ 4 seconds C = magic(1e3); tic; for z = 1:1e3 C = C - bsxfun(@times, k, row); end toc isequal(A,B) isequal(A,C) 

Please note that I am doing these line subtractions in the algorithm. I simplified the code a bit by creating this test example of toys, but the essence of the calculation is still present. In addition, to avoid confusion, the for loop with z is used to increase time.

+4
source share
1 answer

Short answer : a faster version of your code is this:

 tic; for z = 1:1e3 for cr = 1:1000 B(:,cr) = B(:,cr) - k*row(cr); end end toc; 

You might want to see my previous answer to this question . In short, your loop worked in rows, and MATLAB worked in columns. This means that the columns are contiguous in memory. The original loop is repeated line by line, which is inefficient.

Runtime on my computer:

 % A - k*row Elapsed time is 4.370238 seconds. % B(cr,:) = B(cr,:) - k(cr)*row; Elapsed time is 9.537338 seconds. % C = C - bsxfun(@times, k, row); Elapsed time is 3.039836 seconds. B(:,cr) = B(:,cr) - k*row(cr); Elapsed time is 2.028186 seconds. 

The explanation . Your first version is not multiplied by a matrix, but an external product of two vectors, which leads to a 1000 x 1000 matrix. The hole calculation is a BLAS2 update of rank 1 (A = alphaxy '+ A - GER function). The problem, most likely, is that MATLAB does not recognize it as such and instead runs the code as it understands it, that is, it explicitly performs all operations, including k * row. This is the problem with this solution. An external product has to allocate additional memory equal to the size of your matrix, which in itself takes time. Consider this:

  • memory allocation - since we do not know how MATLAB controls memory allocation, this can mean a lot of overhead .
  • reading vectors k, string
  • Record Results Matrix (KR)
  • reading KR to subtract from A
  • read and write A

Two matrices 1000 * 1000 - 16 MB - I doubt that you have so much cache. For this reason, this version is not the best and may be slower than the "inefficient memory" cycle, depending on the available memory bandwidth and the size of the processor cache.

There is no need to select the KR matrix and store the values ​​explicitly in memory - you can calculate the necessary products in a loop. Therefore, you are actually half the memory bandwidth requirements and delete the memory allocation overhead!

  • reading vectors k, string
  • read and write a

Assuming that one vector fits into the cache (which it does - 1000 * 8 bytes are not many), you read k and a line from memory only once. Now an algorithm with a loop over columns makes sense (this is most likely how BLAS implements this calculation)

  • read k and line and store them in cache
  • stream A with full memory bandwidth to the processor, subtract the product k * row, the stream back to memory

Now the final considerations of efficiency. Take my loop structure. At each iteration, we read and write A and read the vectors. This is 16 MB of data transferred to iteration. 1000 iterations give 16 GB of data moved in total. The two seconds required to calculate the result provide an effective memory bandwidth of 8 GB / s. My system has a throughput of 16 GB / s when using 2 processor cores, about 11-12 GB / s when using it. Therefore, this sequential cycle operates at the level of 60-70% on one core. Not bad considering it's a MATLAB loop :)

Note also that, at least on my computer, the column-based loop version (slightly larger) is twice as fast as the A-krow implementation (2s versus 4.37s). This strongly indicates that krow is indeed explicitly executed and created by MATLAB, and the total amount of memory traffic is twice as much as in the looped version. Consequently, performance is two times worse.

Change You can try to do this, as in the first algorithm, by calling the corresponding BLAS function directly. Check out this FEX contribution . It allows you to call BLAS and LAPACK functions directly from MATLAB. May be helpful.

+6
source

Source: https://habr.com/ru/post/1435758/


All Articles