MATLAB speed optimization

Can anyone help? I am a pretty experienced Matlab user, but I am unable to speed up the code below.

The fastest time I could achieve for one pass through all three loops, using 12 cores, is ~200 s. The actual function will be called ~720 times, so the total run time will exceed 40 hours. According to the MATLAB profiler, most of the CPU time is spent in the call to the exponential function. I was able to speed this up significantly by using gpuArray and running the exp call on a Quadro 4000 graphics card; however, this prevents the use of the parfor loop, since the workstation has only one graphics card, and that wipes out any gain. Can someone help, or is this code close to the best that can be achieved with MATLAB? I also wrote a very crude C++ implementation using OpenMP, but saw only a small gain.

Thank you very much in advance

    function SPEEDtest_CPU

    % Variable setup:
    % - For testing I'll use random variables. These will actually be fed into
    %   the function for the real version of this code.
    sy = 320;
    sx = 100;
    sz = 32;
    A = complex(rand(sy,sx,sz),rand(sy,sx,sz));
    B = complex(rand(sy,sx,sz),rand(sy,sx,sz));
    C = rand(sy,sx);
    D = rand(sy*sx,1);
    F = zeros(sy,sx,sz);
    x = rand(sy*sx,1);
    y = rand(sy*sx,1);
    x_ind = (1:sx) - (sx / 2) - 1;
    y_ind = (1:sy) - (sy / 2) - 1;

    % MAIN LOOPS
    % - In the real code this set of three loops will be called ~720 times!
    % - Using 12 cores, the fastest I have managed is ~200 seconds for one
    %   call of this function.
    tic
    for z = 1 : sz
        A_slice = A(:,:,z);
        A_slice = A_slice(:);
        parfor cx = 1 : sx
            for cy = 1 : sy
                E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
                F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice;
            end
        end
    end
    toc

    end
+6
8 answers

Some things to think about:

Have you considered using single precision (single) instead of double?

Can you vectorize the cx, cy part so that it becomes array operations?

Consider changing the floating-point rounding or exception-signalling modes.
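As a rough sketch of the single-precision suggestion (small illustrative sizes, not the ones from the question): casting the inputs to single keeps the whole expression in single. Whether roughly seven significant digits are enough for the real data is an assumption you would have to check yourself.

```matlab
% Illustrative size only; the question uses sy*sx = 32000 elements.
n = 1000;
x = rand(n, 1);
A_slice = complex(rand(n, 1), rand(n, 1));

% Double-precision reference result.
ref = exp(-1i .* x)' * A_slice;

% Same computation with single-precision inputs.
xs = single(x);
As = single(A_slice);
res = exp(-1i .* xs)' * As;

% Relative error introduced by the lower precision.
relErr = abs(double(res) - ref) / abs(ref);
fprintf('class: %s, relative error: %.2e\n', class(res), relErr);
```

Any speed gain depends on the hardware, so it is worth timing both versions with tic/toc before committing to it.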

+3

If your data is real (not complex), as in your example, you can save time by replacing

 (B(cy,cx,z) .* exp(-1i .* E))' 

by

 (B(cy,cx,z) .* (cos(E)+1i*sin(E))).' 

In particular, on my machine (cos(x)+1i*sin(x)).' takes 19% less time than exp(-1i .* x)' .


If A and B are complex: E is still real, so you can compute Bconj = conj(B) outside the loops (it takes about 10 ms at this data size, and it only happens once), and then replace

 (B(cy,cx,z) .* exp(-1i .* E))' 

by

 (Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).' 

to get a similar gain.
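A quick way to convince yourself that the two forms agree is the check below. It is only a sketch with made-up random data; b stands in for a single element B(cy,cx,z).

```matlab
n = 1000;
E = rand(n, 1);                              % real phase, as in the question
b = complex(rand, rand);                     % stand-in for one element of B
A_slice = complex(rand(n, 1), rand(n, 1));

% Original form: complex-conjugate transpose of b .* exp(-1i*E).
orig = (b .* exp(-1i .* E))' * A_slice;

% Rewritten form: conjugate b up front, use cos/sin, plain transpose.
alt = (conj(b) .* (cos(E) + 1i*sin(E))).' * A_slice;

relDiff = abs(orig - alt) / abs(orig);
fprintf('relative difference: %.2e\n', relDiff);
```

The difference should be at rounding-error level, because conj(exp(-1i*E)) equals cos(E) + 1i*sin(E) whenever E is real.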

+2

There are two main ways to speed up MATLAB code: preallocation and vectorization.

You have preallocated, but you have not vectorized. To learn how to do this, you need a good grasp of linear algebra and of using repmat to expand vectors into several dimensions.

Vectorization can give speedups of several orders of magnitude and makes optimal use of the cores (provided multithreading is enabled).

What mathematical expression are you computing? Maybe I can give you a hand.
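For what it is worth, here is one possible way to vectorize the inner cy loop, shown on a single z-slice with tiny sizes. It is a sketch, not a tuned solution: it assumes implicit expansion (MATLAB R2016b or later; older releases would need bsxfun), and the names E_all, F_loop and F_vec are mine.

```matlab
sy = 6; sx = 4; n = sy*sx;                   % tiny illustrative sizes
A_slice = complex(rand(n,1), rand(n,1));
B = complex(rand(sy,sx), rand(sy,sx));       % one z-slice of B
C = rand(sy,sx);
D = rand(n,1);
x = rand(n,1);
y = rand(n,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;

% Reference: the double loop from the question.
F_loop = zeros(sy,sx);
for cx = 1:sx
    for cy = 1:sy
        E = x .* x_ind(cx) + y .* y_ind(cy) + C(cy,cx) .* D;
        F_loop(cy,cx) = (B(cy,cx) .* exp(-1i .* E))' * A_slice;
    end
end

% Vectorized over cy: one n-by-sy phase matrix per cx.
F_vec = zeros(sy,sx);
for cx = 1:sx
    E_all = x .* x_ind(cx) + y * y_ind + D * C(:,cx).';   % column cy is E for (cy,cx)
    F_vec(:,cx) = (B(:,cx).' .* exp(-1i .* E_all))' * A_slice;
end

maxDiff = max(abs(F_loop(:) - F_vec(:)));
fprintf('max difference: %.2e\n', maxDiff);
```

With the sizes from the question, E_all would be 32000-by-320, about 82 MB as real doubles, so one exp call and one matrix-vector product replace 320 small ones per cx. Whether that is actually faster on your machine needs measuring.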

+1

You can move x .* x_ind(cx) out of the innermost loop. I don't have a GPU to test timings, but you can split the code into three sections so that you can use both the GPU and parfor

    for z = 1 : sz
        % Note: E holds sy*sx*sx*sy doubles, about 8 GB for the sizes in the question.
        E = zeros(sy*sx,sx,sy);
        A_slice = A(:,:,z);
        A_slice = A_slice(:);
        parfor cx = 1 : sx
            temp = ( x .* x_ind(cx) );
            for cy = 1 : sy
                E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
            end
        end
        temp = zeros(sy*sx,sx,sy);
        for cx = 1 : sx
            for cy = 1 : sy
                % Ideally use your GPU magic here
                temp(:, cx, cy) = exp(-1i .* E(:, cx, cy));
            end
        end
        parfor cx = 1 : sx
            for cy = 1 : sy
                F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy))' * A_slice;
            end
        end
    end
+1

To ensure proper parallelization, you need to make sure the loop iterations are completely independent, so check whether assigning E inside each iteration helps.

Also, try to vectorize as much as possible. One simple example: y .* y_ind(cy)

If you build the correct index for all values at once, you can pull this out of the innermost loop.
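As a small illustration of that last point (my own sketch, with the hypothetical name Yterm): an outer product computes y .* y_ind(cy) for every cy at once, so it can be built a single time before any loop starts.

```matlab
sy = 6; n = 24;                      % tiny illustrative sizes
y = rand(n,1);
y_ind = (1:sy) - (sy / 2) - 1;

% Outer product: n-by-sy matrix whose column cy equals y .* y_ind(cy).
Yterm = y * y_ind;

% Compare each column against the per-iteration expression.
maxDiff = 0;
for cy = 1:sy
    maxDiff = max(maxDiff, max(abs(Yterm(:,cy) - y .* y_ind(cy))));
end
fprintf('max difference: %.2e\n', maxDiff);
```

Inside the loops you would then read Yterm(:,cy) instead of recomputing the product. For the question's sizes Yterm is 32000-by-320 doubles, roughly 82 MB.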

0

Not sure if this helps with speed, but since E is basically a sum, maybe you can use the identity exp(i*(cx+1)*x) = exp(i*cx*x) * exp(i*x), where exp(i*x) can be calculated in advance.

That way, you would not need to evaluate exp each iteration, but just need to multiply, which should be faster.
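A minimal sketch of that recurrence for the x term only, relying on the fact that x_ind in the question consists of consecutive integers. The names step and cur are mine, and the direct exp call inside the loop is there purely to check the result.

```matlab
n = 1000; sx = 8;                    % small illustrative sizes
x = rand(n,1);
x_ind = (1:sx) - (sx / 2) - 1;       % consecutive integers, step of 1

step = exp(-1i .* x);                % computed once
cur  = exp(-1i .* x .* x_ind(1));    % factor for the first cx

maxErr = 0;
for cx = 1:sx
    direct = exp(-1i .* x .* x_ind(cx));          % comparison only
    maxErr = max(maxErr, max(abs(cur - direct)));
    cur = cur .* step;                            % advance to the next x_ind
end
fprintf('max error: %.2e\n', maxErr);
```

Two caveats: rounding error accumulates with every multiplication, and the C(cy,cx) .* D term does not step by a fixed amount, so as far as I can see its exponential still has to be evaluated for each (cy,cx) pair.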

0

In addition to the good recommendations others have given here, the multiplication by A_slice is independent of the cx,cy loops and can be moved outside them, multiplying F after both loops have completed.

Similarly, the conjugation of B*exp(...) can be done in bulk outside the cx,cy loops, before multiplying by A_slice.
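One reading of this that is easy to verify (my interpretation, not necessarily what the answer had in mind): B(cy,cx,z) is a scalar inside the loop, so its conjugate factors out of the product and can be applied to the whole slice after the loops. G is a hypothetical name for the result without the B factor.

```matlab
sy = 5; sx = 3; n = sy*sx;                   % tiny illustrative sizes
A_slice = complex(rand(n,1), rand(n,1));
B = complex(rand(sy,sx), rand(sy,sx));       % one z-slice of B
C = rand(sy,sx);
D = rand(n,1);
x = rand(n,1);
y = rand(n,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;

F_ref = zeros(sy,sx);                        % original formula
G = zeros(sy,sx);                            % same sum without the B factor
for cx = 1:sx
    for cy = 1:sy
        E = x .* x_ind(cx) + y .* y_ind(cy) + C(cy,cx) .* D;
        F_ref(cy,cx) = (B(cy,cx) .* exp(-1i .* E))' * A_slice;
        G(cy,cx) = exp(-1i .* E)' * A_slice;
    end
end

% Apply the conjugated B to the whole slice in one step.
F_bulk = conj(B) .* G;

maxDiff = max(abs(F_ref(:) - F_bulk(:)));
fprintf('max difference: %.2e\n', maxDiff);
```

This removes one multiplication over the full vector per iteration. The exp call still dominates, so I would expect the gain to be modest.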

0

This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );

is some type of convolution, isn't it? Circular convolution is much faster in the frequency domain, and conversion to and from the frequency domain is optimized using the FFT.

0

Source: https://habr.com/ru/post/955243/

