Most likely the problem is that the direct matrix multiplication, which you may have assumed to be single-threaded, is performed by an optimized library routine; in the case of OpenBLAS it is already multithreaded. For 2000x2000 matrices, the simple multiplication
@time c = sa * sb;
takes about 0.3 seconds multi-threaded and about 0.7 seconds single-threaded.
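You can reproduce the comparison by pinning the number of BLAS threads (a sketch; `sa` and `sb` here are ordinary dense matrices, and the exact timings depend on your machine):

using LinearAlgebra

n = 2000
sa = rand(n, n); sb = rand(n, n)

BLAS.set_num_threads(Sys.CPU_THREADS)   # let OpenBLAS use all cores
@time sa * sb                           # multi-threaded multiply

BLAS.set_num_threads(1)                 # force a single BLAS thread
@time sa * sb                           # single-threaded multiply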
Splitting the multiplication along one dimension makes the timings much worse, reaching about 17 seconds in single-threaded mode:
@time for j = 1:n sc[:,j] = sa[:,:] * sb[:,j] end
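Part of that cost is avoidable: `sa[:,:]` copies the whole matrix and `sb[:,j]` copies a column on every iteration, and each iteration is a matrix-vector (BLAS-2) call instead of one big matrix-matrix (BLAS-3) multiply. A sketch that at least avoids the copies, assuming `sc` is preallocated as an n x n matrix (still single-threaded and still slower than the plain multiply):

@time for j = 1:n
    @views mul!(sc[:, j], sa, sb[:, j])   # in-place matrix-vector product, no temporary copies
end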
Shared arrays
A solution to your problem may be to use shared arrays, which let processes on the same machine operate on the same underlying data. Note that shared arrays are still marked as experimental.
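A minimal setup sketch (modern syntax, where shared arrays live in the SharedArrays standard library; the worker count of 4 is illustrative):

using Distributed
addprocs(4)                                # 4 worker processes on this machine
@everywhere using SharedArrays, LinearAlgebra

n = 2000
sa = SharedArray{Float64}(n, n); sa .= rand(n, n)
sb = SharedArray{Float64}(n, n); sb .= rand(n, n)
sc = SharedArray{Float64}(n, n)            # result, written in parallel by the workers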
Then you need a function that performs the matrix multiplication on one worker's subset of the columns:
@everywhere function mymatmul!(n, w, sa, sb, sc)
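    # NOTE: the original answer did not show this body; the following is a
    # plausible reconstruction, not the author's exact code. The assumption:
    # worker w computes one contiguous block of columns of sc.
    nw    = nworkers()                     # number of worker processes
    chunk = cld(n, nw)                     # columns per worker, rounded up
    # worker ids start at 2, so worker w takes block number w - 1
    cols  = ((w - 2) * chunk + 1):min((w - 1) * chunk, n)
    # each worker writes only its own columns of sc, so no locking is needed
    @views mul!(sc[:, cols], sa, sb[:, cols])
    return nothing
end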
Finally, the main process tells the workers to each process their part:
@time @sync for w in workers()
    # each @async task waits for its worker's remote call to finish
    @async remotecall_wait(mymatmul!, w, n, w, sa, sb, sc)
end
which takes about 0.3 seconds, the same as the multi-threaded single-process time.
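As a sanity check (not in the original answer), the distributed result can be compared against the plain BLAS product:

c = sa * sb          # reference result computed by (multi-threaded) BLAS
sc ≈ c               # should be true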