Specific MKL matrix multiplication performance difference between Fortran / Python / MATLAB

I wrote a trivial benchmark comparing the performance of matrix multiplication in three languages: Fortran (using Intel Parallel Studio 2015, compiling with the ifort switches /O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel, which replace MatMul calls with calls to the Intel MKL library), Python (using the current version of Anaconda, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked against the Intel MKL library) and MATLAB R2015a (which, again, performs matrix multiplication with the Intel MKL library).

Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices large enough that function call overheads become negligible. However, this is far from the case: while MATLAB and Python show almost identical performance, Fortran beats both by a factor of 2-3. I would like to understand why.

Here is the code I used for the Fortran version:

program MatMulTest

implicit none

integer, parameter :: N = 1024
integer :: i, j, cr, cm
real*8 :: t0, t1, rate
real*8 :: A(N,N), B(N,N), C(N,N)    

call random_seed()
call random_number(A)
call random_number(B)

! First initialize the system_clock
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = real(cr)
WRITE(*,*) "system_clock rate: ", rate

call cpu_time(t0)
do i = 1, 100, 1
    C=MatMul(A,B)                
end do
call cpu_time(t1)

write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)

end program MatMulTest

Note that if your system_clock rate is not 10000, as it is on my system, you will need to adjust the time calculation to get milliseconds.
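As a sketch of the arithmetic this note refers to (the listing queries the system_clock rate but times the loop itself with cpu_time), the Python helper below converts two system_clock tick counts into milliseconds per multiplication. The function name and its arguments are illustrative, not part of the original program; it assumes the start and end counts come from system_clock and uses COUNT_MAX only to undo a counter wrap-around.

```python
from typing import Optional


def elapsed_ms_per_iteration(start: int, end: int, count_rate: int,
                             iterations: int = 100,
                             count_max: Optional[int] = None) -> float:
    """Convert two system_clock COUNT values into milliseconds per iteration.

    count_rate is system_clock's COUNT_RATE (ticks per second); count_max is
    its COUNT_MAX, needed only if the counter may have wrapped around.
    """
    ticks = end - start
    if count_max is not None:
        # The counter resets to 0 after reaching count_max, so take the
        # difference modulo (count_max + 1) to keep it non-negative.
        ticks %= count_max + 1
    seconds = ticks / count_rate
    return seconds * 1000.0 / iterations


if __name__ == "__main__":
    # Rate 10000 ticks/s, 100 iterations: 38080 ticks is 38.08 ms per iteration.
    print(elapsed_ms_per_iteration(0, 38080, 10000))
```

On a machine with a different rate, only count_rate changes; the division by the number of iterations stays the same.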

Python Code:

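import time
import numpy as np

def main(N):
    A = np.random.rand(N,N)
    B = np.random.rand(N,N)
    for i in range(100):
        C = np.dot(A,B)
    print(C[0,0])

if __name__ == "__main__":
    N = 1024
    # time.clock() was removed in Python 3.8; on Windows it measured wall-clock time, on Unix CPU time
    t0 = time.clock()
    main(N)
    t1 = time.clock()
    # *10 converts total seconds over 100 iterations into ms per iteration
    print("Time elapsed: " + str((t1-t0)*10) + " ms")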

And finally, a MATLAB fragment:

N=1024;
A=rand(N,N); B=rand(N,N);
tic;
for i=1:100
     C=A*B;
end
t=toc;
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])

On my system, the results are as follows:

Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms
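To put these timings in perspective, a square N×N multiplication costs about 2N³ floating-point operations (N³ multiply-add pairs), so each timing can be turned into an achieved GFLOP/s figure. The Python sketch below does that. The peak figure is an assumption, not something stated in the question: 4 cores at 2.66 GHz and 4 double-precision FLOPs per cycle per core for this Nehalem-generation CPU, ignoring Turbo Boost, which raises the clock only modestly.

```python
N = 1024
# A square N x N GEMM performs N^3 multiply-add pairs, i.e. about 2*N^3 FLOPs.
FLOPS_PER_MATMUL = 2 * N**3

# Timings reported in the question, in milliseconds per multiplication.
REPORTED_MS = {"Fortran": 38.08, "Python": 104.29, "MATLAB": 97.36}

# ASSUMPTION: i7-920 peak = 4 cores x 2.66 GHz x 4 double-precision FLOPs/cycle
# (one 2-wide SSE add and one 2-wide SSE multiply per cycle); Turbo Boost ignored.
ASSUMED_PEAK_GFLOPS = 4 * 2.66 * 4


def gflops(ms: float, flops: int = FLOPS_PER_MATMUL) -> float:
    """Achieved GFLOP/s for one multiplication that takes `ms` milliseconds."""
    return flops / (ms * 1e-3) / 1e9


if __name__ == "__main__":
    for name, ms in REPORTED_MS.items():
        rate = gflops(ms)
        print(f"{name:8s}{rate:7.2f} GFLOP/s "
              f"({100 * rate / ASSUMED_PEAK_GFLOPS:5.1f}% of assumed peak)")
```

Under these assumptions the Fortran timing corresponds to about 56 GFLOP/s, above the assumed peak of about 42.6 GFLOP/s, while Python and MATLAB come out at roughly 21-22 GFLOP/s. A rate above the hardware peak hints that the Fortran timer is not measuring the same thing as the other two.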

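CPU usage is the same in all three cases (a steady 47-49% on an i7-920D0 with HT enabled during the computation). Furthermore, the relative performance stays roughly the same for arbitrary matrix sizes, except for very small matrices (N < 80 or so), where it helps to manually disable parallelization in Fortran.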
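The CPU usage figure can be read as a number of busy logical processors. The i7-920 has 4 physical cores, and with Hyper-Threading enabled the operating system sees 8 logical CPUs. The small Python calculation below assumes the 47-49% is a machine-wide figure of the kind Task Manager reports; the helper name is illustrative.

```python
# i7-920: 4 physical cores; with Hyper-Threading each core exposes 2 logical CPUs.
PHYSICAL_CORES = 4
THREADS_PER_CORE = 2
LOGICAL_CPUS = PHYSICAL_CORES * THREADS_PER_CORE


def busy_logical_cpus(usage_percent: float, logical_cpus: int = LOGICAL_CPUS) -> float:
    """Convert a machine-wide CPU usage percentage into busy logical CPUs."""
    return usage_percent / 100.0 * logical_cpus


if __name__ == "__main__":
    for usage in (47.0, 49.0):
        busy = busy_logical_cpus(usage)
        print(f"{usage:.0f}% of {LOGICAL_CPUS} logical CPUs ~ {busy:.2f} busy, "
              f"{busy / PHYSICAL_CORES:.2f} per physical core")
```

About 3.8-3.9 busy logical CPUs amounts to roughly one busy thread per physical core. Process CPU time accumulated by four busy threads grows about four times faster than wall-clock time, so timings taken with the two kinds of clock cannot be compared directly.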

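Is there an established reason for this discrepancy? Am I doing something wrong? I would think that, at least for larger matrices, Fortran should have no meaningful advantage here.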

Answer:

The three timings do not measure the same thing:

  • In Python, the timed region includes generating the random matrices, which is not the case in Fortran and MATLAB.
  • In Fortran you measure CPU time with cpu_time(), whereas in Python and MATLAB you measure elapsed (wall-clock) time. … 46% …
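A like-for-like Python measurement would generate the matrices outside the timed region and report both kinds of clock. The sketch below is one way to do that, not the answerer's code: the helper name time_calls is illustrative, time.perf_counter() gives wall-clock time, and time.process_time() gives the CPU time of the whole process (all of its threads on common platforms). The NumPy part mirrors the question's script.

```python
import time
from typing import Callable, Tuple


def time_calls(fn: Callable[[], object], repeats: int = 100) -> Tuple[float, float]:
    """Call fn() `repeats` times and return (wall_ms, cpu_ms) per call.

    Do any setup (such as generating the input matrices) before calling this,
    so that only fn() falls inside the timed region.
    """
    wall0 = time.perf_counter()   # wall-clock time
    cpu0 = time.process_time()    # CPU time of the whole process (all threads)
    for _ in range(repeats):
        fn()
    cpu1 = time.process_time()
    wall1 = time.perf_counter()
    return ((wall1 - wall0) * 1000.0 / repeats,
            (cpu1 - cpu0) * 1000.0 / repeats)


if __name__ == "__main__":
    import numpy as np

    N = 1024
    A = np.random.rand(N, N)   # generated outside the timed region
    B = np.random.rand(N, N)
    wall_ms, cpu_ms = time_calls(lambda: np.dot(A, B))
    print(f"wall: {wall_ms:.2f} ms/call, CPU: {cpu_ms:.2f} ms/call, "
          f"CPU/wall ratio: {cpu_ms / wall_ms:.1f}")
```

With a multithreaded MKL build and large N, the CPU/wall ratio is expected to be well above 1, which is why a CPU-time figure and a wall-clock figure should not be set side by side.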

Try measuring wall-clock time in Fortran with date_and_time() instead of cpu_time().
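In Fortran itself, both system_clock (already queried in the posted program) and date_and_time give wall-clock readings. DATE_AND_TIME(VALUES=v) fills an 8-element integer array with year, month, day, UTC offset in minutes, hour, minute, second and millisecond. To keep the new examples in one language, the Python sketch below shows the arithmetic for turning two such arrays into elapsed milliseconds; the same formula applies to v(5)..v(8) in Fortran. The helper names are illustrative, and the sketch assumes less than a day passes and no daylight-saving change occurs between the two readings.

```python
from typing import Sequence

MS_PER_DAY = 24 * 60 * 60 * 1000


def ms_since_midnight(values: Sequence[int]) -> int:
    """Milliseconds since local midnight from a DATE_AND_TIME VALUES array.

    Fortran's VALUES(1..8) are year, month, day, UTC offset (minutes), hour,
    minute, second, millisecond; indexed from 0 here, hour..ms are [4..7].
    """
    hour, minute, second, millis = values[4], values[5], values[6], values[7]
    return ((hour * 60 + minute) * 60 + second) * 1000 + millis


def elapsed_ms(start_values: Sequence[int], end_values: Sequence[int]) -> int:
    """Wall-clock milliseconds between two DATE_AND_TIME readings.

    Assumes less than one day elapsed; a midnight crossing is handled by
    wrapping the difference modulo one day.
    """
    diff = ms_since_midnight(end_values) - ms_since_midnight(start_values)
    return diff % MS_PER_DAY


if __name__ == "__main__":
    start = [2015, 7, 17, 120, 23, 59, 59, 900]   # hypothetical readings
    end = [2015, 7, 18, 120, 0, 0, 3, 708]
    total = elapsed_ms(start, end)
    print(total, "ms total,", total / 100, "ms per multiplication")
```

For the benchmark loop, dividing the elapsed time by 100 gives milliseconds per multiplication; the millisecond resolution of date_and_time is ample for a loop that runs for seconds.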


Source: https://habr.com/ru/post/1607478/

