I've written a trivial benchmark comparing matrix multiplication performance in three languages: Fortran (Intel Parallel Studio 2015, compiled with the ifort switches /O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel, which replace MatMul calls with calls to the Intel MKL library), Python (the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked against the Intel MKL library) and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).
Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices large enough that function call overhead becomes negligible. However, this is far from the case: while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.
Here is the code I've used for the Fortran version:
integer, parameter :: N = 1024
integer :: i, j, cr, cm
real*8 :: t0, t1, rate
real*8 :: A(N,N), B(N,N), C(N,N)
! First initialize the system_clock
rate = real(cr)
WRITE(*,*) "system_clock rate: ", rate
do i = 1, 100, 1
write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)
end program MatMulTest
Note that if your system clock rate is not 10000, as it is in my case, you need to modify the timing calculation accordingly to yield milliseconds.
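The adjustment mentioned here amounts to dividing a tick difference by the clock rate. A minimal sketch in Python, used only for illustration: the function name and its arguments are hypothetical, and it assumes t0 and t1 are raw tick counts read before and after the timed loop.

```python
def ticks_to_ms_per_iteration(t0, t1, rate, iterations):
    """Convert a tick-count difference into average milliseconds per iteration.

    t0, t1     -- tick counts read before and after the timed loop
    rate       -- ticks per second reported by the clock
    iterations -- number of loop iterations that were timed
    """
    seconds = (t1 - t0) / float(rate)      # ticks -> seconds
    return seconds * 1000.0 / iterations   # seconds -> ms, averaged per iteration


# Made-up numbers: 50000 ticks at 10000 ticks/s over 100 iterations.
print(ticks_to_ms_per_iteration(0, 50000, 10000, 100))  # -> 50.0
```

Passing the rate in as a parameter avoids hard-coding 10000, so the same calculation works on systems that report a different clock rate.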
The Python code:
import time
import numpy as np
A = np.random.rand(N,N)
B = np.random.rand(N,N)
for i in range(100):
    C = np.dot(A,B)
if __name__ == "__main__":
    N = 1024
    t0 = time.clock()
    t1 = time.clock()
    print "Time elapsed: " + str((t1-t0)*10) + " ms"
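For readers on current Python versions: time.clock() was deprecated in Python 3.3 and removed in Python 3.8, and it measured wall-clock time on Windows but CPU time on Unix. A hedged sketch of an equivalent harness using time.perf_counter() follows. The helper name average_ms_per_call and the stand-in workload are hypothetical, not part of the original benchmark. It also makes the factor of 10 explicit: seconds times 1000 gives milliseconds, divided by 100 repeats.

```python
import time


def average_ms_per_call(func, repeats=100):
    """Return the average wall-clock time of func() in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        func()
    t1 = time.perf_counter()
    # seconds -> milliseconds (x1000), then divide by the number of calls
    return (t1 - t0) * 1000.0 / repeats


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would be the np.dot call.
    elapsed = average_ms_per_call(lambda: sum(range(10000)))
    print("Average time per call: %.3f ms" % elapsed)
```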
And, finally, the MATLAB snippet:
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])
On my system, the results are as follows:
Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms
CPU utilization is indistinguishable in all three cases (a steady 47-49% on an i7-920 D0 processor with HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly the same for arbitrary matrix sizes, with the exception that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.
Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that at least for larger matrices Fortran would have no meaningful advantage in this case.
You have two issues here:
Just fix these two things and retry... You might consider using date_and_time() rather than cpu_time() for that purpose.
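Assuming the point of this suggestion is the difference between wall-clock time and CPU time, the distinction can be illustrated with a small Python sketch. The helper name measure is hypothetical. It relies on time.perf_counter() for wall-clock time and time.process_time() for the CPU time of the current process.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall0 = time.perf_counter()   # wall-clock time
    cpu0 = time.process_time()    # user + system CPU time of this process
    func()
    wall1 = time.perf_counter()
    cpu1 = time.process_time()
    return wall1 - wall0, cpu1 - cpu0


if __name__ == "__main__":
    # Sleeping consumes wall-clock time but almost no CPU time.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```

The two clocks can diverge in either direction: a sleeping process accumulates wall time but little CPU time, while a multithreaded library call can accumulate more CPU time than wall time. When comparing implementations, it is safest to use the same kind of clock in every language.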