Problem description:

I've written a trivial benchmark comparing matrix multiplication performance in three languages: Fortran (using Intel Parallel Studio 2015, compiled with the ifort switches `/O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel`, which replace `MatMul` calls with calls to the Intel MKL library), Python (using the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked against the Intel MKL library), and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).

Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices large enough that function call overhead becomes negligible. However, this is far from the case: while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.

Here is the code I've used for the Fortran version:

```fortran
program MatMulTest

    implicit none

    integer, parameter :: N = 1024
    integer :: i, j, cr, cm
    real*8 :: t0, t1, rate
    real*8 :: A(N,N), B(N,N), C(N,N)

    call random_seed()
    call random_number(A)
    call random_number(B)

    ! First initialize the system_clock
    CALL system_clock(count_rate=cr)
    CALL system_clock(count_max=cm)
    rate = real(cr)
    WRITE(*,*) "system_clock rate: ", rate

    call cpu_time(t0)
    do i = 1, 100, 1
        C=MatMul(A,B)
    end do
    call cpu_time(t1)

    write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
    write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)

end program MatMulTest
```

Note that if your system clock rate is not 10000, as it is in my case, you need to modify the timing calculation accordingly to yield milliseconds.
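The conversion that note refers to can be sketched as follows, in Python for brevity. This is only an illustration: the function name `average_ms`, its parameters, and the tick values are hypothetical and do not come from the original program. It assumes a timer that reports integer ticks together with a tick rate per second, as `system_clock` does.

```python
def average_ms(count_start, count_end, count_rate, iterations):
    """Convert a tick interval to the average time per iteration in ms."""
    # ticks / (ticks per second) gives elapsed seconds
    elapsed_seconds = (count_end - count_start) / count_rate
    # seconds -> milliseconds, then divide by the number of iterations
    return elapsed_seconds * 1000.0 / iterations

# Hypothetical reading: 25000 ticks at 10000 ticks per second over 100 runs
print(average_ms(0, 25000, 10000, 100))
```

With 100 iterations the factor reduces to `ticks * 10 / count_rate`, so a different clock rate changes the result unless the rate is divided out explicitly.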

The Python code:

```python
import time
import numpy as np

def main(N):
    A = np.random.rand(N,N)
    B = np.random.rand(N,N)
    for i in range(100):
        C = np.dot(A,B)
    print C[0,0]

if __name__ == "__main__":
    N = 1024
    t0 = time.clock()
    main(N)
    t1 = time.clock()
    print "Time elapsed: " + str((t1-t0)*10) + " ms"
```

And, finally, the MATLAB snippet:

```matlab
N=1024;
A=rand(N,N); B=rand(N,N);
tic;
for i=1:100
    C=A*B;
end
t=toc;
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])
```

On my system, the results are as follows:

```
Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms
```

CPU use is indistinguishable in all three cases (a steady 47-49% on an i7-920D0 processor with HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly the same for arbitrary matrix sizes, except that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.

Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that at least for larger matrices Fortran would have no meaningful advantage in this case.

Answer:

You have two issues here:

- In Python, you time the random initialisation as well as the computation, which you don't do in Fortran and MATLAB.
- In Fortran, you measure CPU time, whereas in Python and MATLAB you measure elapsed time. Since you noticed that the CPU usage is around 47-49%, this might just account for the difference.
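To illustrate the second point, here is a minimal Python sketch of how a CPU-time clock and an elapsed-time clock can disagree over the same interval. `time.perf_counter` and `time.process_time` stand in here for an elapsed-time timer and for `cpu_time()` respectively; the helper `measure` is hypothetical. How the two relate for a multi-threaded library call depends on the platform and runtime, so the sketch only shows that they measure different things.

```python
import time

def measure(work):
    """Return (wall_seconds, cpu_seconds) spent inside work()."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # CPU time used by this process
    work()
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return wall_elapsed, cpu_elapsed

# Sleeping consumes wall-clock time but almost no CPU time
wall, cpu = measure(lambda: time.sleep(0.2))
print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```

Comparing a number taken from one kind of clock with a number taken from the other is therefore not a like-for-like comparison.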

Just fix these two things and retry. You might consider using `date_and_time()` rather than `cpu_time()` for that purpose.
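As a rough sketch of what a corrected harness could look like, the Python below builds the matrices before the clock starts and times only the multiplication loop with an elapsed-time clock. It is an illustration under stated assumptions, not the asker's code: `make_matrix`, `matmul` and `benchmark` are hypothetical names, and a naive pure-Python multiply stands in for `np.dot` so the example runs without NumPy (and is far slower than MKL).

```python
import random
import time

def make_matrix(n):
    """Build an n x n matrix of uniform random numbers (setup, not timed)."""
    return [[random.random() for _ in range(n)] for _ in range(n)]

def matmul(a, b):
    """Naive stand-in for np.dot(a, b), so the sketch needs no NumPy."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]

def benchmark(n, iterations, multiply=matmul):
    """Return (average elapsed ms per multiplication, first element of C)."""
    a = make_matrix(n)  # initialisation happens before the clock starts
    b = make_matrix(n)
    start = time.perf_counter()  # elapsed (wall-clock) time, not CPU time
    for _ in range(iterations):
        c = multiply(a, b)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations, c[0][0]

avg_ms, first = benchmark(n=32, iterations=10)
print("Average time spent: %.5f ms" % avg_ms)
print("First element of C: %.3f" % first)
```

The same structure applies to the real benchmark: create `A` and `B` with `np.random.rand` outside the timed region, and keep only the `np.dot` loop between the two clock readings.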
