
C++ vs Java? Why does the ICC generate slower code than VC?

Question:

The following is a simple loop in C++. The timer is using QueryPerformanceCounter() and is quite accurate. I found Java to take 60% of the time C++ takes and this can't be?! What am I doing wrong here? Even strict aliasing (which is not included in the code here) doesn't help at all...

long long var = 0;
std::array<int, 1024> arr;
int* arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

timer.Start();
for(int i = 0; i < 1024 * 1024 * 10; i++){
    for(int x = 0; x < 1024; x++){
        var += arrPtr[x];
    }
}
timer.Stop();

printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);

This C++ runs through in about 9.5 seconds. I am using the Intel Compiler 12.1 with host processor optimization (specifically for mine) and everything maxed. So this is Intel Compiler at its best! Auto-Parallelization funnily consumes 70% CPU instead of 25% but doesn't get the job done any faster ;)...

Now I use the following Java code for comparison:

long var = 0;
int[] arr = new int[1024];

for(int i = 0; i < 1024; i++) arr[i] = i;

for(int i = 0; i < 1024 * 1024; i++){
    for(int x = 0; x < 1024; x++){
        var += arr[x];
    }
}

long nanos = System.nanoTime();
for(int i = 0; i < 1024 * 1024 * 10; i++){
    for(int x = 0; x < 1024; x++){
        var += arr[x];
    }
}
nanos = (System.nanoTime() - nanos) / 1000;

System.out.print("Value: " + var + ", Time: " + nanos);

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (only uses one thread).

Is this a failure of the Intel Compiler or am I just too dumb again?

[EDIT]: OK, now here's the thing... This seems more like a bug in the Intel compiler ^^.

[Please note that I am running on the Intel Quadcore Q6600, which is rather old. And it might be that the Intel Compiler performs way better on recent CPUs, like Core i7]

Intel x86 (without vectorization): 3 seconds

MSVC x64: 5 seconds

Java x86/x64 (Oracle Java 7): 7 seconds

Intel x64 (with vectorization): 9.5 seconds

Intel x86 (with vectorization): 9.5 seconds

Intel x64 (without vectorization): 12 seconds

MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression

#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <vector>
#include <boost/function.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/typeof/typeof.hpp>

template<class TValue>
struct ArrayList
{
private:
    std::vector<TValue> m_Entries;
public:
    template<class TCallback>
    void Foreach(TCallback inCallback)
    {
        for(int i = 0, size = m_Entries.size(); i < size; i++)
        {
            inCallback(i);
        }
    }

    void Add(TValue inValue)
    {
        m_Entries.push_back(inValue);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    auto t = [&]() {};
    ArrayList<int> arr;
    int res = 0;

    for(int i = 0; i < 100; i++)
    {
        arr.Add(i);
    }

    long long freq, t1, t2;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);

    for(int i = 0; i < 1000 * 1000 * 10; i++)
    {
        arr.Foreach([&](int v) {
            res += i;
        });
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

    if(res == 4950)
        return -1;

    return 0;
}

Intel compiler shines again:

MSVC x86/x64: 12 milliseconds

Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies:

Okay, based on an answer in this thread: the Intel compiler is known to perform terribly on processors that are not "known" to it, such as AMD processors and maybe even outdated Intel processors like mine. (I knew that too, but it didn't occur to me that they might drop support for their own processors.) So if someone with a recent Intel processor could try this out, it would be nice ;).

Here is the x64 output of the Intel Compiler:

 std::array<int, 1024> arr;

int* arrPtr = arr.data();

QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

000000013F05101D lea rcx,[freq]

000000013F051022 call qword ptr [__imp_QueryPerformanceFrequency (13F052000h)]

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

000000013F051028 mov eax,4

000000013F05102D movd xmm0,eax

000000013F051031 xor eax,eax

000000013F051033 pshufd xmm1,xmm0,0

000000013F051038 movdqa xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]

000000013F051040 movdqa xmmword ptr arr[rax*4],xmm0

000000013F051046 paddd xmm0,xmm1

000000013F05104A movdqa xmmword ptr [rsp+rax*4+60h],xmm0

000000013F051050 paddd xmm0,xmm1

000000013F051054 movdqa xmmword ptr [rsp+rax*4+70h],xmm0

000000013F05105A paddd xmm0,xmm1

000000013F05105E movdqa xmmword ptr [rsp+rax*4+80h],xmm0

000000013F051067 add rax,10h

000000013F05106B paddd xmm0,xmm1

000000013F05106F cmp rax,400h

000000013F051075 jb wmain+40h (13F051040h)

QueryPerformanceCounter((LARGE_INTEGER*)&t1);

000000013F051077 lea rcx,[t1]

000000013F05107C call qword ptr [__imp_QueryPerformanceCounter (13F052008h)]

var += arrPtr[x];

000000013F051082 movdqa xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)]

for(int i = 0; i < 1024 * 1024 * 10; i++){

000000013F05108A xor eax,eax

var += arrPtr[x];

000000013F05108C movdqa xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)]

long long var = 0, freq, t1, t2;

000000013F051094 pxor xmm6,xmm6

for(int x = 0; x < 1024; x++){

000000013F051098 xor r8d,r8d

var += arrPtr[x];

000000013F05109B lea rdx,[arr]

000000013F0510A0 xor ecx,ecx

000000013F0510A2 movq xmm2,mmword ptr arr[rcx]

for(int x = 0; x < 1024; x++){

000000013F0510A8 add r8,8

var += arrPtr[x];

000000013F0510AC punpckldq xmm2,xmm2

for(int x = 0; x < 1024; x++){

000000013F0510B0 add rcx,20h

var += arrPtr[x];

000000013F0510B4 movdqa xmm3,xmm2

000000013F0510B8 pand xmm2,xmm0

000000013F0510BC movq xmm4,mmword ptr [rdx+8]

000000013F0510C1 psrad xmm3,1Fh

000000013F0510C6 punpckldq xmm4,xmm4

000000013F0510CA pand xmm3,xmm1

000000013F0510CE por xmm3,xmm2

000000013F0510D2 movdqa xmm5,xmm4

000000013F0510D6 movq xmm2,mmword ptr [rdx+10h]

000000013F0510DB psrad xmm5,1Fh

000000013F0510E0 punpckldq xmm2,xmm2

000000013F0510E4 pand xmm5,xmm1

000000013F0510E8 paddq xmm6,xmm3

000000013F0510EC pand xmm4,xmm0

000000013F0510F0 movdqa xmm3,xmm2

000000013F0510F4 por xmm5,xmm4

000000013F0510F8 psrad xmm3,1Fh

000000013F0510FD movq xmm4,mmword ptr [rdx+18h]

000000013F051102 pand xmm3,xmm1

000000013F051106 punpckldq xmm4,xmm4

000000013F05110A pand xmm2,xmm0

000000013F05110E por xmm3,xmm2

000000013F051112 movdqa xmm2,xmm4

000000013F051116 paddq xmm6,xmm5

000000013F05111A psrad xmm2,1Fh

000000013F05111F pand xmm4,xmm0

000000013F051123 pand xmm2,xmm1

for(int x = 0; x < 1024; x++){

000000013F051127 add rdx,20h

var += arrPtr[x];

000000013F05112B paddq xmm6,xmm3

000000013F05112F por xmm2,xmm4

for(int x = 0; x < 1024; x++){

000000013F051133 cmp r8,400h

var += arrPtr[x];

000000013F05113A paddq xmm6,xmm2

for(int x = 0; x < 1024; x++){

000000013F05113E jb wmain+0A2h (13F0510A2h)

for(int i = 0; i < 1024 * 1024 * 10; i++){

000000013F051144 inc eax

000000013F051146 cmp eax,0A00000h

000000013F05114B jb wmain+98h (13F051098h)

}

}

QueryPerformanceCounter((LARGE_INTEGER*)&t2);

000000013F051151 lea rcx,[t2]

000000013F051156 call qword ptr [__imp_QueryPerformanceCounter (13F052008h)]

printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);

000000013F05115C mov r9,qword ptr [t2]

long long var = 0, freq, t1, t2;

000000013F051161 movdqa xmm0,xmm6

printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);

000000013F051165 sub r9,qword ptr [t1]

000000013F05116A lea rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]

000000013F051171 imul rax,r9,3E8h

000000013F051178 cqo

000000013F05117A mov r10,qword ptr [freq]

000000013F05117F idiv rax,r10

long long var = 0, freq, t1, t2;

000000013F051182 psrldq xmm0,8

printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);

000000013F051187 mov rdx,rax

long long var = 0, freq, t1, t2;

000000013F05118A paddq xmm6,xmm0

000000013F05118E movd r8,xmm6

printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);

000000013F051193 call qword ptr [__imp_printf (13F052108h)]

And this one is the assembly of the MSVC x64 build:

int _tmain(int argc, _TCHAR* argv[])

{

000000013FF61000 push rbx

000000013FF61002 mov eax,1050h

000000013FF61007 call __chkstk (13FF61950h)

000000013FF6100C sub rsp,rax

000000013FF6100F mov rax,qword ptr [__security_cookie (13FF63000h)]

000000013FF61016 xor rax,rsp

000000013FF61019 mov qword ptr [rsp+1040h],rax

long long var = 0, freq, t1, t2;

std::array<int, 1024> arr;

int* arrPtr = arr.data();

QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

000000013FF61021 lea rcx,[rsp+28h]

000000013FF61026 xor ebx,ebx

000000013FF61028 call qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)]

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

000000013FF6102E xor r11d,r11d

000000013FF61031 lea rax,[rsp+40h]

000000013FF61036 mov dword ptr [rax],r11d

000000013FF61039 inc r11d

000000013FF6103C add rax,4

000000013FF61040 cmp r11d,400h

000000013FF61047 jl wmain+36h (13FF61036h)

QueryPerformanceCounter((LARGE_INTEGER*)&t1);

000000013FF61049 lea rcx,[rsp+20h]

000000013FF6104E call qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]

000000013FF61054 mov r11d,0A00000h

000000013FF6105A nop word ptr [rax+rax]

for(int i = 0; i < 1024 * 1024 * 10; i++){

for(int x = 0; x < 1024; x++){

000000013FF61060 xor edx,edx

000000013FF61062 xor r8d,r8d

000000013FF61065 lea rcx,[rsp+48h]

000000013FF6106A xor r9d,r9d

000000013FF6106D mov r10d,100h

000000013FF61073 nop word ptr [rax+rax]

var += arrPtr[x];

000000013FF61080 movsxd rax,dword ptr [rcx-8]

000000013FF61084 add rcx,10h

000000013FF61088 add rbx,rax

000000013FF6108B movsxd rax,dword ptr [rcx-14h]

000000013FF6108F add r9,rax

000000013FF61092 movsxd rax,dword ptr [rcx-10h]

000000013FF61096 add r8,rax

000000013FF61099 movsxd rax,dword ptr [rcx-0Ch]

000000013FF6109D add rdx,rax

000000013FF610A0 dec r10

000000013FF610A3 jne wmain+80h (13FF61080h)

for(int i = 0; i < 1024 * 1024 * 10; i++){

for(int x = 0; x < 1024; x++){

000000013FF610A5 lea rax,[rdx+r8]

000000013FF610A9 add rax,r9

000000013FF610AC add rbx,rax

000000013FF610AF dec r11

000000013FF610B2 jne wmain+60h (13FF61060h)

}

}

QueryPerformanceCounter((LARGE_INTEGER*)&t2);

000000013FF610B4 lea rcx,[rsp+30h]

000000013FF610B9 call qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]

printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);

000000013FF610BF mov rax,qword ptr [rsp+30h]

000000013FF610C4 lea rcx,[string "Unrestricted: %lld ms, Value = %"... (13FF621B0h)]

000000013FF610CB sub rax,qword ptr [rsp+20h]

000000013FF610D0 mov r8,rbx

000000013FF610D3 imul rax,rax,3E8h

000000013FF610DA cqo

000000013FF610DC idiv rax,qword ptr [rsp+28h]

000000013FF610E1 mov rdx,rax

000000013FF610E4 call qword ptr [__imp_printf (13FF62138h)]

return 0;

000000013FF610EA xor eax,eax

Intel Compiler configured without Vectorization, 64-Bit, highest optimizations (this is surprisingly slow, 12 seconds):

000000013FC0102F lea rcx,[freq]

double var = 0; long long freq, t1, t2;

000000013FC01034 xorps xmm6,xmm6

std::array<double, 1024> arr;

double* arrPtr = arr.data();

QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

000000013FC01037 call qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)]

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

000000013FC0103D mov eax,2

000000013FC01042 mov rdx,100000000h

000000013FC0104C movd xmm0,eax

000000013FC01050 xor eax,eax

000000013FC01052 pshufd xmm1,xmm0,0

000000013FC01057 movd xmm0,rdx

000000013FC0105C nop dword ptr [rax]

000000013FC01060 cvtdq2pd xmm2,xmm0

000000013FC01064 paddd xmm0,xmm1

000000013FC01068 cvtdq2pd xmm3,xmm0

000000013FC0106C paddd xmm0,xmm1

000000013FC01070 cvtdq2pd xmm4,xmm0

000000013FC01074 paddd xmm0,xmm1

000000013FC01078 cvtdq2pd xmm5,xmm0

000000013FC0107C movaps xmmword ptr arr[rax*8],xmm2

000000013FC01081 paddd xmm0,xmm1

000000013FC01085 movaps xmmword ptr [rsp+rax*8+60h],xmm3

000000013FC0108A movaps xmmword ptr [rsp+rax*8+70h],xmm4

000000013FC0108F movaps xmmword ptr [rsp+rax*8+80h],xmm5

000000013FC01097 add rax,8

000000013FC0109B cmp rax,400h

000000013FC010A1 jb wmain+60h (13FC01060h)

QueryPerformanceCounter((LARGE_INTEGER*)&t1);

000000013FC010A3 lea rcx,[t1]

000000013FC010A8 call qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]

for(int i = 0; i < 1024 * 1024 * 10; i++){

000000013FC010AE xor eax,eax

for(int x = 0; x < 1024; x++){

000000013FC010B0 xor edx,edx

var += arrPtr[x];

000000013FC010B2 lea ecx,[rdx+rdx]

for(int x = 0; x < 1024; x++){

000000013FC010B5 inc edx

for(int x = 0; x < 1024; x++){

000000013FC010B7 cmp edx,200h

var += arrPtr[x];

000000013FC010BD addsd xmm6,mmword ptr arr[rcx*8]

000000013FC010C3 addsd xmm6,mmword ptr [rsp+rcx*8+58h]

for(int x = 0; x < 1024; x++){

000000013FC010C9 jb wmain+0B2h (13FC010B2h)

for(int i = 0; i < 1024 * 1024 * 10; i++){

000000013FC010CB inc eax

000000013FC010CD cmp eax,0A00000h

000000013FC010D2 jb wmain+0B0h (13FC010B0h)

}

}

QueryPerformanceCounter((LARGE_INTEGER*)&t2);

000000013FC010D4 lea rcx,[t2]

000000013FC010D9 call qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]

Intel Compiler without vectorization, 32-Bit and highest optimization (this one clearly is the winner now, runs in about 3 seconds and the assembly looks much better):

00B81088 lea eax,[t1]

00B8108C push eax

00B8108D call dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]

00B81093 xor eax,eax

00B81095 pxor xmm0,xmm0

00B81099 movaps xmm1,xmm0

for(int x = 0; x < 1024; x++){

00B8109C xor edx,edx

var += arrPtr[x];

00B8109E addpd xmm0,xmmword ptr arr[edx*8]

00B810A4 addpd xmm1,xmmword ptr [esp+edx*8+40h]

00B810AA addpd xmm0,xmmword ptr [esp+edx*8+50h]

00B810B0 addpd xmm1,xmmword ptr [esp+edx*8+60h]

for(int x = 0; x < 1024; x++){

00B810B6 add edx,8

00B810B9 cmp edx,400h

00B810BF jb wmain+9Eh (0B8109Eh)

for(int i = 0; i < 1024 * 1024 * 10; i++){

00B810C1 inc eax

00B810C2 cmp eax,0A00000h

00B810C7 jb wmain+9Ch (0B8109Ch)

double var = 0; long long freq, t1, t2;

00B810C9 addpd xmm0,xmm1

}

}

QueryPerformanceCounter((LARGE_INTEGER*)&t2);

00B810CD lea eax,[t2]

00B810D1 push eax

00B810D2 movaps xmmword ptr [esp+4],xmm0

00B810D7 call dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]

00B810DD movaps xmm0,xmmword ptr [esp]

Answer:

tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64:

Here's the critical loop:

[email protected]:
movsxd  rax, DWORD PTR [rdx-4]
movsxd  rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd  rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd  rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT [email protected]

What you see here is the standard loop unrolling by the compiler. MSVC is unrolling to 4 iterations, and splitting the var variable across four registers: r10, rbx, r9, and r8. Then at the end of the loop, these 4 registers are summed up back together.

Here's where the 4 sums are recombined:

lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT [email protected]

Note that MSVC currently does not do automatic vectorization.


Now let's look at part of your ICC output:

000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
000000013F0510A8  add         r8,8  
000000013F0510AC  punpckldq   xmm2,xmm2  
000000013F0510B0  add         rcx,20h  
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  

...

What you're seeing here is an attempt by ICC to vectorize this loop. This is done in a similar manner as what MSVC did (splitting into multiple sums), but using SSE registers instead and with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one-by-one, we can see how ICC tries to vectorize it:

//  Load two ints using a 64-bit load.  {x, y, 0, 0}
movq        xmm2,mmword ptr arr[rcx]  

//  Shuffle the data into this form.
punpckldq   xmm2,xmm2           xmm2 = {x, x, y, y}
movdqa      xmm3,xmm2           xmm3 = {x, x, y, y}

//  Mask out index 1 and 3.
pand        xmm2,xmm0           xmm2 = {x, 0, y, 0}

//  Arithmetic right-shift to copy sign-bit across the word.
psrad       xmm3,1Fh            xmm3 = {sign(x), sign(x), sign(y), sign(y)}

//  Mask out index 0 and 2.
pand        xmm3,xmm1           xmm3 = {0, sign(x), 0, sign(y)}

//  Combine to get sign-extended values.
por         xmm3,xmm2           xmm3 = {x, sign(x), y, sign(y)}
                                xmm3 = {x, y}

//  Add to accumulator...
paddq       xmm6,xmm3

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.

But the point is:

The Intel compiler is trying to vectorize the loop. But the overhead added seems to outweigh the benefit of vectorizing it in the first place. Hence why it's slower.


EDIT : Update with OP's results on:

  • ICC x64 no vectorization
  • ICC x86 with vectorization

You changed the data type to double, so now it's floating-point. There are no more of those ugly sign-fill shifts that were plaguing the integer version.

But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

Not much here - standard vectorization + 4x loop-unrolling.

ICC x64 with no vectorization:

000000013FC010B2  lea         ecx,[rdx+rdx]  
000000013FC010B5  inc         edx  
000000013FC010B7  cmp         edx,200h  
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

No vectorization + only 2x loop-unrolling.

All things equal, disabling vectorization will hurt performance in this floating-point case.

Answer:

The example is simple enough that the different languages should not make a difference, and stupid enough not to prove anything. The loop could be optimised away by a compiler into a simple assignment, or left running for the whole number of iterations, or some of the iterations might be unrolled... I am not sure why you decided to write that test program, but it does not test anything regarding the languages, as once the logic optimisations are performed it all boils down to exactly the same assembly.

Also, regarding the performance of the Intel compiler, it will greatly depend on the exact hardware and the compiler version. The compiler generates different versions of the code, and has a tendency to generate horrible code for AMD processors. Even for Intel, if it does not recognise the specific processor it falls back into a safe slow mode.

Answer:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

You've got some data in one hand, and an assumption (C++ is always faster than Java) in the other. Why ask for people to justify your assumption when the data tells you otherwise?

If you wish to obtain assembly from the JVM in order to compare what's being run then the commandline option is '-XX:+PrintOptoAssembly', but you'll need to download a debug jvm in order to do so. Looking at the assembly would at least tell you why one is faster than the other.

Answer:

Just for the record, I ran both programs on my box (x86_64 Linux): the C++ with std::array, with a plain int[1024] and, for completeness, also with long instead of int. Java (OpenJDK 1.6) clocked in at 3.8s, C++ (int) at 3.37s, and C++ (long) at 3.9s. My compiler was g++ 4.5.1. Maybe it's just Intel's compiler that's not as good as thought.

Answer:

I think the Java runtime uses JIT (just-in-time) compilation, or some more recent technology, to approach the speed of native compilers. It could infer that your array doesn't change and thus apply constant folding to the inner loop...

Answer:

I suspect the culprit is simple loop unrolling. Replace

var += arrPtr[x];

with

var += arrPtr[x++];
var += arrPtr[x++];
var += arrPtr[x++];
var += arrPtr[x];

and observe how much faster the C++ version runs.

Answer:

I see you are running the following loop

for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

twice in the Java code but only once in the C++ code. This might act as a cache warm-up that makes the Java code end up executing faster than the C++.
