I want to sum all the 32-bit elements in a 256-bit register, but there isn't an intrinsic for this, or if there is, I couldn't find one that does what I want. So I did something like this to compute the sum, but this method generates many assembly instructions when compiled.
My method:
vmovaps %ymm0, (%rsp)
vmovss (%rsp), %xmm0
vaddss 4(%rsp), %xmm0, %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vaddss 12(%rsp), %xmm0, %xmm0
vaddss 16(%rsp), %xmm0, %xmm0
vaddss 20(%rsp), %xmm0, %xmm0
vaddss 24(%rsp), %xmm0, %xmm0
vaddss 28(%rsp), %xmm0, %xmm0
vmovss %xmm0, c_result(%r8,%rsi)
So the question is: how can I sum all the elements faster and more cleanly, and store the result to a 32-bit array in memory? I tried hadd, but it didn't improve performance, because I still have the memory problem of saving the results, and hadd's latency and throughput cost even more time.
You might start with the code any optimizing compiler generates for a vectorized sum reduction, with or without accumulate(), a Cilk Plus reducer, or an omp simd reduction. No doubt there is a step adding the two 128-bit sub-registers, one with hadd, and so on.
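For reference, here is a minimal sketch of the kind of reduction described above, written with intrinsics. It assumes the elements are single-precision floats (the vaddss in the question suggests so) and that you are on GCC or Clang. The names hsum256_ps, store_hsum8 and AVX_TARGET are made up for illustration. It adds the two 128-bit halves first, then narrows with shuffles rather than hadd.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets this compile without
   -mavx; on other compilers drop it and enable AVX on the command line. */
#define AVX_TARGET __attribute__((target("avx")))

/* Hypothetical helper: horizontal sum of the 8 floats in v. */
AVX_TARGET static inline float hsum256_ps(__m256 v)
{
    __m128 lo   = _mm256_castps256_ps128(v);    /* elements 0..3 (no instruction) */
    __m128 hi   = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    __m128 sum4 = _mm_add_ps(lo, hi);           /* 4 partial sums */
    __m128 shuf = _mm_movehdup_ps(sum4);        /* [s1, s1, s3, s3] */
    __m128 sum2 = _mm_add_ps(sum4, shuf);       /* s0+s1 in lane 0, s2+s3 in lane 2 */
    shuf        = _mm_movehl_ps(shuf, sum2);    /* bring lane 2 down to lane 0 */
    __m128 sum1 = _mm_add_ss(sum2, shuf);       /* total in lane 0 */
    return _mm_cvtss_f32(sum1);
}

/* Hypothetical wrapper: load 8 floats from src (unaligned is fine) and
   store their sum to *dst, e.g. one slot of a result array. */
AVX_TARGET void store_hsum8(const float *src, float *dst)
{
    *dst = hsum256_ps(_mm256_loadu_ps(src));
}
```

For example, calling store_hsum8 on an array holding 1.0f through 8.0f writes 36.0f to the destination. Note that the additions happen in a different order than in the scalar vaddss chain, so results can differ in the last bits for general floating-point input. If you need many of these sums, it is usually cheaper to keep accumulating vertically with vaddps and do this horizontal step once at the end.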