c - How to sum the elements of a vector using AVX?

Question:

I want to sum all the 32-bit elements in a 256-bit register, but there doesn't seem to be an intrinsic for this, or if there is one, I couldn't find one that does what I want. So I did something like the following to compute the sum, but this method generates many assembly instructions when compiled.

My method :

_mm256_store_ps(&temp4[0], sum0_i);

c_result[i][j]= temp4[0]+temp4[1]+temp4[2]+temp4[3]+temp4[4]+temp4[5]+temp4[6]+temp4[7];

Assembly output:

vmovaps %ymm0, (%rsp)
vmovss (%rsp), %xmm0
vaddss 4(%rsp), %xmm0, %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vaddss 12(%rsp), %xmm0, %xmm0
vaddss 16(%rsp), %xmm0, %xmm0
vaddss 20(%rsp), %xmm0, %xmm0
vaddss 24(%rsp), %xmm0, %xmm0
vaddss 28(%rsp), %xmm0, %xmm0
vmovss %xmm0, c_result(%r8,%rsi)

So the question is: how can I sum all the elements faster and in a cleaner way, and store the result to a 32-bit array in memory? I tried hadd, but it didn't improve performance, because I still have the problem of storing the values to memory, and the latency and throughput of hadd cost more time than the vaddss chain.

Answer:

You might start with the code any optimizing compiler generates for a vectorized sum reduction, with or without accumulate(), a Cilk Plus reducer, or an omp simd reduction. No doubt there is a step that adds the two 128-bit halves of the register, another that uses hadd, and so on.
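As a rough illustration of the kind of reduction described above, here is a minimal sketch written by hand with intrinsics. The function names hsum256_ps and sum8_floats are made up for this example and do not come from the question. It adds the two 128-bit halves first, as the answer suggests, but then finishes with shuffles instead of hadd; that substitution is my own choice, since the asker reports that hadd was not faster. The target("avx") attribute is GCC/Clang specific and is only there so the file builds without -mavx; drop it if the whole file is compiled with AVX enabled.

```c
#include <immintrin.h>

/* Horizontal sum of the eight floats held in one 256-bit register.
   hsum256_ps is a hypothetical helper name, not a library function. */
__attribute__((target("avx")))
static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);           /* four partial sums */
    __m128 sh = _mm_movehl_ps(s, s);          /* bring lanes 2,3 down to 0,1 */
    s  = _mm_add_ps(s, sh);                   /* two partial sums in lanes 0,1 */
    sh = _mm_shuffle_ps(s, s, 0x55);          /* copy lane 1 into every lane */
    s  = _mm_add_ss(s, sh);                   /* total in lane 0 */
    return _mm_cvtss_f32(s);
}

/* Convenience wrapper (also hypothetical): sum eight floats from memory.
   Uses an unaligned load, so p needs no particular alignment. */
__attribute__((target("avx")))
float sum8_floats(const float *p)
{
    return hsum256_ps(_mm256_loadu_ps(p));
}
```

With this in place, the store in the question could be written as c_result[i][j] = hsum256_ps(sum0_i); with no temporary array. Note that the additions happen in a different order than in the scalar chain, so the result can differ in the last bits for inputs that are not exactly representable.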
