const int STRIDE=2,SIZE=8192;
#pragma vector aligned
The compiler uses XMM registers here. The access has stride 2, and I want the compiler to ignore that: do a regular (contiguous) load from memory and then mask out alternate elements, so that I would be using 50% of each SIMD register. I need intrinsics that can load a register and then mask it bitwise before storing it back to memory.
P.S.: I have never done any assembly coding before.
A masked store with a mask value as
Without AVX you can't do a masked load (only a masked store). The easiest alternative is to do a normal load and then mask the register yourself (e.g. using intrinsics).
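A minimal sketch of the "load, then mask it yourself" idea with SSE2 intrinsics. The function name `scale_even` and the multiply-by-a-factor operation are assumptions made for illustration; the original loop body is not shown in the question. The result is blended with the original register contents so that the odd-indexed elements are written back unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: multiply every even-indexed element of u[0..n)
   by factor and leave the odd-indexed elements untouched. */
static void scale_even(double *u, size_t n, double factor)
{
    assert(u != NULL);

    /* All ones in the low 64-bit lane, zeros in the high lane. */
    const __m128d keep_lo = _mm_castsi128_pd(_mm_set_epi64x(0, -1));
    const __m128d f = _mm_set1_pd(factor);

    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d v = _mm_loadu_pd(u + i);        /* regular 128-bit load  */
        __m128d r = _mm_mul_pd(v, f);           /* compute on both lanes */
        /* Take the low lane from r and the high lane from the original v. */
        __m128d out = _mm_or_pd(_mm_and_pd(keep_lo, r),
                                _mm_andnot_pd(keep_lo, v));
        _mm_storeu_pd(u + i, out);              /* regular 128-bit store */
    }
    if (n & 1)                                  /* odd n: last index is even */
        u[n - 1] *= factor;
}
```

If the array is known to be 16-byte aligned (as `#pragma vector aligned` asserts), `_mm_load_pd` and `_mm_store_pd` can replace the unaligned variants.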
A potentially better alternative would be to change your array to "double u[STRIDE][SIZE];" so that you don't need to mask anything and don't end up with half an XMM register wasted/masked.
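To illustrate the layout change, here is a sketch under the assumption that the work is a simple scaling of one of the interleaved streams; `scale_stream` is a hypothetical name. With `u[STRIDE][SIZE]`, each stream is contiguous, so both lanes of every XMM register carry useful data and no mask is needed. An `enum` is used for the constants because a `const int` is not a constant expression for array sizes in C.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

enum { STRIDE = 2, SIZE = 8192 };

/* Hypothetical example: scale one stream s of u[STRIDE][SIZE].
   The elements of a stream are adjacent in memory, so plain
   128-bit loads and stores touch only that stream. */
static void scale_stream(double u[STRIDE][SIZE], int s, double factor)
{
    assert(s >= 0 && s < STRIDE);

    const __m128d f = _mm_set1_pd(factor);
    for (size_t i = 0; i < SIZE; i += 2) {      /* SIZE is even */
        __m128d v = _mm_loadu_pd(&u[s][i]);
        _mm_storeu_pd(&u[s][i], _mm_mul_pd(v, f));
    }
}
```

The trade-off is that code which previously read `u[i*STRIDE + s]` has to be changed to read `u[s][i]`.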
Without AVX, half a SIMD register is only one double anyway, so there seems little wrong with regular 64-bit stores.
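As a sketch of the "regular 64-bit store" option: `_mm_store_sd` writes only the low double of an XMM register, so the neighbouring odd-indexed element is never touched and no mask is required. The function name and the scaling operation are again assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: scale the even-indexed elements of u[0..n),
   writing back only 64 bits per iteration. */
static void scale_even_store64(double *u, size_t n, double factor)
{
    assert(u != NULL);

    const __m128d f = _mm_set1_pd(factor);
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d r = _mm_mul_pd(_mm_loadu_pd(u + i), f);
        _mm_store_sd(u + i, r);   /* stores the low lane only */
    }
    if (n & 1)                    /* odd n: last index is even */
        u[n - 1] *= factor;
}
```

The high lane can be stored on its own with `_mm_storeh_pd` if the odd-indexed elements are the ones of interest.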
If you want to use masked stores (MASKMOVDQU/MASKMOVQ), note that they carry a non-temporal hint: like the non-temporal stores such as MOVNTPS, they bypass the cache and write directly to DRAM. This may or may not be what you want. If the data fits in cache and you plan to read it again soon, it is likely better not to use them.
Certain AMD processors can do a 64-bit non-temporal store from an XMM register using MOVNTSD; this may simplify things slightly compared to MASKMOVDQU.
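For completeness, a sketch of what a MASKMOVDQU store looks like through the `_mm_maskmoveu_si128` intrinsic. The mask is per byte: a byte is written only if the top bit of the corresponding mask byte is set. The helper name `masked_store_low` is hypothetical, and the `_mm_sfence` call is a conservative choice on my part to order the non-temporal store before later stores.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: write only the low double of v to dst[0],
   leaving dst[1] untouched, using a byte-masked non-temporal store. */
static void masked_store_low(double *dst, __m128d v)
{
    assert(dst != NULL);

    /* Low 8 mask bytes = 0xFF (write), high 8 mask bytes = 0x00 (skip). */
    const __m128i byte_mask = _mm_set_epi64x(0, -1);
    _mm_maskmoveu_si128(_mm_castpd_si128(v), byte_mask, (char *)dst);
    _mm_sfence();  /* order the non-temporal store before later stores */
}
```

Because the store bypasses the cache, this variant is mainly of interest for data that will not be read back soon.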