I am writing code using the C intrinsics for Intel AVX instructions. If I have a packed double vector (a __m256d), what is the most efficient way (i.e. the fewest operations) to store each of its four elements at a different place in memory, so that they are no longer packed? Pseudocode:
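__m256d src;      /* src[i] below means the i-th double in the vector */
double *dst;
int dst_dist;     /* distance between destinations, in doubles */
dst[0] = src[0];
dst[dst_dist] = src[1];
dst[2 * dst_dist] = src[2];
dst[3 * dst_dist] = src[3];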
Using SSE, I could do this with the __m128 types, using the intrinsics _mm_storel_pi and _mm_storeh_pi. I could not find anything similar for AVX that lets me store individual 64-bit elements to memory. Does such an intrinsic exist?
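To make the SSE approach concrete, here is a minimal sketch for the double-precision case. It uses _mm_storel_pd and _mm_storeh_pd, the SSE2 counterparts for __m128d of the _mm_storel_pi and _mm_storeh_pi intrinsics named above. The helper name scatter2_pd and its parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_storel_pd, _mm_storeh_pd */

/* Hypothetical helper (name and signature are assumptions):
 * stores the two doubles of v to dst[0] and dst[dst_dist]. */
static void scatter2_pd(__m128d v, double *dst, int dst_dist)
{
    assert(dst);                       /* destination must be a valid pointer */
    _mm_storel_pd(dst, v);             /* low element  (index 0) -> dst[0] */
    _mm_storeh_pd(dst + dst_dist, v);  /* high element (index 1) -> dst[dst_dist] */
}
```

Calling scatter2_pd(_mm_set_pd(2.0, 1.0), out, 3) would write 1.0 to out[0] and 2.0 to out[3], since _mm_set_pd takes the high element first. One possible way to carry this over to a __m256d, offered only as an assumption and not as an established answer, is to split the vector into two 128-bit halves with _mm256_extractf128_pd and apply the same pair of stores to each half.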