vmovupd is just as fast as vmovapd when the data is actually aligned at runtime. The only difference is that vmovapd faults when the data is not aligned. (See the optimization links in the x86 tag wiki, especially Agner Fog's optimization and microarchitecture PDFs and Intel's optimization manual.)
You would only have a problem if the compiler used multiple instructions instead of one for each load or store.
Since you are already using Intel intrinsics for _mm256_mul_pd, use the load/store intrinsics as well, not memcpy! See the sse tag wiki for intrinsics guides, tutorials, and more.
    // Hoist this outside the loop,
    // mostly for readability; should optimize fine either way.
    // Probably only aliasing-safe to use these pointers with _mm256_load/store (which alias anything)
    // unless C allows `double*` to alias `double complex*`
    const double *flat_filt = (const double*)filter->data;
          double *flat_data = (double*)data->data;

    for (...) {
        //union m256d a[2];
        //union m256d b[2];
        //union m256d r[2];

        //memcpy(a, &(  data->data[i*stride+j]), 2*sizeof(*a));
        __m256d a0 = _mm256_load_pd(0 + &flat_data[i*stride+j]);
        __m256d a1 = _mm256_load_pd(4 + &flat_data[i*stride+j]);

        //memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
        __m256d b0 = _mm256_load_pd(0 + &flat_filt[i*stride+j]);
        __m256d b1 = _mm256_load_pd(4 + &flat_filt[i*stride+j]);
        // +4 doubles = +32 bytes = 1 YMM vector = +2 double complex

        __m256d r0 = _mm256_mul_pd(a0, b0);
        __m256d r1 = _mm256_mul_pd(a1, b1);

        // memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
        _mm256_store_pd(0 + &flat_data[i*stride+j], r0);
        _mm256_store_pd(4 + &flat_data[i*stride+j], r1);
    }
If you want unaligned loads/stores, use _mm256_loadu_pd / _mm256_storeu_pd.
Or you could just cast your double complex* to __m256d* and dereference it directly. In GCC, that is equivalent to an aligned load intrinsic. But the usual convention is to use the load/store intrinsics.
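As a rough sketch of the unaligned variant (not the code from the question): the helper name mul_inplace_unaligned and its flat double* interface are made up for illustration, and the target("avx") attribute is a GNU C extension used here only so the snippet compiles without -mavx. The caller is still assumed to check that the CPU supports AVX.

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical helper: element-wise dst[i] *= src[i] for n doubles.
// Uses the unaligned load/store intrinsics, so dst and src may have any
// alignment. When they happen to be 32-byte aligned, nothing is lost.
// target("avx") is a GNU C attribute (assumption: gcc or clang).
__attribute__((target("avx")))
void mul_inplace_unaligned(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {            // 4 doubles per YMM vector
        __m256d a = _mm256_loadu_pd(dst + i);
        __m256d b = _mm256_loadu_pd(src + i);
        _mm256_storeu_pd(dst + i, _mm256_mul_pd(a, b));
    }
    for (; i < n; i++)                      // scalar tail for leftover elements
        dst[i] *= src[i];
}
```

Swapping loadu/storeu for _mm256_load_pd / _mm256_store_pd in this loop would only be valid if both pointers were known to be 32-byte aligned.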
To answer the title question, you can help gcc auto-vectorize by telling it when a pointer is guaranteed to be aligned:
data = __builtin_assume_aligned(data, 64);
In C++ you need to cast the result, but in C, void* converts freely to other pointer types.
See "How to tell GCC that a pointer argument is always double-word-aligned?" and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html .
This is of course specific to the GNU C / C++ dialects (clang, gcc, icc); it is not portable to MSVC or other compilers that do not support GNU extensions.
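For illustration, a minimal sketch of how the builtin might be used in a plain scalar loop that the compiler is then free to vectorize. The function scale_aligned and the choice of 32-byte alignment (one YMM vector) are assumptions for this example, not something taken from the question; the assert is only a debug-build sanity check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical function: multiply n doubles in place by factor.
// __builtin_assume_aligned (GNU C) returns its first argument and promises
// the compiler that the address is a multiple of 32. If that promise is
// false, the behaviour is undefined, hence the debug check first.
void scale_aligned(double *data, size_t n, double factor)
{
    assert(((uintptr_t)data % 32) == 0);
    double *p = __builtin_assume_aligned(data, 32);
    for (size_t i = 0; i < n; i++)
        p[i] *= factor;
}
```

The builtin only passes information to the optimizer; it does not align anything itself, so the buffer still has to come from something like aligned_alloc or an aligned declaration.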
From the question: "So far, I have used roughly the same construction for all operations on the array."
Looping over an array multiple times is usually worse than doing as much as possible in one pass. Even if it all stays hot in L1D, just the extra load and store instructions are a bottleneck compared to doing more while your data is in registers.
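A toy illustration of the difference, with made-up function names and a made-up pair of operations (scale, then add an offset) standing in for whatever the real passes do:

```c
#include <stddef.h>

// Two passes: every element is loaded and stored twice.
void scale_then_offset_two_pass(double *x, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
    for (size_t i = 0; i < n; i++)
        x[i] += b;
}

// One pass: each element is loaded once, both operations happen while the
// value is in a register, and it is stored once.
void scale_then_offset_fused(double *x, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * a + b;
}
```

Both versions compute the same thing, but the fused loop issues half as many loads and stores. One caveat: depending on compiler flags, the fused expression may be contracted into an FMA, which can round slightly differently from the two-pass version.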