How do I tell gcc that the data the pointer points to will always be aligned?

In my program (written in plain C), I have a structure that stores data ready for transformation by a vectorized (AVX-only) radix-2 2D fast Fourier transform. The structure is as follows:

    struct data {
        double complex *data;
        unsigned int width;
        unsigned int height;
        unsigned int stride;
    };

Now I need to load data from memory as quickly as possible. As far as I know, there are unaligned and aligned loads into ymm registers (the vmovupd and vmovapd instructions), and I would like the program to use the aligned version whenever possible.

So far, I have used roughly the same construction for all operations on the array. This example is from a part of the program where the data and a filter have already been converted to the frequency domain, and the filter is applied to the data by element-wise multiplication.

    union m256d { __m256d reg; double d[4]; };

    struct data *data, *filter;
    /* Load data and filter here; both have the same width, height and stride. */

    unsigned int stride = data->stride;
    for (unsigned int i = 0; i < data->height; i++) {
        for (unsigned int j = 0; j < data->width; j += 4) {
            union m256d a[2];
            union m256d b[2];
            union m256d r[2];
            memcpy(a, &( data->data[i*stride+j]), 2*sizeof(*a));
            memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
            r[0].reg = _mm256_mul_pd(a[0].reg, b[0].reg);
            r[1].reg = _mm256_mul_pd(a[1].reg, b[1].reg);
            memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
        }
    }

As expected, the memcpy calls are optimized away. However, gcc translates each memcpy into either two vmovupd instructions, or a bunch of movq instructions that copy the data to a stack slot with guaranteed alignment followed by two vmovapd instructions that load it into ymm registers. Which form it picks depends on whether a prototype for memcpy is in scope (if it is, gcc uses movq and vmovapd).

I can make sure that the data in memory is aligned, but I'm not sure how to tell gcc that it can simply use vmovapd instructions to load the data from memory directly into ymm registers. I strongly suspect that gcc does not know that the data pointed to by &(data->data[i*stride+j]) is always aligned.
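For the "I can make sure the data is aligned" part, here is a minimal sketch under stated assumptions (the struct mirrors the one above; alloc_aligned is a hypothetical helper, not from the original program) that allocates the buffer on a 32-byte boundary with C11 aligned_alloc and pads the stride to a multiple of 4 complex elements, so every row start stays vector-aligned for the j += 4 inner loop:

```c
#include <complex.h>
#include <stdlib.h>

struct data {
    double complex *data;
    unsigned int width;
    unsigned int height;
    unsigned int stride;
};

/* Hypothetical helper: 32 bytes = one YMM register = 2 double complex.
 * Rounding the stride up to 4 elements keeps every row start aligned
 * and the j += 4 inner loop free of a remainder case. */
static int alloc_aligned(struct data *d, unsigned int w, unsigned int h)
{
    d->width  = w;
    d->height = h;
    d->stride = (w + 3u) & ~3u;  /* round up to a multiple of 4 */
    /* aligned_alloc (C11) needs the size to be a multiple of the
     * alignment; stride*16 is a multiple of 64, so that holds. */
    d->data = aligned_alloc(32, (size_t)d->stride * h * sizeof(double complex));
    return d->data != NULL;
}
```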

Is it possible to tell gcc that the data the pointer points to will always be aligned?

2 answers

vmovupd is just as fast as vmovapd when the data actually is aligned at run time. The only difference is that vmovapd faults when the data is not aligned. (See the optimization links in the x86 tag wiki, especially Agner Fog's optimization and microarchitecture PDFs and Intel's optimization manual.)

You only have a problem if the compiler uses multiple instructions instead of one.


Since you are already using Intel intrinsics for _mm256_mul_pd, use the load/store intrinsics, not memcpy! See the sse tag wiki for tutorials on the intrinsics API, etc.

    // Hoist this outside the loop, mostly for readability;
    // it should optimize fine either way.
    // Probably only aliasing-safe to use these pointers with
    // _mm256_load/store (which can alias anything),
    // unless C allows `double*` to alias `double complex*`.
    const double *flat_filt = (const double*)filter->data;
    double       *flat_data = (double*)data->data;

    for (...) {
        //union m256d a[2];
        //union m256d b[2];
        //union m256d r[2];

        // Note the *2: the flat pointers count doubles, while
        // stride and j count double complex elements.
        //memcpy(a, &( data->data[i*stride+j]), 2*sizeof(*a));
        __m256d a0 = _mm256_load_pd(0 + &flat_data[2*(i*stride+j)]);
        __m256d a1 = _mm256_load_pd(4 + &flat_data[2*(i*stride+j)]);
        //memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
        __m256d b0 = _mm256_load_pd(0 + &flat_filt[2*(i*stride+j)]);
        __m256d b1 = _mm256_load_pd(4 + &flat_filt[2*(i*stride+j)]);
        // +4 doubles = +32 bytes = 1 YMM vector = +2 double complex

        __m256d r0 = _mm256_mul_pd(a0, b0);
        __m256d r1 = _mm256_mul_pd(a1, b1);

        // memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
        _mm256_store_pd(0 + &flat_data[2*(i*stride+j)], r0);
        _mm256_store_pd(4 + &flat_data[2*(i*stride+j)], r1);
    }

If you ever want unaligned loads/stores, use _mm256_loadu_pd / _mm256_storeu_pd.

Or you could just cast the double complex* to __m256d* and dereference it directly. In GCC, this is equivalent to an aligned-load intrinsic. But the usual convention is to use the load/store intrinsics.
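A minimal sketch of that cast-and-dereference variant (the function name and scaling operation are illustrative, not from the question; the target("avx") attribute is only there so the snippet compiles without passing -mavx for the whole file):

```c
#include <complex.h>
#include <immintrin.h>

/* Scale n double complex values in place; p must be 32-byte aligned
 * and n a multiple of 2 (one YMM vector holds 2 double complex). */
__attribute__((target("avx")))
static void scale_inplace(double complex *p, double k, unsigned int n)
{
    __m256d kv = _mm256_set1_pd(k);
    __m256d *v = (__m256d *)p;           /* GCC: __m256d may alias anything */
    for (unsigned int i = 0; i < n / 2; i++)
        v[i] = _mm256_mul_pd(v[i], kv);  /* dereference = aligned load/store */
}
```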


To answer the title question: you can help gcc auto-vectorize by telling it when a pointer is aligned:

 data = __builtin_assume_aligned(data, 64); 

In C++ you need to cast the result, but in C a void* converts freely to any object pointer type.

See How do I tell GCC that a pointer argument is always double-aligned? and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html .

This is, of course, specific to the GNU C/C++ dialects (clang, gcc, icc) and not portable to MSVC or other compilers that don't support GNU extensions.
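A minimal sketch of that pattern in context (function and variable names are illustrative), letting gcc's auto-vectorizer emit aligned loads with no scalar peeling prologue:

```c
/* GNU C only: after __builtin_assume_aligned the compiler may assume
 * a and b are 32-byte aligned and vectorize the loop with aligned
 * loads/stores directly. */
static void add_arrays(double *restrict a, const double *restrict b,
                       unsigned int n)
{
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);
    for (unsigned int i = 0; i < n; i++)
        a[i] += b[i];
}
```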


So far, I have used roughly the same construction for all operations on the array.

Looping over an array multiple times is usually worse than doing as much as possible in a single pass. Even if everything stays hot in L1D cache, the extra load and store instructions become the bottleneck compared to doing more work while your data is in registers.
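As an illustration of that pass fusion (a sketch with made-up names, not the poster's code): applying a filter and the 1/N inverse-FFT normalization in one loop touches each element once instead of twice:

```c
/* One fused pass: each element is loaded and stored once.
 * Two separate loops would roughly double the memory traffic. */
static void filter_and_scale(double *data, const double *filt,
                             double inv_n, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        data[i] = data[i] * filt[i] * inv_n;
}
```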


As Olaf pointed out, you can write matching load and store wrapper functions. With these, the code now compiles nicely to pairs of vmovapd instructions for the loads and for the stores.

    static inline void mload(union m256d t[2], double complex *f)
    {
        t[0].reg = _mm256_load_pd((double *)f);
        t[1].reg = _mm256_load_pd((double *)(f+2));
    }

    static inline void msave(union m256d f[2], double complex *t)
    {
        _mm256_store_pd((double *)t,     f[0].reg);
        _mm256_store_pd((double *)(t+2), f[1].reg);
    }

    unsigned int stride = data->stride;
    for (unsigned int i = 0; i < data->height; i++) {
        for (unsigned int j = 0; j < data->width; j += 4) {
            union m256d a[2];
            union m256d b[2];
            union m256d r[2];
            mload(a, &( data->data[i*stride+j]));
            mload(b, &(filter->data[i*stride+j]));
            r[0].reg = _mm256_mul_pd(a[0].reg, b[0].reg);
            r[1].reg = _mm256_mul_pd(a[1].reg, b[1].reg);
            msave(r, &(data->data[i*stride+j]));
        }
    }

Source: https://habr.com/ru/post/1271818/

