Vectorization / optimization of a loop with unaligned data access for wide registers (in particular, Xeon Phi)

This is my first question on Stack Overflow, so I apologize if it does not match the style or scope of the forum; that will improve with experience.

I am trying to vectorize a loop in C++ with Intel Compiler 14.0.1, in order to make better use of the wide 512-bit registers and optimize speed on Intel Xeon Phi. I was inspired by https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization and by numerous Google results saying that data alignment is much more important on Xeon Phi than on modern Xeon processors (one of them is a good review, https://indico.cern.ch/event/238763/material/slides/6.pdf, page 18).

This question is somewhat similar to an existing one on unaligned memory access, but covers a simpler / more common example and hopefully has a more definitive answer.

Example code snippet:

#include <malloc.h>
#include <math.h> // for fmax


void func(float *const y, float *const x, const int &N, const float &a0, const float &a1, const float &a2, const float &a3)
{
    __assume(N % 16 == 0); // aim is to let the compiler know that there is no residual loop (not sure if it works as expected, though)

    int i;
#pragma simd // to assume no vector dependencies
#pragma loop_count min(16), avg(80), max(2048) // to let the compiler know which trip counts to optimize for (not sure if it is beneficial)
//#pragma vector aligned // to let the compiler know that all the arrays are aligned... but not in this case
    for (i = 0; i < N; i++)
    {
        y[i] = fmax(x[i + 1] * a0 + x[i] * a1, x[i] * a2 + a3);
    }

}

int main()
{

...
//y and x are _mm_malloced with 64 byte alignment, e.g.

float * y = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64); //+64 for padding to enable vectorisation without using mask on the residual loop
float * x = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64);
...
//M ranges from 160 to 2048, most often 160 (a multiple of 16, the number of floats per 512-bit register)
for (int k = 0; k < M; k++)
{
...
//int N = ceil(k / 16.0) * 16; // to have no residual loop, not sure if beneficial
...


func(y, x, N, a0, a1, a2, a3);


...
}
...
_aligned_free(x);
_aligned_free(y);
}

func() is called 150-2000 times inside this body, reusing the previously allocated space for x and y (to avoid repeated memory allocations, which are apparently relatively more expensive on Phi than on a regular Xeon). The body itself is executed millions of times on each core.

The problem is that x[i] and x[i + 1] cannot both be aligned for the 512-bit vector unit, so optimal vectorization is impossible: the accesses to x[i + 1] are inherently unaligned.

- Is it a good idea to create an auxiliary 64-byte-aligned array _x once, before the k++ loop, and on each k++ iteration fill it with x shifted by one element (e.g. for (int j = 0; j < N; j++) _x[j] = x[j + 1]; or with memcpy), so that #pragma vector aligned can be used in func() with y[i] = fmax(_x[i] * a0 + x[i] * a1, x[i] * a2 + a3);? (See the sketch after this list.)

- Or is there a better way to get aligned access for both x[i] and x[i + 1]?

- Any other optimization advice is also welcome (e.g., Intel-specific techniques, parallelism).
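
To make the first idea concrete, here is a minimal sketch (the names make_shifted_copy and func_aligned are hypothetical, and it assumes _x was allocated with the same 64-byte alignment and +64-byte padding as x):

#include <string.h> // memcpy
#include <math.h>   // fmax

// Hypothetical helper: refresh the shifted copy once per k iteration.
void make_shifted_copy(float *const _x, const float *const x, const int N)
{
    memcpy(_x, x + 1, N * sizeof(float)); // _x[j] = x[j + 1]
}

// Variant of func() where both input streams are 64-byte aligned,
// so the commented-out pragma from the question can be enabled.
void func_aligned(float *const y, const float *const x, const float *const _x,
                  const int N, const float a0, const float a1,
                  const float a2, const float a3)
{
#pragma vector aligned
    for (int i = 0; i < N; i++)
    {
        y[i] = fmax(_x[i] * a0 + x[i] * a1, x[i] * a2 + a3);
    }
}

Note the trade-off being asked about: the copy adds one extra read and one extra write per element of x, so it only pays off if the aligned loop gains more than that.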


First, tell the compiler that each pointer is aligned: __assume_aligned(x, 64) and __assume_aligned(y, 64).
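
In the question's func() that would look like this (a sketch; __assume_aligned is the Intel-compiler hint, which fits the Intel Compiler 14.0.1 used here):

#include <math.h> // fmax

void func(float *const y, float *const x, const int &N, const float &a0,
          const float &a1, const float &a2, const float &a3)
{
    __assume_aligned(x, 64); // both buffers come from _aligned_malloc(..., 64)
    __assume_aligned(y, 64); // x[i+1] is still unaligned, but the compiler now
                             // knows the base pointers and the y stores are aligned
    for (int i = 0; i < N; i++)
    {
        y[i] = fmax(x[i + 1] * a0 + x[i] * a1, x[i] * a2 + a3);
    }
}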

Be careful with __assume(N % 16 == 0): if the assumption is ever violated at run time (N % 16 not equal to 0), the generated code is simply wrong, with no diagnostics. Make sure every caller really guarantees it, for example by rounding N up to a multiple of 16, the way you already construct M.
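
If that guarantee has to be manufactured at the call site, the commented-out ceil() line from the question can be replaced by its integer equivalent (a sketch; it relies on the +64 bytes of padding so the extra elements stay in bounds):

// Round k up to the next multiple of 16 (16 floats per 512-bit register).
const int N = (k + 15) & ~15;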

As for the data itself, the access starting at x[1] is unavoidable at the source level: one of the two input streams will always be misaligned.

If you are willing to hand-vectorize with intrinsics, look at _mm512_alignr_epi32. Load two consecutive aligned vectors and combine them: _mm512_alignr_epi32 builds the vector shifted by one element, so the data for each iteration comes from just 2 aligned loads.
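
A minimal sketch of that approach, assuming AVX-512F-style intrinsic names (func_intrin is hypothetical; on the first-generation Xeon Phi the same operations exist in the IMCI intrinsic set, but the exact names are worth checking against the compiler headers). The final vnext load reads into the +64 bytes of padding:

#include <immintrin.h>

// Hypothetical hand-vectorized variant for N % 16 == 0 and 64-byte-aligned
// x, y, with x padded by at least 64 bytes (as in the question's allocation).
void func_intrin(float *const y, const float *const x, const int N,
                 const float a0, const float a1, const float a2, const float a3)
{
    const __m512 va0 = _mm512_set1_ps(a0);
    const __m512 va1 = _mm512_set1_ps(a1);
    const __m512 va2 = _mm512_set1_ps(a2);
    const __m512 va3 = _mm512_set1_ps(a3);

    __m512 vcur = _mm512_load_ps(x); // aligned load of x[0..15]
    for (int i = 0; i < N; i += 16)
    {
        // Next aligned vector; the padding keeps the final load in bounds.
        __m512 vnext = _mm512_load_ps(x + i + 16);

        // Build x[i+1 .. i+16]: concatenate (vnext, vcur) and shift right
        // by one 32-bit element; two aligned loads, no unaligned access.
        __m512 vshift = _mm512_castsi512_ps(_mm512_alignr_epi32(
            _mm512_castps_si512(vnext), _mm512_castps_si512(vcur), 1));

        __m512 lhs = _mm512_fmadd_ps(vshift, va0, _mm512_mul_ps(vcur, va1)); // x[i+1]*a0 + x[i]*a1
        __m512 rhs = _mm512_fmadd_ps(vcur, va2, va3);                        // x[i]*a2 + a3
        _mm512_store_ps(y + i, _mm512_max_ps(lhs, rhs));

        vcur = vnext; // each 64-byte line of x is loaded exactly once
    }
}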


Source: https://habr.com/ru/post/1538310/
