This is my first time asking a question on Stack Overflow. Apologies if it does not match the style or length expected here; that should improve with experience.
I am trying to vectorize a loop in C++ using Intel Compiler 14.0.1, to make better use of the wide 512-bit registers and improve speed on Intel Xeon Phi. This was inspired by https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization and by numerous references found via Google stating that data alignment is much more important on Xeon Phi than on modern Xeon processors (one of them is a good overview, https://indico.cern.ch/event/238763/material/slides/6.pdf, on page 18).
This question is somewhat similar to the one on unaligned memory access, but it covers a simpler, more common example and hopefully has a more definitive answer.
Example code snippet:
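#include <malloc.h>
#include <math.h>   // fmax
void func(float *const y, float *const x, const int & N, const float & a0, const float & a1, const float & a2, const float & a3)
{
    __assume(N%16 == 0);   // Intel compiler hint: the trip count is a multiple of 16
    int i;
#pragma simd
#pragma loop_count min=16, avg=80, max=2048
    for (i = 0; i < N; i++)
    {
        // x[i + 1] sits one float past x[i], so it cannot share x's 64-byte alignment
        y[i] = fmax(x[i + 1] * a0 + x[i] * a1, x[i] * a2 + a3);
    }
}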
int main()
{
    ...
    float * y = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64);
    float * x = (float *)_aligned_malloc(int_sizeBytes_x_or_y + 64, 64);
    ...
    for (int k = 0; k < M; k++)
    {
        ...
        ...
        func(y, x, N, a0, a1, a2, a3);
        ...
    }
    ...
    _aligned_free(x);
    _aligned_free(y);
}
func() is called 150-2000 times in the body, reusing the space previously allocated for x and y (to avoid repeated memory allocations, which are apparently relatively more expensive on Phi than on a regular Xeon). The body is repeated millions of times on each core.
The problem is that x[i] and x[i + 1] inherently cannot both be aligned to the 512-bit vector width, which prevents optimal vectorization because of the unaligned memory access for the x[i + 1] part.
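To make the misalignment concrete, here is a minimal sketch that reports how far a pointer sits past the previous 64-byte boundary. The helper name `offset_in_cache_line` is hypothetical and only for illustration; it is not part of the code above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: byte offset of p past the previous 64-byte boundary.
// A result of 0 means p is 64-byte aligned, i.e. suitable for an aligned
// 512-bit load.
inline std::size_t offset_in_cache_line(const void *p)
{
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 64);
}
```

Assuming a 4-byte float and a 64-byte-aligned x, `offset_in_cache_line(x)` is 0, `offset_in_cache_line(x + 1)` is 4, and `offset_in_cache_line(x + 16)` is 0 again. So every 16-float vector load that starts at x[i + 1] straddles two 64-byte lines when the matching load from x[i] is aligned.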
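- Would it help to pre-allocate a 64-byte-aligned _x once before the k++ loop and, on each k++ iteration, memcpy the shifted values of x into it (the equivalent of
for (int j = 0; j < N; j++) _x[j] = x[j + 1]; done with memcpy), so that a #pragma could be used in func() with y[i] = fmax(_x[i] * a0 + x[i] * a1, x[i] * a2 + a3);?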
- , ?
- (, Intel , parallelism)
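The shifted-copy idea from the first bullet could be sketched roughly as follows. This is only an illustration under assumptions, not a measured solution: `fill_shifted`, `func_shifted` and `x_shifted` (standing in for _x) are hypothetical names, the code is plain portable C++ without the Intel pragmas, and whether the extra memcpy pass costs less than the unaligned loads it avoids would have to be benchmarked on the target.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy x[1..N] into x_shifted[0..N-1], so that
// x_shifted[i] == x[i + 1]. x_shifted is assumed to be a separate,
// 64-byte-aligned buffer of at least N floats, allocated once outside
// the k loop; x must have at least N + 1 readable elements.
inline void fill_shifted(float *x_shifted, const float *x, int N)
{
    std::memcpy(x_shifted, x + 1, static_cast<std::size_t>(N) * sizeof(float));
}

// Hypothetical variant of func(): the forward neighbour is read from
// x_shifted[i] instead of x[i + 1], so all three streams (y, x, x_shifted)
// can start on a 64-byte boundary.
inline void func_shifted(float *y, const float *x, const float *x_shifted,
                         int N, float a0, float a1, float a2, float a3)
{
    assert(N % 16 == 0);  // same precondition as __assume(N%16 == 0) in func()
    // On the Intel compiler, an alignment hint (for example
    // "#pragma vector aligned"; which hint fits is an assumption here)
    // would go directly above this loop.
    for (int i = 0; i < N; i++)
    {
        y[i] = std::fmax(x_shifted[i] * a0 + x[i] * a1, x[i] * a2 + a3);
    }
}
```

Inside the k loop this would be used as `fill_shifted(x_shifted, x, N);` followed by `func_shifted(y, x, x_shifted, N, a0, a1, a2, a3);`. The copy itself still reads from the unaligned address x + 1, so the misaligned access is moved into memcpy rather than eliminated.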