Try this code, which uses vectors for multiplying (and dividing) your vectorized variables:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num) { float dividerV[4]={1.0f,1.0f,1.0f,255.0f}; float multiplierV[4]={1.0f,1.0f,1.0f,255.0f};
Perhaps this is easier to vectorize.
Against unknown loop bounds, try giving a direct constant instead of indices_num. This compiler is not a just-in-time one (maybe; I have not heard of one other than Java's), so a compile-time constant may help it.
Here:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices) { float dividerV[4]={1.0f,1.0f,1.0f,255.0f}; float multiplierV[4]={1.0f,1.0f,1.0f,255.0f};
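As a rough sketch of the constant-bound idea, the trip count can be passed as a template parameter so the compiler sees it as a constant. The name scale_alpha_fixed and the inner per-channel loop are my own illustration, not code from the question, and __restrict__ is the GCC/Clang spelling used above. Whether this is enough to get the loop vectorized depends on the compiler, so check its vectorization report.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name is illustrative): Count is a template parameter,
// so the loop bound is a compile-time constant instead of a runtime argument.
template <std::size_t Count>
void scale_alpha_fixed(unsigned char* __restrict__ pixels,
                       const int* __restrict__ indices) {
    assert(pixels != nullptr && indices != nullptr);
    const float dividerV[4]    = {1.0f, 1.0f, 1.0f, 255.0f};
    const float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
    for (std::size_t i = 0; i < Count; ++i) {
        const int idx = indices[i] * 4;          // still an indirect (gather) access
        for (int c = 0; c < 4; ++c) {
            const float v = pixels[idx + c] / dividerV[c];
            pixels[idx + c] = static_cast<unsigned char>(v * multiplierV[c]);
        }
    }
}
```

It would be called as scale_alpha_fixed<1000>(pixels, indices); the indirect access through indices[i] is unchanged, so this only removes the unknown bound.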
Sometimes arrays are not aligned properly for vector instructions. For example, the CPU may only give faster reads/writes for 32-byte (or 16-byte) aligned arrays. Unaligned reads/writes are slower (or not vectorizable).
Here:
#include <cstddef> // size_t
#include <cstdio>  // printf

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits

    // the cast needs parentheses around its operand
    if (reinterpret_cast<size_t>(pixels) % 32 != 0)
    {
        printf("array is not aligned! need to shift array or need to do serial calc. until aligned offset reached!\n");
        // do the non-vectorized calculation here; once an aligned offset is reached, go to the vectorized code
    }
    else
    {
        printf("array is aligned! Starting fast access.\n");
        for (int i = 0; i < 1000; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0] / dividerV[0];
            float g = pixels[idx + 1] / dividerV[1];
            float b = pixels[idx + 2] / dividerV[2];
            float a = pixels[idx + 3] / dividerV[3];
            pixels[idx + 0] = r * multiplierV[0];
            pixels[idx + 1] = g * multiplierV[1];
            pixels[idx + 2] = b * multiplierV[2];
            pixels[idx + 3] = a * multiplierV[3];
        }
        return;
    }
}
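The "serial calc. until aligned offset reached" branch could look roughly like the sketch below. Note the assumptions: it walks the bytes densely (no indices[]), and the names bytes_until_aligned and scale_bytes, as well as the single factor parameter, are made up for illustration. It only shows the peeling pattern, not a fix for the gather access itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of leading bytes to handle one at a time before
// `p` reaches the next `alignment`-byte boundary (alignment must be a power of two).
inline std::size_t bytes_until_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>((alignment - (addr % alignment)) % alignment);
}

// Scales n bytes by `factor`: a scalar prologue runs up to the first 32-byte
// boundary, then the main loop starts from an aligned address.
inline void scale_bytes(unsigned char* data, std::size_t n, float factor) {
    std::size_t head = bytes_until_aligned(data, 32);
    if (head > n) head = n;                       // buffer ends before the boundary
    for (std::size_t i = 0; i < head; ++i)        // scalar prologue
        data[i] = static_cast<unsigned char>(data[i] * factor);
    unsigned char* aligned = data + head;         // 32-byte aligned when head < n
    for (std::size_t i = 0; i < n - head; ++i)    // main loop, the vectorizer's target
        aligned[i] = static_cast<unsigned char>(aligned[i] * factor);
}
```

The compiler does not automatically know that the second loop starts on an aligned address; on GCC/Clang something like __builtin_assume_aligned can tell it, but I left that out to keep the sketch portable.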
Perhaps someone could take memcpy or some other array-copying asm file, paste some multiplication code into it, and compile it as memcpy_with_multiplication(,,)?
One last suggestion: wrap r, g, b, a in one array so that they sit at adjacent addresses. Here:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices) { float dividerV[4]={1.0f,1.0f,1.0f,255.0f}; float multiplierV[4]={1.0f,1.0f,1.0f,255.0f};
"indices[i]" is not an explicit offset into the pointer, and that could be the problem. Try a different way of showing this to the compiler. What happens when you use only i instead of indices[i]? Does the compiler vectorize it then? The value of indices[i] cannot be known at compile time, or it is too complex for the compiler to analyze.
A simplified (and no longer equivalent) version that is more vectorizable:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits

    // you need a sorted version of the indices[] (or pixels[]) array to achieve something like this
    for (int i = 0; i < 4000; i += 4)
    {
        float rgba[4];
        rgba[0] = pixels[i + 0] / dividerV[0];
        rgba[1] = pixels[i + 1] / dividerV[1];
        rgba[2] = pixels[i + 2] / dividerV[2];
        rgba[3] = pixels[i + 3] / dividerV[3];
        pixels[i + 0] = rgba[0] * multiplierV[0];
        pixels[i + 1] = rgba[1] * multiplierV[1];
        pixels[i + 2] = rgba[2] * multiplierV[2];
        pixels[i + 3] = rgba[3] * multiplierV[3];
    }
    return;
}
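If the pixels cannot be reordered in place, one way to still get a dense loop like the one above is to copy the selected pixels into a contiguous buffer, process that, and copy the results back. This is only a sketch under my own assumptions: scale_selected, the scale parameter, and the temporary std::vector are illustrative, and I have not measured whether the extra copies pay off. It also differs from the in-place loop when indices[] contains duplicates, since each duplicate is scaled from the original value rather than twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gather the selected RGBA pixels into a contiguous
// float buffer, run a dense loop over it, then scatter the results back.
inline void scale_selected(unsigned char* pixels, const int* indices,
                           std::size_t count, const float (&scale)[4]) {
    assert(pixels != nullptr && indices != nullptr);
    std::vector<float> packed(count * 4);

    // Gather: scattered reads, contiguous writes.
    for (std::size_t i = 0; i < count; ++i)
        for (int c = 0; c < 4; ++c)
            packed[i * 4 + c] = pixels[indices[i] * 4 + c];

    // Dense loop over adjacent floats: the part a compiler can vectorize easily.
    for (std::size_t i = 0; i < packed.size(); i += 4)
        for (int c = 0; c < 4; ++c)
            packed[i + c] *= scale[c];

    // Scatter: contiguous reads, scattered writes.
    for (std::size_t i = 0; i < count; ++i)
        for (int c = 0; c < 4; ++c)
            pixels[indices[i] * 4 + c] = static_cast<unsigned char>(packed[i * 4 + c]);
}
```

Only the middle loop is the dense, vectorizable one; the gather and scatter loops still go through indices[i].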