not vectorized: not suitable for gather D.32476_34 = *D.32475_33;

I want my code to be auto-vectorized by the compiler, but I cannot get it to work. In particular, this is the message I get from GCC with the -ftree-vectorizer-verbose=6 option: 125: not vectorized: not suitable for gather D.32476_34 = *D.32475_33;

My question is: what does this message mean, and what do the numbers in it mean?

Below is a simple test example that produces the same message, so I assume the problems are related.

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices,
                                        int indices_num)
    {
        for (int i = 0; i < indices_num; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0];
            float g = pixels[idx + 1];
            float b = pixels[idx + 2];
            float a = pixels[idx + 3] / 255.0f;
            pixels[idx + 0] = r;
            pixels[idx + 1] = g;
            pixels[idx + 2] = b;
            pixels[idx + 3] = a * 255.0f;
        }
        return;
    }

In addition, while creating my example I came across a number of other messages. I am not sure what they mean or why a particular construct is problematic for vectorization. Is there a manual, book, tutorial, blog, or anything else that explains these things?

If it matters, I am using MinGW 4.7 32-bit with Qt Creator 2.7.0.

EDIT: conclusion:

According to my tests and the suggestions in this post, the message is most likely related to accessing the data indirectly through an auxiliary index array. This leads to a gather/scatter addressing scheme, which GCC currently cannot (or does not want to) vectorize. I was able to get vectorized code with clang++ 3.2-1, though.

+4
2 answers

A vectorized version of your code would conceptually look like this (using OpenCL syntax):

    for (int i = 0; i < indices_num; ++i)
    {
        int idx = indices[i];  // vload4/vstore4 scale the offset by 4 elements
        float4 factor = (float4)(1.0f, 1.0f, 1.0f, 255.0f);
        uchar4 x1 = vload4(idx, pixels);    // Line A
        float4 x2 = convert_float4(x1);
        float4 x3 = x2 / factor;
        float4 x4 = x3 * factor;
        uchar4 x5 = convert_uchar4(x4);
        vstore4(x5, idx, pixels);           // Line B
    }

But hold on: in Line A you are trying to load four chars (uint8 values) from an index-dependent memory location (a gather), and in Line B you store them back the same way (a scatter). This is not a common feature on x86; the only instruction sets I know of that support it are AVX2 (Intel Haswell and later) and the Xeon Phi's. If you are not compiling for one of these, that may explain why your compiler rejects this vectorization opportunity.

The compiler could, of course, load the 4 uint8s individually, build a vector from them, perform the required vector operations, and manually store the 4 values back. But I guess that, without gathers and scatters, loading and storing the values individually was considered too expensive compared to the amount of actual work you save by vectorizing.
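As a rough illustration of that manual approach, the following sketch processes one pixel with four individual loads, an element-wise arithmetic step on a small local array, and four individual stores. The helper name `process_pixel` is hypothetical, and this is not what GCC actually emits; it only spells out the steps described above.

```cpp
// Hypothetical helper (not from the original post): handles the RGBA
// pixel that starts at byte offset idx.
void process_pixel(unsigned char* pixels, int idx) {
    const float factor[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    float v[4];

    // Four separate scalar loads stand in for the missing gather.
    for (int c = 0; c < 4; ++c) {
        v[c] = pixels[idx + c];
    }

    // Element-wise work on 4 floats: the only part that maps directly
    // onto SIMD instructions.
    for (int c = 0; c < 4; ++c) {
        v[c] = v[c] / factor[c] * factor[c];
    }

    // Four separate scalar stores stand in for the missing scatter.
    for (int c = 0; c < 4; ++c) {
        pixels[idx + c] = static_cast<unsigned char>(v[c]);
    }
}
```

With only one division and one multiplication per channel, the arithmetic is small compared to the eight scalar memory operations around it, which is the cost trade-off mentioned above.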

+2

Try this code, which uses vectors for multiplying (and dividing) the values you want vectorized:

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices,
                                        int indices_num)
    {
        float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
        float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
        // You can use the same vector to both multiply and divide if you want.
        // Different vectors may give some more pipelining, but they also need
        // more memory accesses, so pick carefully.
        for (int i = 0; i < indices_num; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0] / dividerV[0];
            float g = pixels[idx + 1] / dividerV[1];
            float b = pixels[idx + 2] / dividerV[2];
            float a = pixels[idx + 3] / dividerV[3];
            pixels[idx + 0] = r * multiplierV[0];
            pixels[idx + 1] = g * multiplierV[1];
            pixels[idx + 2] = b * multiplierV[2];
            pixels[idx + 3] = a * multiplierV[3];
        }
        return;
    }

Perhaps this is easier to vectorize.

To rule out unknown loop bounds as the cause, try a direct constant instead of indices_num. This compiler does not compile just in time (as far as I know; I have only heard of that for Java), so a compile-time constant may help.

Here:

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices)
    {
        float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
        float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
        // You can use the same vector to both multiply and divide if you want.
        // Different vectors may give some more pipelining, but they also need
        // more memory accesses, so pick carefully.
        for (int i = 0; i < 1000; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0] / dividerV[0];
            float g = pixels[idx + 1] / dividerV[1];
            float b = pixels[idx + 2] / dividerV[2];
            float a = pixels[idx + 3] / dividerV[3];
            pixels[idx + 0] = r * multiplierV[0];
            pixels[idx + 1] = g * multiplierV[1];
            pixels[idx + 2] = b * multiplierV[2];
            pixels[idx + 3] = a * multiplierV[3];
        }
        return;
    }

Sometimes arrays are not aligned properly for vector instructions. For example, the CPU may only deliver its best read/write performance for arrays aligned to 32 bytes (or 16 bytes). Unaligned reads and writes are slower (or not vectorizable).

Here:

    #include <cstddef>
    #include <cstdio>

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices)
    {
        float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
        float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
        if (reinterpret_cast<size_t>(pixels) % 32 != 0)
        {
            printf("array is not aligned! need to shift array or need to do serial calc. until aligned offset reached!");
            // Do the non-vectorized calculation here. When an aligned offset
            // is reached, go to the vectorized code.
        }
        else
        {
            printf("array is aligned! Starting fast access.");
            for (int i = 0; i < 1000; ++i)
            {
                int idx = indices[i] * 4;
                float r = pixels[idx + 0] / dividerV[0];
                float g = pixels[idx + 1] / dividerV[1];
                float b = pixels[idx + 2] / dividerV[2];
                float a = pixels[idx + 3] / dividerV[3];
                pixels[idx + 0] = r * multiplierV[0];
                pixels[idx + 1] = g * multiplierV[1];
                pixels[idx + 2] = b * multiplierV[2];
                pixels[idx + 3] = a * multiplierV[3];
            }
            return;
        }
    }
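The "serial calculation until an aligned offset is reached" mentioned in the comment above could be sketched roughly as follows. Both `bytes_until_aligned` and `halve_bytes` are hypothetical names chosen for illustration, and the per-byte operation is a stand-in for the real work. The 32-byte figure is taken from the text above, not from any measurement.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes from p up to the next address
// that is a multiple of alignment (0 if p is already aligned).
std::size_t bytes_until_aligned(const void* p, std::size_t alignment) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t rem = static_cast<std::size_t>(addr % alignment);
    return rem == 0 ? 0 : alignment - rem;
}

// Hypothetical example loop: a serial prologue handles the unaligned
// head, then the main loop starts at a 32-byte aligned address.
void halve_bytes(unsigned char* p, std::size_t n) {
    std::size_t head = bytes_until_aligned(p, 32);
    if (head > n) {
        head = n;
    }
    std::size_t i = 0;
    for (; i < head; ++i) {
        p[i] /= 2;  // serial part, before the aligned offset
    }
    for (; i < n; ++i) {
        p[i] /= 2;  // candidate for aligned vector accesses
    }
}
```

Note that this only helps loops that walk the array contiguously; it does not remove the indirection through the index array.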

Perhaps someone can open memcpy or some array-copying asm file and paste some multiplication code into it and compile it as memcpy_with_multiplication (,,)?

One last suggestion: wrap r, g, b, a in a single array so that they sit at adjacent addresses. Here:

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices)
    {
        float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
        float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
        // You can use the same vector to both multiply and divide if you want.
        // Different vectors may give some more pipelining, but they also need
        // more memory accesses, so pick carefully.
        for (int i = 0; i < 1000; ++i)
        {
            int idx = indices[i] * 4;
            float rgba[4];
            rgba[0] = pixels[idx + 0] / dividerV[0];
            rgba[1] = pixels[idx + 1] / dividerV[1];
            rgba[2] = pixels[idx + 2] / dividerV[2];
            rgba[3] = pixels[idx + 3] / dividerV[3];
            pixels[idx + 0] = rgba[0] * multiplierV[0];
            pixels[idx + 1] = rgba[1] * multiplierV[1];
            pixels[idx + 2] = rgba[2] * multiplierV[2];
            pixels[idx + 3] = rgba[3] * multiplierV[3];
        }
        return;
    }

indices[i] is not an explicit offset into the pointer, and that could be the problem. Try a different way of expressing this to the compiler. What happens when you use only i instead of indices[i]? Will the compiler vectorize it then? The value of indices[i] cannot be known at compile time, or it is too complex for the compiler.

A simplified (and no longer equivalent) but more vectorizable version:

    static void not_suitable_for_gather(unsigned char * __restrict__ pixels,
                                        int * __restrict__ indices)
    {
        float dividerV[4] = {1.0f, 1.0f, 1.0f, 255.0f};
        float multiplierV[4] = {1.0f, 1.0f, 1.0f, 255.0f}; // choose anything that suits
        // You need a sorted version of the indices[] (or pixels[]) array
        // to achieve something like this.
        for (int i = 0; i < 4000; i += 4)
        {
            float rgba[4];
            rgba[0] = pixels[i + 0] / dividerV[0];
            rgba[1] = pixels[i + 1] / dividerV[1];
            rgba[2] = pixels[i + 2] / dividerV[2];
            rgba[3] = pixels[i + 3] / dividerV[3];
            pixels[i + 0] = rgba[0] * multiplierV[0];
            pixels[i + 1] = rgba[1] * multiplierV[1];
            pixels[i + 2] = rgba[2] * multiplierV[2];
            pixels[i + 3] = rgba[3] * multiplierV[3];
        }
        return;
    }
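The contiguous loop only touches the same pixels as the indexed one when the indices describe one unbroken run. A minimal sketch of such a check follows; `indices_are_consecutive` is a hypothetical helper that is not part of the code above, and it tests a stricter condition than merely being sorted.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when indices[i] == indices[0] + i
// for every i, i.e. the indexed pixels form one contiguous run.
// An empty or single-element array counts as consecutive.
bool indices_are_consecutive(const int* indices, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        if (indices[i] != indices[0] + static_cast<int>(i)) {
            return false;
        }
    }
    return true;
}
```

If the check succeeds, a caller could run the contiguous loop starting at pixels + indices[0] * 4 and fall back to the indexed loop otherwise.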
+1

Source: https://habr.com/ru/post/1490675/

