SIMD code for arrays of arbitrary length

I'm getting started with SIMD by rewriting my personal image processing library using vector intrinsics. One of the basic functions is a simple array += , i.e.:

 void arrayAdd(unsigned char* A, unsigned char* B, size_t n) {
     for (size_t i = 0; i < n; i++) {
         B[i] += A[i];
     }
 }

For an arbitrary array length, the obvious SIMD code (assuming the arrays are 16-byte aligned) looks something like this:

 size_t i = 0;
 __m128i xmm0, xmm1;
 size_t n16 = n - (n % 16);

 for (; i < n16; i += 16) {
     xmm0 = _mm_load_si128( (__m128i*) (A + i) );
     xmm1 = _mm_load_si128( (__m128i*) (B + i) );
     xmm1 = _mm_add_epi8( xmm0, xmm1 );
     _mm_store_si128( (__m128i*) (B + i), xmm1 );
 }
 for (; i < n; i++) {   // scalar epilogue for the remaining n % 16 bytes
     B[i] += A[i];
 }

But is it possible to do all of the adds with SIMD instructions? I thought of trying this:

 __m128i mask = (0x100 << 8*(n - n16)) - 1;
 _mm_maskmoveu_si128( xmm1, mask, (__m128i*) (B + i) );

for the remaining elements, but will this lead to undefined behavior? The mask should guarantee that no access actually goes beyond the bounds of the array (I think). An alternative is to do the remaining elements first, but then the arrays would have to be aligned at n - n16 , which seems wrong.

Is there another, better pattern for handling such vectorized loops?

+6
1 answer

One option is to pad your arrays out to a multiple of 16 bytes. You can then do the 128-bit load/add/store throughout and simply ignore any results past the point you care about, as in the sketch below.
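A minimal sketch of the padding approach, assuming you control the allocations ( alloc_padded and arrayAddPadded are illustrative names, not from the original code):

 #include <emmintrin.h>
 #include <stddef.h>

 /* Hypothetical helper: allocate a 16-byte-aligned buffer whose size is
    rounded up to a multiple of 16, so the vector loop never reads or
    writes past the usable area. Free with _mm_free. */
 static unsigned char* alloc_padded(size_t n) {
     size_t padded = (n + 15) & ~(size_t)15;
     return (unsigned char*)_mm_malloc(padded, 16);
 }

 void arrayAddPadded(unsigned char* A, unsigned char* B, size_t n) {
     /* A and B were allocated via alloc_padded, so processing the full
        padded size is safe; bytes past n are garbage we simply ignore. */
     size_t padded = (n + 15) & ~(size_t)15;
     for (size_t i = 0; i < padded; i += 16) {
         __m128i a = _mm_load_si128( (const __m128i*) (A + i) );
         __m128i b = _mm_load_si128( (const __m128i*) (B + i) );
         _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8(a, b) );
     }
 }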

For large arrays, though, the overhead of the byte-by-byte epilogue will be very small. Unrolling the loop may improve performance further, e.g.:

 __m128i xmm2, xmm3;
 size_t n32 = n - (n % 32);

 for (; i < n32; i += 32) {
     xmm0 = _mm_load_si128( (__m128i*) (A + i) );
     xmm1 = _mm_load_si128( (__m128i*) (B + i) );
     xmm2 = _mm_load_si128( (__m128i*) (A + i + 16) );
     xmm3 = _mm_load_si128( (__m128i*) (B + i + 16) );
     xmm1 = _mm_add_epi8( xmm0, xmm1 );
     xmm3 = _mm_add_epi8( xmm2, xmm3 );
     _mm_store_si128( (__m128i*) (B + i), xmm1 );
     _mm_store_si128( (__m128i*) (B + i + 16), xmm3 );
 }
 // Do another 128-bit load/add/store here if required

But it's hard to say without profiling.

You could also do an unaligned load/store at the end (assuming you have more than 16 bytes), although that probably won't make much difference. E.g. if you have 20 bytes, you'd do one aligned load/add/store at offset 0 and another unaligned load/add/store ( _mm_loadu_si128 , _mm_storeu_si128 ) at offset 4; see the sketch below.
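A minimal sketch of that idea, assuming n >= 16 and 16-byte-aligned arrays for the main loop ( arrayAddOverlapTail is an illustrative name). One subtlety: since the operation is B[i] += A[i], the overlapped bytes must not be added twice, so the tail operands are loaded before the main loop modifies B:

 #include <emmintrin.h>
 #include <stddef.h>

 void arrayAddOverlapTail(unsigned char* A, unsigned char* B, size_t n) {
     size_t n16 = n - (n % 16);

     /* Read the last 16 bytes of both arrays before the main loop
        modifies B, and compute their sum once. */
     __m128i tail = _mm_add_epi8(
         _mm_loadu_si128( (const __m128i*) (A + n - 16) ),
         _mm_loadu_si128( (const __m128i*) (B + n - 16) ));

     for (size_t i = 0; i < n16; i += 16) {
         __m128i a = _mm_load_si128( (const __m128i*) (A + i) );
         __m128i b = _mm_load_si128( (const __m128i*) (B + i) );
         _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8(a, b) );
     }

     /* Overlapped bytes are rewritten with the same correct value,
        so the overlap is harmless. */
     _mm_storeu_si128( (__m128i*) (B + n - 16), tail );
 }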

You can use _mm_maskmoveu_si128 , but you need to get the mask into an xmm register, and your sample code won't work as written. You probably want to set the mask register to all FF's and then use a byte shift to align it. At the end of the day, it will probably be slower than the unaligned load/add/store.

It would look something like this:

 mask = _mm_cmpeq_epi8(mask, mask);           // Set mask to all FF's
 mask = _mm_srli_si128(mask, 16 - (n % 16));  // Align mask (note: shift count must be a compile-time constant)
 _mm_maskmoveu_si128(xmm1, mask, (char*) (B + i));
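Filled out as a compilable sketch (my own variant, not from the answer above): since _mm_srli_si128 requires a compile-time shift count, a common workaround is to load the mask from a sliding window over a byte table. Note this also assumes the buffers are readable for a full 16 bytes at the tail (e.g. padded allocation), since only the store is masked, not the loads; mask_table and maskedTail are illustrative names:

 #include <emmintrin.h>
 #include <stddef.h>

 /* 16 FF bytes followed by 16 zero bytes; loading at offset 16 - r
    yields a vector with FF in the low r bytes and zero elsewhere. */
 static const unsigned char mask_table[32] = {
     0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
     0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
     0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0
 };

 /* Handle the final r = n - i bytes (1..15) with a masked store. */
 void maskedTail(const unsigned char* A, unsigned char* B, size_t i, size_t n) {
     size_t r = n - i;
     __m128i mask = _mm_loadu_si128( (const __m128i*) (mask_table + 16 - r) );
     __m128i sum  = _mm_add_epi8(
         _mm_loadu_si128( (const __m128i*) (A + i) ),
         _mm_loadu_si128( (const __m128i*) (B + i) ));
     /* Writes only the bytes whose mask byte has its high bit set. */
     _mm_maskmoveu_si128(sum, mask, (char*) (B + i));
 }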
+4

Source: https://habr.com/ru/post/913280/

