One option is to pad your arrays to a multiple of 16 bytes. You can then do 128-bit load / add / store operations throughout and simply ignore any results beyond the bytes you care about.
For large arrays, though, the overhead of a byte-by-byte epilog will be very small. Unrolling the loop can improve performance, for example:
for (; i < n32; i += 32)
{
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm2 = _mm_load_si128( (__m128i*) (A + i + 16) );
    xmm3 = _mm_load_si128( (__m128i*) (B + i + 16) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    xmm3 = _mm_add_epi8( xmm2, xmm3 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
    _mm_store_si128( (__m128i*) (B + i + 16), xmm3 );
}
// Do another 128 bit load/add/store here if required
But it's hard to say without profiling.
You can also do an unaligned load / add / store at the end (assuming you have more than 16 bytes), although that probably won't make much difference. For instance, if you have 20 bytes, you do one aligned load / add / store at offset 0 and another unaligned load / add / store ( _mm_loadu_si128 , _mm_storeu_si128 ) at offset 4.
You could use _mm_maskmoveu_si128
, but you need to get the mask into an xmm register, and your sample code won't work as-is. You probably want to set the mask register to all FF's and then use a shift to align it. Chances are it will still be slower than the unaligned load / add / store.
It will be something like:
mask = _mm_cmpeq_epi8(mask, mask);          // set mask to all FF's
mask = _mm_srli_si128(mask, 16 - (n % 16)); // align mask (NB: the shift count must be a compile-time constant)
_mm_maskmoveu_si128(xmm, mask, (char *)(A + i));