How to implement _mm256_maskstore_epi8() in C/C++?

Question

How to implement _mm256_maskstore_epi8() in C/C++?

Problem

Suppose I have a vector of 27 (not 32!) int8_t values:

x = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}

I first want to rotate it right by n (n is not a compile-time constant). For example, if n = 1:

x2 = {26,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}

This vector then feeds a very complex calculation, but for simplicity suppose the next step is just to rotate it left by n and store it to memory. I should then end up with a new vector of 27 int8_t values:

y = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}
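For reference, the two rotations can be written in scalar form. This is only a sketch to pin down the semantics, not the AVX2 code; Vec27, iota27, rotate_right and rotate_left are hypothetical names, and it assumes 0 <= n < 27.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec27 = std::array<std::int8_t, 27>;

// Builds x = {0, 1, ..., 26}.
Vec27 iota27() {
    Vec27 v{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = static_cast<std::int8_t>(i);
    }
    return v;
}

// Rotate right by n: the element at index i moves to index (i + n) % 27.
Vec27 rotate_right(Vec27 v, std::size_t n) {
    assert(n < v.size());
    std::rotate(v.rbegin(), v.rbegin() + static_cast<std::ptrdiff_t>(n), v.rend());
    return v;
}

// Rotate left by n: the element at index i moves to index (i + 27 - n) % 27.
Vec27 rotate_left(Vec27 v, std::size_t n) {
    assert(n < v.size());
    std::rotate(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n), v.end());
    return v;
}
```

With these, rotate_right(iota27(), 1) gives x2 as listed above, and rotate_left applied to that result with the same n gives back y.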

There are thousands of such vectors, so performance matters a lot here. Our processor supports AVX2, so we want to use it to speed things up.

My current solution

To get x2, I use two _mm256_loadu_si256() calls with _mm256_blendv_epi8():

    int8_t x[31+27+31];
    for (int i = 0; i < 27; i++) { x[31+i] = i; }

    __m256i mask = _mm256_set_epi32(0x0, 0x00800000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0);
    __m256i x_second_part = _mm256_loadu_si256((__m256i*)(x+31+1));      // {1,2,...,26}
    __m256i x_first_part  = _mm256_loadu_si256((__m256i*)(x+31-26));     // {0}
    __m256i x2 = _mm256_blendv_epi8(x_second_part, x_first_part, mask);  // {1,2,...,26, 0}

    int8_t y[31+27+31];
    _mm256_storeu_si256((__m256i*)(y+31-26), x2);
    _mm256_storeu_si256((__m256i*)(y+31+1), x2);

x and y are declared with size [31+27+31] so that every _mm256_loadu_si256() and _mm256_storeu_si256() call stays inside the array and cannot segfault.
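The padding arithmetic can be checked at compile time. A minimal sketch, under the assumption (only n = 1 is shown above) that a rotation by n in 1..26 uses 32-byte accesses at offsets n and n - 27 from the start of the data; kLen, kWidth, kPad, kTotal and access_in_bounds are hypothetical names.

```cpp
#include <cstddef>

constexpr std::ptrdiff_t kLen   = 27;                  // payload bytes per vector
constexpr std::ptrdiff_t kWidth = 32;                  // bytes touched by one 256-bit access
constexpr std::ptrdiff_t kPad   = kWidth - 1;          // 31 bytes of padding on each side
constexpr std::ptrdiff_t kTotal = kPad + kLen + kPad;  // 89, as in x[31+27+31]

// True if a 32-byte access starting `offset` bytes from the payload start
// lies entirely inside an array of `total` bytes.
constexpr bool access_in_bounds(std::ptrdiff_t offset, std::ptrdiff_t total) {
    return kPad + offset >= 0 && kPad + offset + kWidth <= total;
}

// The two offsets used for n = 1 in the code above.
static_assert(access_in_bounds(+1, kTotal), "x+31+1 must stay in bounds");
static_assert(access_in_bounds(-26, kTotal), "x+31-26 must stay in bounds");
// Assumed worst case for n = 26: the access ends exactly at the array end.
static_assert(access_in_bounds(+26, kTotal), "x+31+26 must stay in bounds");
```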

I can then read back y:

    for (int i = 0; i < 27; i++) { cout << (int)y[31+i] << ' '; }

New problem

Unfortunately, all vectors must be contiguous in memory. For example, if only two vectors need to be processed:

    x = {[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26];
         [27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53]};

Then I can't just use _mm256_storeu_si256() to write y back to memory, because storing the second vector overwrites some values of the first vector:

    int8_t x[31+27+27+31];
    int8_t y[31+27+27+31];
    for (int i = 0; i < 27*2; i++) { x[31+i] = i; }

    for (int i = 0; i < 2; i++) {
        __m256i mask = _mm256_set_epi32(0x0, 0x00800000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0);
        __m256i x_second_part = _mm256_loadu_si256((__m256i*)(x+31+27*i+1));   // {1,2,...,26}
        __m256i x_first_part  = _mm256_loadu_si256((__m256i*)(x+31+27*i-26));  // {0}
        __m256i x2 = _mm256_blendv_epi8(x_second_part, x_first_part, mask);    // {1,2,...,26, 0}
        _mm256_storeu_si256((__m256i*)(y+31+27*i-26), x2);
        _mm256_storeu_si256((__m256i*)(y+31+27*i+1), x2);
    }

    for (int i = 0; i < 27; i++) { cout << (int)y[31+i] << ' '; }
    cout << endl;
    for (int i = 0; i < 27; i++) { cout << (int)y[31+27+i] << ' '; }
    cout << endl;

This prints:

    0 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53

instead of:

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
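The clobbering can be reproduced without intrinsics. The sketch below is a scalar model of the loop above, not the real AVX2 code: Vec32, load32, store32 and run_two_vectors are hypothetical names, the blend is modeled by overwriting byte 26, and the padding is zero-initialized (unlike the arrays above) so the model is deterministic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Vec32 = std::array<std::int8_t, 32>;

// Scalar stand-in for _mm256_loadu_si256: always reads 32 bytes.
Vec32 load32(const std::int8_t* p) {
    Vec32 v;
    std::memcpy(v.data(), p, v.size());
    return v;
}

// Scalar stand-in for _mm256_storeu_si256: always writes 32 bytes.
void store32(std::int8_t* p, const Vec32& v) {
    std::memcpy(p, v.data(), v.size());
}

// Runs the two-vector loop and returns the 54 payload bytes of y.
std::vector<std::int8_t> run_two_vectors() {
    constexpr std::size_t pad = 31, len = 27, count = 2;
    std::vector<std::int8_t> x(pad + len * count + pad, 0);
    std::vector<std::int8_t> y(x.size(), 0);
    for (std::size_t i = 0; i < len * count; ++i) {
        x[pad + i] = static_cast<std::int8_t>(i);
    }
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t base = pad + len * i;  // first payload byte of vector i
        Vec32 x2 = load32(&x[base + 1]);         // x_second_part
        x2[26] = x[base];                        // byte 26 taken from x_first_part
        store32(&y[base - 26], x2);              // writes 26 bytes of the previous vector
        store32(&y[base + 1], x2);
    }
    return std::vector<std::int8_t>(y.begin() + pad, y.begin() + pad + len * count);
}
```

In this model, the store at y[base - 26] for the second vector covers bytes 1 to 26 of the first vector, and nothing writes them again afterwards, which is how the first output line ends up wrong.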

So I was thinking about using a masked store, but I could not find _mm256_maskstore_epi8 in the Intel Intrinsics Guide. This brings me back to the original question: