A smart compiler could get table+factor into a register and use indexed addressing modes to get table+factor+k1fac6 as an address. Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
    const int *tf = table + factor;
    const int *tf2 = table + factor2;
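With those pointers, each lookup can become a single load with an indexed addressing mode. As a sketch (pairing tf/tf2 with specific k1fac constants is an assumption based on the names above):

    int a = tf[k1fac6];     // one load, e.g. mov eax, [rsi + rcx*4]
    int b = tf2[k1fac2];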
But to answer the question you asked:
Latency is not a big deal when you have lots of independent adds. The throughput of 4 scalar add instructions per clock on Haswell is much more relevant.
If k1fac2 etc. are already in adjacent memory, it might be worth using SIMD. Otherwise, all the shuffling and data movement needed to get them into / out of vector registers makes it definitely not worth it (i.e. the stuff the compiler emits to implement _mm256_set_epi32(0, kfac6, kfac5, kfac4, kfac3, kfac2, kfac1, kfac)).
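A minimal sketch of the contiguous-layout case (assuming <immintrin.h> is included and the constants live in a hypothetical int k1fac[8] array, with the unused top slot as padding):

    // One unaligned vector load replaces the whole _mm256_set_epi32 sequence,
    // then a broadcast add builds all 8 table indices at once.
    __m256i kvec = _mm256_loadu_si256((const __m256i*)k1fac);
    __m256i idx  = _mm256_add_epi32(kvec, _mm256_set1_epi32(factor));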
You could avoid the need to get the indices back into integer registers by using an AVX2 gather to do the table loads. But gather is slow on Haswell, so it's probably not worth it. Maybe worth it on Broadwell.
Gather is fast on Skylake, so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back into separate integer registers, it's probably not worth it.
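For reference, the gather version of the lookup might look like this sketch, assuming a vector idx of eight 32-bit indices (e.g. built as in the snippet above); the scale of 4 is sizeof(int):

    // vpgatherdd: lane i loads table[idx[i]]
    __m256i lut = _mm256_i32gather_epi32(table, idx, 4);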
If you do need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main strategy options:
- Vector store to a tmp array and scalar loads
- ALU shuffle instructions like pextrd (_mm_extract_epi32). Use _mm256_extracti128_si256 to get the high lane into a separate __m128i.
- A mix of both strategies (e.g. store the high 128 bits to memory while using ALU stuff on the low half).
Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, one of which needs the shuffle unit on port5. That's a lot of shuffle uops, so a pure-ALU strategy is only going to be good if your code is bottlenecked on L1d cache bandwidth. (Not the same thing as memory bandwidth.) movd r32, xmm is only 1 uop, and compilers do know that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
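As a sketch of the pure-ALU route for the low lane (the high lane would need a vextracti128 / _mm256_extracti128_si256 first):

    __m128i lo = _mm256_castsi256_si128(vec);   // low 128 bits: free, no instruction
    int e0 = _mm_cvtsi128_si32(lo);             // movd: 1 uop
    int e1 = _mm_extract_epi32(lo, 1);          // pextrd: 2 uops, one of them a port5 shuffle
    int e2 = _mm_extract_epi32(lo, 2);
    int e3 = _mm_extract_epi32(lo, 3);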
Store/reload has good throughput. Intel SnB-family CPUs, including Haswell, can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it. But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array. You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
    _Alignas(32) int tmp[8];
    _mm256_store_si256((__m256i*)tmp, vec);
    ...
    foo2 = tmp[2];
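If you go the union route mentioned above, it might look like this sketch (the __m256i member is only there to give the array 32-byte alignment):

    union vec_tmp {
        __m256i v;      // unused except to force 32-byte alignment
        int i[8];
    } u;
    _mm256_store_si256((__m256i*)u.i, vec);   // store through the int-array member
    foo2 = u.i[2];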
The problem with store/reload is latency, though. Even the first result won't be ready until about 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU extraction of the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
    _Alignas(32) int tmp[8];
    _mm256_store_si256((__m256i*)tmp, vec);

    __m128i lo = _mm256_castsi256_si128(vec);   // This is free, no instructions
    int foo0 = _mm_cvtsi128_si32(lo);
    int foo1 = _mm_extract_epi32(lo, 1);

    foo2 = tmp[2];
    // rest of foo3..foo7 also loaded from tmp[]

    // Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane. Use vextracti128 [mem], ymm, 1:
    _Alignas(16) int tmp[4];
    _mm_store_si128((__m128i*)tmp, _mm256_extracti128_si256(vec, 1));

    // movd / pextrd for foo0..foo3

    int foo4 = tmp[0];
    ...
With fewer, larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. The 6-cycle vector-store / integer-reload latency is longer than it would take to get all the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
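For example, with 64-bit elements the ALU route only needs one lane-splitting shuffle (a sketch; vec64 is a hypothetical __m256i holding four int64 values):

    __m128i lo = _mm256_castsi256_si128(vec64);
    __m128i hi = _mm256_extracti128_si256(vec64, 1);  // vextracti128
    long long q0 = _mm_cvtsi128_si64(lo);             // movq
    long long q1 = _mm_extract_epi64(lo, 1);          // pextrq
    long long q2 = _mm_cvtsi128_si64(hi);
    long long q3 = _mm_extract_epi64(hi, 1);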
With more smaller elements (8 or 16-bit), store/reload is definitely attractive. Extracting the first 2 to 4 elements with ALU instructions is still good. And maybe even a vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
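A sketch of that vmovd + shift/mask idea for the low four bytes (assumes vec holds packed 8-bit elements):

    unsigned d = (unsigned)_mm_cvtsi128_si32(_mm256_castsi256_si128(vec));  // vmovd
    unsigned b0 = d & 0xFF;
    unsigned b1 = (d >> 8) & 0xFF;
    unsigned b2 = (d >> 16) & 0xFF;
    unsigned b3 = d >> 24;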
Your cycle-count for the vector version is also bogus. The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel. (Skylake can run all three in a single cycle, each with 1-cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/ and other links in the x86 tag wiki for more optimization guides.