How to access the components of a 256-bit ps vector

How can I efficiently access the elements of a 256-bit vector? For example, I computed a dot product using

c = _mm256_dp_ps(a, b, 0xff); 

How do I access the value in c? I need both the high part and the low part. Do I understand correctly that I first need to extract the 128-bit halves as follows:

 r0 = _mm256_extractf128_ps(c, 0);
 r1 = _mm256_extractf128_ps(c, 1);

And only then extract the floats:

 _MM_EXTRACT_FLOAT(fr0, r0, 0);
 _MM_EXTRACT_FLOAT(fr1, r1, 0);
 return fr0 + fr1;
2 answers

There is no efficient way to do this. The dp_ps operation itself is slow, and the subsequent extraction is slow as well. If you cannot process more data per batch, it is faster to compute the dot product with SSE4 instructions on 128-bit vectors than to work with 256-bit vectors.
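One possible reading of this advice, as a minimal sketch: split the 8 floats into two 128-bit halves and use the SSE4.1 `_mm_dp_ps` on each. The function name `dot8_sse41` and the `USE_SSE41` macro are hypothetical, and the `target` attribute assumes GCC or Clang (other compilers would need SSE4.1 enabled through their own flags).

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables SSE4.1 for this function only,
   so the file also builds without -msse4.1. */
#if defined(__GNUC__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

/* Hypothetical helper: dot product of two arrays of 8 floats. */
static USE_SSE41 float dot8_sse41(const float *x, const float *y)
{
    /* Mask 0xf1: multiply all four element pairs, write the sum to element 0. */
    __m128 lo = _mm_dp_ps(_mm_loadu_ps(x),     _mm_loadu_ps(y),     0xf1);
    __m128 hi = _mm_dp_ps(_mm_loadu_ps(x + 4), _mm_loadu_ps(y + 4), 0xf1);

    /* Add the two partial sums and return element 0 as a scalar. */
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}
```

Here the scalar comes out with `_mm_cvtss_f32`, so no 256-bit register has to be split at all.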


Well, you can just store in memory and then work with scalars:

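 float v[8];
 _mm256_storeu_ps(v, _mm256_dp_ps(a, b, 0xff)); // unaligned store: v is not guaranteed to be 32-byte aligned
 float result = v[0] + v[4];                    // one partial sum per 128-bit half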

You can also swap the upper and lower halves of the 256-bit register and add, for example:

 __m256 c = _mm256_dp_ps(a, b, 0xff);
 __m256 d = _mm256_permute2f128_ps(c, c, 1); // swap the 128-bit halves
 __m256 result = _mm256_add_ps(c, d);
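After the swap-and-add, every element of the register holds the full dot product, so the scalar can be read from the low element without a second extract. A minimal sketch of the whole path follows; the function name `dot8_swap_add` and the `USE_AVX` macro are hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables AVX for this function only,
   so the file also builds without -mavx. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Hypothetical helper: dot product of two arrays of 8 floats. */
static USE_AVX float dot8_swap_add(const float *x, const float *y)
{
    /* Each 128-bit half of c holds the dot product of its own 4 elements. */
    __m256 c = _mm256_dp_ps(_mm256_loadu_ps(x), _mm256_loadu_ps(y), 0xff);

    /* Swap the halves and add: every element now holds the full sum. */
    __m256 d = _mm256_permute2f128_ps(c, c, 1);
    __m256 sum = _mm256_add_ps(c, d);

    /* The cast to __m128 is free; _mm_cvtss_f32 reads element 0. */
    return _mm_cvtss_f32(_mm256_castps256_ps128(sum));
}
```

`_mm256_castps256_ps128` only reinterprets the low half and generates no instruction, which is why this is likely cheaper than extracting both halves.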

Probably much faster than either of these options is to compute 4 to 8 dot products at the same time and combine them. Sketch:

 d0 = _mm256_dp_ps(a[0], b[0], 0xff);
 d1 = _mm256_dp_ps(a[1], b[1], 0xff);
 d2 = _mm256_dp_ps(a[2], b[2], 0xff);
 d3 = _mm256_dp_ps(a[3], b[3], 0xff);
 // _mm256_permute_ps takes a single vector, so a two-input blend is assumed here
 d01 = _mm256_blend_ps(d0, d1, ...);
 d23 = _mm256_blend_ps(d2, d3, ...);
 d0123 = _mm256_blend_ps(d01, d23, ...);
 d0123upper = _mm256_permute2f128_ps(d0123, d0123, 1);
 d = _mm256_add_ps(d0123upper, d0123);
 // lower 128 bits contain the results of 4 8-wide dot products
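The sketch leaves the combining step open. One way to fill it in, offered as an assumption rather than the answer's exact intent, is `_mm256_blend_ps` with fixed masks: because mask `0xff` makes `_mm256_dp_ps` broadcast each partial sum across its 128-bit half, a blend is enough to pick one element per product. The function name `dot8_x4`, the `USE_AVX` macro, and the back-to-back array layout are hypothetical; the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute enables AVX for this function only. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Hypothetical helper: a and b each hold 4 vectors of 8 floats, stored
   back to back (32 floats). out[k] receives the dot product of vector k. */
static USE_AVX void dot8_x4(const float *a, const float *b, float out[4])
{
    __m256 d0 = _mm256_dp_ps(_mm256_loadu_ps(a +  0), _mm256_loadu_ps(b +  0), 0xff);
    __m256 d1 = _mm256_dp_ps(_mm256_loadu_ps(a +  8), _mm256_loadu_ps(b +  8), 0xff);
    __m256 d2 = _mm256_dp_ps(_mm256_loadu_ps(a + 16), _mm256_loadu_ps(b + 16), 0xff);
    __m256 d3 = _mm256_dp_ps(_mm256_loadu_ps(a + 24), _mm256_loadu_ps(b + 24), 0xff);

    /* 0xaa takes odd elements from the second operand:  [d0 d1 d0 d1 | ...] */
    __m256 d01 = _mm256_blend_ps(d0, d1, 0xaa);
    __m256 d23 = _mm256_blend_ps(d2, d3, 0xaa);

    /* 0xcc takes elements 2,3 (and 6,7) from d23:        [d0 d1 d2 d3 | ...] */
    __m256 d0123 = _mm256_blend_ps(d01, d23, 0xcc);

    /* Add the upper-half partial sums to the lower-half ones. */
    __m256 upper = _mm256_permute2f128_ps(d0123, d0123, 1);
    __m256 d = _mm256_add_ps(upper, d0123);

    /* The lower 128 bits now hold the four 8-wide dot products. */
    _mm_storeu_ps(out, _mm256_castps256_ps128(d));
}
```

The cross-lane permute and the final add are paid once for four results instead of once per dot product, which is where the expected gain over the single-product versions comes from.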

Source: https://habr.com/ru/post/1441128/

