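Well, you can just store the result to memory and then add the scalars. With the mask 0xff, each 128-bit half of the result holds the dot product of its own four elements, so you need one element from each half: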
float v[8];
_mm256_storeu_ps(v, _mm256_dp_ps(a, b, 0xff));
float result = v[0] + v[4];
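For reference, here is the store-and-add route as a small self-contained function. This is only a sketch: the name `dot8_store` and the pointer-based signature are made up for illustration, and the `target("avx")` attribute assumes GCC or Clang (compiling with `-mavx` would do the same job).

```c
#include <immintrin.h>

/* Hypothetical helper: dot product of two 8-float arrays.
   target("avx") is a GCC/Clang attribute (an assumption here);
   with other compilers, enable AVX through the build flags instead. */
__attribute__((target("avx")))
float dot8_store(const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);   /* unaligned loads, no alignment required */
    __m256 vb = _mm256_loadu_ps(b);

    float v[8];
    /* 0xff: multiply all four pairs in each 128-bit half and write the sum
       to all four slots of that half. */
    _mm256_storeu_ps(v, _mm256_dp_ps(va, vb, 0xff));

    /* v[0..3] hold the low-half sum, v[4..7] the high-half sum. */
    return v[0] + v[4];
}
```

The unaligned store is used because a plain `float v[8]` on the stack is not guaranteed to be 32-byte aligned.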
You can also swap the upper and lower halves of the 256-bit register and add, for example:
__m256 c = _mm256_dp_ps(a, b, 0xff);
__m256 d = _mm256_permute2f128_ps(c, c, 1);
__m256 result = _mm256_add_ps(c, d);
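If you want a scalar at the end of the swap-and-add version, you can pull out the lowest element. A minimal sketch, where `dot8_swap` and its signature are hypothetical and the `target("avx")` attribute again assumes GCC or Clang:

```c
#include <immintrin.h>

/* Hypothetical helper: dot product of two 8-float arrays, reduced in registers. */
__attribute__((target("avx")))
float dot8_swap(const float *a, const float *b)
{
    /* Per-half dot products, broadcast within each 128-bit half. */
    __m256 c = _mm256_dp_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), 0xff);

    /* Control value 1 swaps the two 128-bit halves of c. */
    __m256 d = _mm256_permute2f128_ps(c, c, 1);

    /* Every element of sum now holds the full 8-element dot product. */
    __m256 sum = _mm256_add_ps(c, d);

    /* Extract element 0 as a scalar. */
    return _mm_cvtss_f32(_mm256_castps256_ps128(sum));
}
```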
Probably much faster than either of these options is to compute 4 or 8 dot products at the same time and combine them. Sketch:
d0 = _mm256_dp_ps(a[0], b[0], 0xff);
d1 = _mm256_dp_ps(a[1], b[1], 0xff);
d2 = _mm256_dp_ps(a[2], b[2], 0xff);
d3 = _mm256_dp_ps(a[3], b[3], 0xff);
d01 = _mm256_shuffle_ps(d0, d1, ...);   /* two-source shuffle; _mm256_permute_ps takes only one vector */
d23 = _mm256_shuffle_ps(d2, d3, ...);
d0123 = _mm256_shuffle_ps(d01, d23, ...);
d0123upper = _mm256_permute2f128_ps(d0123, d0123, 1);
d = _mm256_add_ps(d0123upper, d0123);
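One way to fill in the elided shuffle constants is shown below for the four-at-a-time case. Treat it as a sketch under assumptions, not the only possible choice: the function name `dot4x8`, the flat array layout (vector i starts at offset 8*i), and the particular shuffle controls are my own picks, and `_mm256_blend_ps` would work equally well for the combining step. The `target("avx")` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = dot(a + 8*i, b + 8*i) for i = 0..3. */
__attribute__((target("avx")))
void dot4x8(const float *a, const float *b, float out[4])
{
    /* Each dN holds, per 128-bit half, the partial dot product of that half
       in all four slots. */
    __m256 d0 = _mm256_dp_ps(_mm256_loadu_ps(a),      _mm256_loadu_ps(b),      0xff);
    __m256 d1 = _mm256_dp_ps(_mm256_loadu_ps(a + 8),  _mm256_loadu_ps(b + 8),  0xff);
    __m256 d2 = _mm256_dp_ps(_mm256_loadu_ps(a + 16), _mm256_loadu_ps(b + 16), 0xff);
    __m256 d3 = _mm256_dp_ps(_mm256_loadu_ps(a + 24), _mm256_loadu_ps(b + 24), 0xff);

    /* shuffle_ps takes slots 0-1 from the first operand and slots 2-3 from
       the second, within each half: d01 = [d0 d0 d1 d1], d23 = [d2 d2 d3 d3]. */
    __m256 d01 = _mm256_shuffle_ps(d0, d1, 0);
    __m256 d23 = _mm256_shuffle_ps(d2, d3, 0);

    /* Pick slots 0 and 2 of each input: [d0 d1 d2 d3] in each half. */
    __m256 d0123 = _mm256_shuffle_ps(d01, d23, _MM_SHUFFLE(2, 0, 2, 0));

    /* Add the high-half partial sums to the low-half partial sums. */
    __m256 d0123upper = _mm256_permute2f128_ps(d0123, d0123, 1);
    __m256 d = _mm256_add_ps(d0123upper, d0123);

    /* The low 128 bits now hold the four complete dot products. */
    _mm_storeu_ps(out, _mm256_castps256_ps128(d));
}
```

The point of batching is that the half swap and the final add are paid once for four results instead of once per dot product.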