Synthetic blend of SSE and AVX

Question

As a follow-up to the earlier question about SSE copy, AVX copy, and std::copy: suppose we need to vectorize a loop in the following manner:

1) Vectorize the first batch of the loop (whose size is a multiple of 8) via AVX.
2) Split the remainder of the loop into two batches, and vectorize the batch whose size is a multiple of 4 via SSE.
3) Process the residual batch of the loop with a serial routine.

Consider an example of copying arrays:

    #include <immintrin.h>
    #include <algorithm>  // std::generate

    template<int length,
             int unroll_bound_avx = length & (~7),
             int unroll_tail_avx  = length - unroll_bound_avx,
             int unroll_bound_sse = unroll_tail_avx & (~3),
             int unroll_tail_last = unroll_tail_avx - unroll_bound_sse>
    void simd_copy(float *src, float *dest)
    {
        auto src_ = src;
        auto dest_ = dest;

        //Vectorize first part of loop via AVX
        for(; src_!=src+unroll_bound_avx; src_+=8, dest_+=8)
        {
            __m256 buffer = _mm256_load_ps(src_);
            _mm256_store_ps(dest_, buffer);
        }

        //Vectorize remainder part of loop via SSE
        for(; src_!=src+unroll_bound_sse+unroll_bound_avx; src_+=4, dest_+=4)
        {
            __m128 buffer = _mm_load_ps(src_);
            _mm_store_ps(dest_, buffer);
        }

        //Process residual elements
        for(; src_!=src+length; ++src_, ++dest_)
            *dest_ = *src_;
    }

    int main()
    {
        const int sz = 15;
        // 32-byte alignment is required by _mm256_load_ps / _mm256_store_ps
        float *src = (float *)_mm_malloc(sz*sizeof(float), 32);
        float *dest = (float *)_mm_malloc(sz*sizeof(float), 32);
        float a=0;
        std::generate(src, src+sz, [&](){return ++a;});

        simd_copy<sz>(src, dest);

        _mm_free(src);
        _mm_free(dest);
    }
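For the length used in main (sz = 15), the default template arguments work out to 8 elements for AVX, 4 for SSE, and 3 for the scalar loop. The sketch below recomputes that split at compile time. The names BatchSplit and split_batches are hypothetical and only serve the illustration; they are not part of the code above, and a non-negative length is assumed.

```cpp
// Hypothetical helper (not part of the question's code): it recomputes the
// batch sizes that simd_copy derives from its default template arguments.
// Assumes length >= 0.
struct BatchSplit {
    int avx;     // elements copied 8 at a time
    int sse;     // elements copied 4 at a time
    int scalar;  // elements copied one by one
};

constexpr BatchSplit split_batches(int length) {
    return BatchSplit{
        length & ~7,                    // largest multiple of 8 not above length
        (length - (length & ~7)) & ~3,  // largest multiple of 4 in what is left
        (length - (length & ~7)) & 3    // at most 3 leftover elements
    };
}

// length = 15 splits into 8 (AVX) + 4 (SSE) + 3 (scalar).
static_assert(split_batches(15).avx == 8, "AVX batch");
static_assert(split_batches(15).sse == 4, "SSE batch");
static_assert(split_batches(15).scalar == 3, "scalar tail");
```

The three parts always add up to length, so every element is copied exactly once.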

Is it correct to use both SSE and AVX together? Do I need to avoid AVX-SSE transitions?

c++ performance sse avx simd

gorill Aug 19 '13 at 17:20

2 answers

I humbly beg to differ: I would advise you to try not to mix SSE and AVX. Please read the link Mysticial posted; it warns against such a mix (although it does not stress it strongly enough). The question there is about maintaining different code paths for different machines according to their AVX support, so there is no mixing. In your case the mixing is very fine-grained and would be destructive (you would incur internal delays due to the micro-architectural implementation).

To clarify: Mysticial is right about the VEX prefix at compile time. Without it you would be in pretty bad shape, because you would incur the SSE-to-AVX assist every time, since the upper halves of your YMM registers cannot be ignored (unless you explicitly use vzeroupper). However, there are more subtle effects even when mixing 128-bit AVX with 256-bit AVX.

I also don't see the benefit of using SSE here. If you have a long loop (say N > 100), you get the benefit of AVX for most of it and can do the remainder in scalar code, for up to 7 iterations (your code may still have to do 3 of them anyway). That performance loss is nothing compared to the cost of mixing AVX and SSE.
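A minimal sketch of that suggestion, under some assumptions: the function name copy_avx_scalar_tail is hypothetical, unaligned loads and stores are used so that any pointer works, and the AVX loop is only compiled when the compiler defines __AVX__ (otherwise the scalar loop handles everything).

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical name; a sketch of "AVX for the bulk, scalar for the rest".
void copy_avx_scalar_tail(const float* src, float* dest, std::size_t n) {
    assert(src != nullptr && dest != nullptr);
    std::size_t i = 0;
#if defined(__AVX__)
    // Bulk: 8 floats per iteration. Unaligned load/store is an assumption
    // made here so the sketch does not depend on how the buffers were allocated.
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dest + i, v);
    }
#endif
    // Tail: at most 7 elements when the AVX loop ran, all n elements otherwise.
    for (; i < n; ++i) {
        dest[i] = src[i];
    }
}
```

With n = 15 and AVX enabled, one AVX iteration copies 8 elements and the scalar loop copies the remaining 7; no 128-bit instructions are involved.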

More information about mixing the two: http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

Leeor Aug 19 '13 at 19:39

Mysticial · Accepted Answer · 2013-08-19T18:29:20+0000

It is fine to mix SSE and AVX intrinsics.

The only thing you need to make sure of is that you specify the correct compiler flag to enable AVX.

GCC: -mavx
Visual Studio: /arch:AVX

Failing to do so will either result in the code not compiling (GCC) or, in the case of Visual Studio, in this kind of crap:

Using AVX CPU instructions: Poor performance without "/arch:AVX"

What the flag does is force all SIMD instructions to use VEX encoding, which avoids the state-switching penalties described in the question linked above.
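One way to check that the flag actually reached the compiler is to test the predefined __AVX__ macro, which GCC defines under -mavx and Visual Studio defines under /arch:AVX. The sketch below is an illustration only; kBuiltWithAvx and simd_build_mode are hypothetical names.

```cpp
// Sketch: record at compile time whether the AVX flag reached the compiler.
// __AVX__ is predefined by GCC under -mavx and by Visual Studio under /arch:AVX.
#if defined(__AVX__)
constexpr bool kBuiltWithAvx = true;
#else
constexpr bool kBuiltWithAvx = false;
#endif

// Hypothetical helper, e.g. for a startup log line.
constexpr const char* simd_build_mode() {
    return kBuiltWithAvx ? "AVX enabled" : "AVX not enabled";
}
```

A stricter variant would replace the #else branch with an #error directive, so that a build without the flag fails immediately instead of silently producing non-VEX code.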
