Penalty for switching from SSE to AVX?

I know about the existing penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of all ymm registers, but in my specific case on my machine (i7-3930K, 3.2 GHz), there also seems to be a penalty the other way around (SSE to AVX), even though I explicitly use _mm256_zeroupper before and after the AVX code section.

I wrote functions for converting between 32-bit floats and 32-bit fixed-point integers, operating on 2 buffers that are 32768 elements wide. I ported the SSE2 intrinsics version directly to AVX so that it processes 8 elements at a time instead of SSE's 4, expecting a significant increase in performance, but unfortunately the opposite happened.

So, I have 2 functions:

void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U << 31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale  = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale - 1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i += 8)
        {
            const __m256 vIn0      = _mm256_load_ps(inBuffer + i); // Aligned load
            const __m256 vVal0     = _mm256_mul_ps(vIn0, vScale);
            const __m256 vClamped0 = _mm256_min_ps(_mm256_max_ps(vVal0, vVolMin), vVolMax);
            const __m256i vFinal0  = _mm256_cvtps_epi32(vClamped0);
            _mm256_store_si256((__m256i*)(outBuffer + i), vFinal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale  = _mm_set1_ps(fScale);
        const __m128 vVolMax = _mm_set1_ps(fScale - 1);
        const __m128 vVolMin = _mm_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i += 4)
        {
            const __m128 vIn0      = _mm_load_ps(inBuffer + i); // Aligned load
            const __m128 vVal0     = _mm_mul_ps(vIn0, vScale);
            const __m128 vClamped0 = _mm_min_ps(_mm_max_ps(vVal0, vVolMin), vVolMax);
            const __m128i vFinal0  = _mm_cvtps_epi32(vClamped0);
            _mm_store_si128((__m128i*)(outBuffer + i), vFinal0); // Aligned store
        }
    }
}

void ConvertPcm32FixedToPcm32Float(float* outBuffer, const int32* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U << 31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(1 / fScale);
        for (uint i = 0; i < sampleCount; i += 8)
        {
            __m256i vIn0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(inBuffer + i)); // Aligned load
            __m256 vVal0 = _mm256_cvtepi32_ps(vIn0);
            vVal0 = _mm256_mul_ps(vVal0, vScale);
            _mm256_store_ps(outBuffer + i, vVal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(1 / fScale);
        for (uint i = 0; i < sampleCount; i += 4)
        {
            __m128i vIn0 = _mm_load_si128(reinterpret_cast<const __m128i*>(inBuffer + i)); // Aligned load
            __m128 vVal0 = _mm_cvtepi32_ps(vIn0);
            vVal0 = _mm_mul_ps(vVal0, vScale);
            _mm_store_ps(outBuffer + i, vVal0); // Aligned store
        }
    }
}

So, I start the timer, run ConvertPcm32FloatToPcm32Fixed, then ConvertPcm32FixedToPcm32Float to convert straight back, and stop the timer. The SSE2 versions of the functions run in a total of 15-16 microseconds, but the AVX versions take 22-23 microseconds. A little puzzled, I dug a little further and discovered how to make the AVX versions run faster than the SSE2 versions, but it feels like cheating. I simply run ConvertPcm32FloatToPcm32Fixed once before starting the timer, then start the timer, run ConvertPcm32FloatToPcm32Fixed again, then ConvertPcm32FixedToPcm32Float, and stop the timer. As if there were a massive penalty for switching from SSE to AVX: if I "prime" the AVX version with an untimed trial run first, the AVX run time drops to 12 microseconds, while doing the same with the SSE equivalents only brings the time down to about 14 microseconds, which makes AVX the marginal winner here, but only if I cheat. I thought maybe AVX doesn't play as nicely with the cache as SSE, but using _mm_prefetch didn't help either.
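For reference, here is a minimal sketch of the kind of timing harness described above. It is my own illustration, not the original test code: the buffer setup, the aligned allocation via _mm_malloc, and the use of std::chrono are assumptions, and the commented-out line shows where the untimed "warm-up" pass would go.

#include <immintrin.h>
#include <chrono>
#include <cstdio>
#include <cstdint>

typedef unsigned int uint;
typedef int32_t int32;

// Declarations matching the functions from the question.
void ConvertPcm32FloatToPcm32Fixed(int32*, const float*, uint, bool);
void ConvertPcm32FixedToPcm32Float(float*, const int32*, uint, bool);

int main()
{
    const uint sampleCount = 32768;
    // 32-byte alignment so the aligned AVX loads/stores above are valid.
    float* floatBuf = static_cast<float*>(_mm_malloc(sampleCount * sizeof(float), 32));
    int32* fixedBuf = static_cast<int32*>(_mm_malloc(sampleCount * sizeof(int32), 32));
    for (uint i = 0; i < sampleCount; ++i)
        floatBuf[i] = (float)i / sampleCount - 0.5f;

    const bool bUseAvx = true;

    // Optional "cheat": one untimed warm-up pass before starting the clock.
    // ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);

    auto t0 = std::chrono::high_resolution_clock::now();
    ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);
    ConvertPcm32FixedToPcm32Float(floatBuf, fixedBuf, sampleCount, bUseAvx);
    auto t1 = std::chrono::high_resolution_clock::now();

    printf("%lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());

    _mm_free(floatBuf);
    _mm_free(fixedBuf);
    return 0;
}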

Did I miss something?

2 answers

I have not tested your code, but since your test looks rather short, you may be seeing the floating-point warm-up effect that Agner Fog discusses on page 101 of his microarchitecture manual (this applies to the Sandy Bridge architecture). I quote:

The processor is in a cold state when it has not seen any floating point instructions for a while. The latency for 256-bit vector additions and multiplications is initially two clocks longer than the ideal number, then one clock longer, and after several hundred floating point instructions the processor goes into the warm state, where the latencies are 3 and 5 clocks, respectively. The throughput is half the ideal value for 256-bit vector operations in the cold state. 128-bit vector operations are less affected by this warm-up effect. The latency of 128-bit vector additions and multiplications is at most one clock cycle longer than the ideal value, and the throughput is not reduced in the cold state.
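If that is indeed the effect being measured, one way to take it out of the measurement is to issue a few hundred 256-bit floating point operations before starting the timer. A rough sketch, with an arbitrary iteration count and dummy accumulator of my own choosing:

#include <immintrin.h>

// Rough warm-up sketch: execute a few hundred 256-bit FP ops so the core
// leaves the "cold" state described above before the timed section begins.
static void WarmUpAvx()
{
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 mul = _mm256_set1_ps(1.000001f);
    for (int i = 0; i < 500; ++i)        // several hundred FP instructions
        acc = _mm256_mul_ps(acc, mul);
    // Force the result to be used so the compiler cannot discard the loop.
    volatile float sink = _mm256_cvtss_f32(acc);
    (void)sink;
    _mm256_zeroupper();                  // leave a clean upper-register state
}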


I found that if the compiler does not encode the SSE instructions using the VEX instruction format, i.e. it emits mulps instead of vmulps, then, as Paul R said, the hit is massive.
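To illustrate the point (this snippet and its comments are my own addition, not part of the original answer): whether a 128-bit intrinsic ends up legacy-encoded or VEX-encoded depends on the compiler settings, not on the intrinsic itself.

// Illustration only: the same intrinsic can be encoded two ways.
// Built without AVX enabled (e.g. MSVC default, or gcc/clang without -mavx):
//     _mm_mul_ps  ->  legacy-encoded  mulps   (can trigger AVX<->SSE transition stalls)
// Built with AVX enabled (MSVC /arch:AVX, gcc/clang -mavx):
//     _mm_mul_ps  ->  VEX-encoded     vmulps  (no transition penalty)
#include <immintrin.h>

__m128 Scale4(__m128 v, __m128 s)
{
    return _mm_mul_ps(v, s);   // encoding depends on the build flags above
}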

When optimizing small segments of code, I like to use this handy tool from Intel in tandem with some good benchmarks:

https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

The report generated by IACA includes notes such as the following:

"@ - the SSE instruction followed the AVX256 command, a dozen cycles are expected"


Source: https://habr.com/ru/post/1491965/

