I know about the existing penalty for switching from AVX instructions to SSE instructions without first zeroing the upper halves of all ymm registers, but in my specific case on my machine (i7-3939K 3.2 GHz), there seems to be a very large penalty in the other direction (SSE to AVX), even if I explicitly call _mm256_zeroupper before and after the AVX code section.
I wrote functions for converting between 32-bit floats and 32-bit fixed-point integers, operating on 2 buffers of 32768 elements each. I ported an SSE2 intrinsics version directly to AVX to process 8 elements at a time instead of SSE's 4, expecting a significant increase in performance, but unfortunately the opposite happened.
So, I have 2 functions:
void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);
    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale-1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i+=8)
        {
            const __m256 vIn0 = _mm256_load_ps(inBuffer+i);
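For checking the SIMD output, a plain scalar version of the same scale-and-clamp conversion can serve as a reference. The following is only a sketch: the names ConvertFloatToFixedScalar and ConvertFixedToFloatScalar are hypothetical, and it assumes round-to-nearest and clamping to the int32 range, which may not match the intrinsics code in every edge case.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: scale by 2^31, clamp to the int32 range, round to nearest.
void ConvertFloatToFixedScalar(std::int32_t* out, const float* in, std::size_t count)
{
    const double scale = 2147483648.0; // 2^31, same value as (float)(1U<<31)
    for (std::size_t i = 0; i < count; ++i)
    {
        double v = static_cast<double>(in[i]) * scale;
        if (v > 2147483647.0)  v = 2147483647.0;   // clamp before converting to avoid overflow
        if (v < -2147483648.0) v = -2147483648.0;
        out[i] = static_cast<std::int32_t>(std::lrint(v));
    }
}

// Hypothetical inverse: scale back by 1 / 2^31.
void ConvertFixedToFloatScalar(float* out, const std::int32_t* in, std::size_t count)
{
    const float invScale = 1.0f / 2147483648.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        out[i] = static_cast<float>(in[i]) * invScale;
    }
}
```

The clamp is done in double because the float value fScale-1 rounds back up to 2^31, which does not fit in a signed 32-bit integer.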
So, I start the timer, run ConvertPcm32FloatToPcm32Fixed, then ConvertPcm32FixedToPcm32Float to convert straight back, and stop the timer. The SSE2 versions of the functions take a total of 15-16 microseconds, but the AVX versions take 22-23 microseconds. A little puzzled, I dug a bit further and discovered how to speed up the AVX versions so that they run faster than the SSE2 versions, but it is a cheat. I simply run ConvertPcm32FloatToPcm32Fixed once before starting the timer, then start the timer and run ConvertPcm32FloatToPcm32Fixed again, then ConvertPcm32FixedToPcm32Float, and stop the timer. As though there were a massive SSE-to-AVX penalty, if I "prime" the AVX version first with a trial run, the AVX time drops to 12 microseconds, while doing the same with the SSE equivalents only brings the time down slightly, to 14 microseconds, which makes AVX a marginal winner here, but only if I cheat. I considered that AVX might not play as nicely with the cache as SSE, but using _mm_prefetch didn't help either.
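The timing procedure described above (an optional untimed trial run, then the timed run) can be sketched as a small helper. This is a minimal illustration, not the harness actually used here: TimeMicroseconds and its parameters are hypothetical names, and it assumes std::chrono::steady_clock has enough resolution for runs in the 10-20 microsecond range.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: calls fn warmupRuns times untimed, then timedRuns times
// timed, and returns the fastest timed run in microseconds.
template <typename Fn>
double TimeMicroseconds(Fn&& fn, int warmupRuns, int timedRuns)
{
    for (int i = 0; i < warmupRuns; ++i)
    {
        fn(); // untimed "priming" run
    }

    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < timedRuns; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const double us = std::chrono::duration<double, std::micro>(stop - start).count();
        best = std::min(best, us);
    }
    return best;
}
```

Passing a lambda that calls ConvertPcm32FloatToPcm32Fixed followed by ConvertPcm32FixedToPcm32Float, once with warmupRuns set to 0 and once with it set to 1, would reproduce the two measurements compared above.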
Did I miss something?