SSE4 and SSE2 integer vs. float performance: which is faster?

Integer arithmetic is usually faster than floating-point arithmetic on processors, but can someone clarify how this plays out for the SIMD versions? For instance:

__m128i _mm_mul_epi32(__m128i a, __m128i b); // multiplies two integer vectors (SSE4.1)

vs

__m128 _mm_mul_ps(__m128 a, __m128 b); // multiplies two float vectors

Which gives better performance (assuming the machine has SSE4 capabilities)? I ask because I wrote my own small math library based on SSE2 instructions, and I don't know whether I should be using __m128i instead.

1 answer

Let me point you to the first place I go to answer these questions: the Intel Intrinsics Guide online. You give it an intrinsic, and it tells you what the intrinsic does and lists its latency and throughput on Nehalem through Haswell processors (and soon Broadwell). Here are the results:

_mm_mul_ps

                Latency   Reciprocal throughput
  Haswell         5         0.5
  Ivy Bridge      5         1
  Sandy Bridge    5         1
  Westmere        4         1
  Nehalem         4         1

_mm_mul_epi32

                Latency   Reciprocal throughput
  Haswell         5         1
  Ivy Bridge      3         1
  Sandy Bridge    3         1
  Westmere        3         1
  Nehalem         3         1

Lower latency and lower reciprocal throughput are better. From these tables we can conclude that:

  • except on Haswell, the latency of _mm_mul_epi32 is lower than that of _mm_mul_ps ,
  • on Haswell, the latencies are equal,
  • except on Haswell, the throughputs are equal,
  • on Haswell, _mm_mul_ps has twice the throughput of _mm_mul_epi32 .
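To see what latency versus reciprocal throughput means in practice, here is a sketch (the function names are mine, purely illustrative) contrasting a dependent multiply chain, which is bound by latency, with independent chains, which can approach the throughput limit:

```c
#include <xmmintrin.h>  /* SSE: _mm_mul_ps */

/* Dependent chain: every multiply needs the previous result, so
   n multiplies cost roughly n * latency cycles. */
static __m128 chain_dependent(__m128 x, __m128 m, int n) {
    for (int i = 0; i < n; ++i)
        x = _mm_mul_ps(x, m);
    return x;
}

/* Four independent accumulators: the CPU can overlap their multiplies,
   so n iterations cost roughly n * (reciprocal throughput) cycles
   per chain instead of paying full latency each time. */
static __m128 chain_independent(__m128 x0, __m128 x1, __m128 x2, __m128 x3,
                                __m128 m, int n) {
    for (int i = 0; i < n; ++i) {
        x0 = _mm_mul_ps(x0, m);
        x1 = _mm_mul_ps(x1, m);
        x2 = _mm_mul_ps(x2, m);
        x3 = _mm_mul_ps(x3, m);
    }
    return _mm_mul_ps(_mm_mul_ps(x0, x1), _mm_mul_ps(x2, x3));
}
```

Both compute the same kind of product; the second version simply exposes enough independent work to fill the multiplier's pipeline, which is exactly where Haswell's 0.5-cycle reciprocal throughput for _mm_mul_ps pays off.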

The throughput on Haswell is the only real surprise.

If you want results for pre-Nehalem processors and/or AMD processors, see Agner Fog's instruction tables, or run the test programs he used to measure latency and throughput.


Source: https://habr.com/ru/post/1498756/
