SSE takes much more time on AMD than on Intel

I am optimizing an algorithm using SSE2 instructions, but I ran into the following problem when testing performance:

I) Intel E6750

  • Running the non-SSE2 algorithm 4 times takes 14.85 seconds
  • Running the SSE2 algorithm once (processing the same data) takes 6.89 seconds

II) Phenom II X4 2.8 GHz

  • Running the non-SSE2 algorithm 4 times takes 11.43 seconds
  • Running the SSE2 algorithm once (processing the same data) takes 12.15 seconds

Can anyone help me understand why this is happening? I am really confused by the results.

In both cases, I compile with g++ using the -O3 flag.

PS: The algorithm does not use floating-point math; it uses integer SSE instructions.

1 answer

Intel has made big improvements to its SSE implementation over the past 5 years or so, and AMD has not really kept up. Originally, both vendors had only 64-bit execution units, and each 128-bit operation was split into 2 micro-ops. Since the introduction of Core and Core 2, Intel processors have had a full 128-bit SSE implementation, which means 128-bit operations effectively got a 2x throughput boost (1 micro-op versus 2). More recent Intel processors also have several SSE execution units, which means you can get more than 1 instruction per clock for 128-bit SIMD instructions.
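To see how a given CPU handles 128-bit integer operations, you can time a scalar loop against an SSE2 loop over the same data. The following is only a rough sketch, not the asker's algorithm: the function names `sum_scalar`, `sum_sse2` and `seconds_for` are made up for illustration, and a simple wrapping sum of 32-bit integers stands in for the real workload.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference (hypothetical name): wrapping sum of 32-bit unsigned integers.
std::uint32_t sum_scalar(const std::uint32_t* data, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// SSE2 version (hypothetical name): each 128-bit add handles four 32-bit lanes.
std::uint32_t sum_sse2(const std::uint32_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned load, so the input needs no special alignment.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi32(acc, v);
    }
    // Horizontal sum of the four lanes, then the scalar tail.
    std::uint32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    std::uint32_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) total += data[i];
    return total;
}

// Timing helper (hypothetical name): wall-clock seconds for `repeats` calls of f.
// The volatile sink keeps the results from being discarded outright.
template <typename F>
double seconds_for(F&& f, int repeats) {
    volatile std::uint32_t sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) sink = sink + f();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Call `seconds_for` with a lambda that returns `sum_scalar(...)` or `sum_sse2(...)` on a large buffer and compare the two times on each machine. Keep in mind that g++ at -O3 may auto-vectorize the scalar loop on its own; building the scalar version with `-fno-tree-vectorize` should give a fairer baseline.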


Source: https://habr.com/ru/post/890867/

