Why are the SSE instructions movlps and movhps faster than a single movups for transferring misaligned data?

I found that some SSE-optimized code uses a combination of movlps and movhps instead of a single movups instruction to transfer misaligned data for math calculations. I didn't know why, so I tried it myself; the pseudocode is below:

    struct Vec4 { float f[4]; };

    const size_t nSize = sizeof(Vec4) * 100;
    Vec4* pA = (Vec4*)malloc( nSize );
    Vec4* pB = (Vec4*)malloc( nSize );
    Vec4* pR = (Vec4*)malloc( nSize );

    // ... some data initialization code here
    // ... record the current time with QueryPerformanceCounter()

    for( int i = 0; i < 100000; ++i )   // comma in the original loop condition fixed to a semicolon
    {
        for( int j = 0; j < 100; ++j )
        {
            Vec4* a = &pA[j];   // indexed by j; indexing by i would run past the 100-element arrays
            Vec4* b = &pB[j];
            Vec4* r = &pR[j];
            __asm
            {
                mov eax, a
                mov ecx, b
                mov edx, r

                // ... option 1:
                movups xmm0, [eax]
                movups xmm1, [ecx]
                mulps  xmm0, xmm1
                movups [edx], xmm0

                // ... option 2:
                movlps xmm0, [eax]
                movhps xmm0, [eax+8]
                movlps xmm1, [ecx]
                movhps xmm1, [ecx+8]
                mulps  xmm0, xmm1
                movlps [edx], xmm0
                movhps [edx+8], xmm0
            }
        }
    }

    // ... calculate the elapsed time

    free( pA );
    free( pB );
    free( pR );

I ran the code many times and averaged the timings.

For the movups version, the result is about 50 ms.

For the movlps/movhps version, the result is about 46 ms.

I also tried a data-aligned version, adding the __declspec(align(16)) specifier to the struct and allocating with _aligned_malloc(); the result is about 34 ms.
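For reference, a minimal sketch of what that aligned variant looks like under MSVC; the post elides the exact details, so the specifics here are my assumption:

    #include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)

    // 16-byte alignment lets aligned 128-bit loads/stores be used
    // instead of unaligned ones.
    __declspec(align(16)) struct Vec4A { float f[4]; };

    const size_t nSize = sizeof(Vec4A) * 100;
    Vec4A* pA = (Vec4A*)_aligned_malloc( nSize, 16 );
    // ... use as before, then release with the matching free:
    _aligned_free( pA );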

Why is the combination of movlps and movhps faster? Does this mean that it is better to use movlps and movhps instead of movups?

2 answers

Athlons of this generation (K8) have only 64-bit SSE execution units. Therefore, each 128-bit SSE instruction must be split into two 64-bit operations, which adds overhead for some instructions.

On this type of processor you will, as a rule, see no speedup from SSE compared to equivalent MMX code.

Quoting Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers:

12.9 64-bit versus 128-bit instructions

It is a big advantage to use 128-bit instructions on K10, but not on K8, because each 128-bit instruction is split into two 64-bit macro-operations on K8.

128-bit memory write instructions are handled as two 64-bit macro-operations on K10, while a 128-bit memory read is done with a single macro-operation on K10 (two on K8).

128-bit memory read instructions use only the FMISC unit on K8, but all three units on K10. It is therefore not advantageous to use XMM registers just for moving blocks of data from one memory position to another on K8, but it is advantageous on K10.
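To illustrate, here is a sketch of the question's two options expressed with SSE intrinsics instead of inline asm: _mm_loadl_pi/_mm_loadh_pi compile to movlps/movhps, and _mm_loadu_ps to movups. The function names are mine, chosen just for this example:

    #include <xmmintrin.h>  // SSE intrinsics

    // Option 1: one unaligned 128-bit load per operand (movups).
    // On K8 each 128-bit operation is split into two 64-bit macro-operations.
    void mul_movups( const float* a, const float* b, float* r )
    {
        __m128 va = _mm_loadu_ps( a );
        __m128 vb = _mm_loadu_ps( b );
        _mm_storeu_ps( r, _mm_mul_ps( va, vb ) );
    }

    // Option 2: two 64-bit loads per operand (movlps + movhps),
    // each of which is already a single 64-bit operation.
    void mul_movlhps( const float* a, const float* b, float* r )
    {
        __m128 va = _mm_loadh_pi( _mm_loadl_pi( _mm_setzero_ps(), (const __m64*)a ),
                                  (const __m64*)(a + 2) );
        __m128 vb = _mm_loadh_pi( _mm_loadl_pi( _mm_setzero_ps(), (const __m64*)b ),
                                  (const __m64*)(b + 2) );
        __m128 vr = _mm_mul_ps( va, vb );
        _mm_storel_pi( (__m64*)r,       vr );
        _mm_storeh_pi( (__m64*)(r + 2), vr );
    }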


movups works with unaligned data. movlps and movhps work only with aligned data, so of course movlps/movhps are faster. For timing and comparison it is better to use rdtsc rather than milliseconds.
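A minimal sketch of cycle-based timing with rdtsc, assuming MSVC's __rdtsc intrinsic from <intrin.h>; the measured region is just a placeholder:

    #include <intrin.h>   // __rdtsc (MSVC)
    #include <cstdio>

    int main()
    {
        unsigned __int64 t0 = __rdtsc();   // timestamp counter before the measured code

        // ... the code under test goes here ...

        unsigned __int64 t1 = __rdtsc();   // timestamp counter after
        printf( "elapsed: %llu cycles\n", t1 - t0 );
        return 0;
    }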


Source: https://habr.com/ru/post/1447851/

