I found that some SSE-optimized code uses a combination of movlps and movhps instead of a single movups instruction to load unaligned data for computation. I didn't know why, so I tried it myself; here is the pseudocode:
    struct Vec4 { float f[4]; };

    const size_t nSize = sizeof(Vec4) * 100;
    Vec4* pA = (Vec4*)malloc( nSize );
    Vec4* pB = (Vec4*)malloc( nSize );
    Vec4* pR = (Vec4*)malloc( nSize );

    // ...some data initialization code here
    // ...record the current time with QueryPerformanceCounter()

    for( int i = 0; i < 100000; ++i )
    {
        for( int j = 0; j < 100; ++j )
        {
            Vec4* a = &pA[j];
            Vec4* b = &pB[j];
            Vec4* r = &pR[j];

            __asm
            {
                mov eax, a
                mov ecx, b
                mov edx, r

                // ...option 1:
                movups xmm0, [eax]
                movups xmm1, [ecx]
                mulps  xmm0, xmm1
                movups [edx], xmm0

                // ...option 2:
                movlps xmm0, [eax]
                movhps xmm0, [eax+8]
                movlps xmm1, [ecx]
                movhps xmm1, [ecx+8]
                mulps  xmm0, xmm1
                movlps [edx], xmm0
                movhps [edx+8], xmm0
            }
        }
    }

    // ...calculate the elapsed time

    free( pA );
    free( pB );
    free( pR );
I ran the code many times and calculated their average time.
For the movups version, the result is about 50 ms.
For the movlps/movhps version, the result is about 46 ms.
I also tried a data-aligned version, adding __declspec(align(16)) to the struct and allocating with _aligned_malloc(); the result is about 34 ms.
Why is the movlps/movhps combination faster? Does this mean it is better to use movlps and movhps instead of movups?