Summary: I am not aware of any recent x86 architecture that has additional delays when using the “wrong” load instruction (that is, a load instruction followed by an ALU instruction from the opposite domain).
Here is what Agner has to say about bypass delays, which are delays that can occur when moving between different execution domains within a CPU (sometimes these are unavoidable, but sometimes they can be caused by using the “wrong” version of an instruction, which is what is being asked about here):
Data bypass delays on Nehalem. On Nehalem, the execution units are divided into five “domains”:

- The integer domain handles all operations in general purpose registers.
- The integer vector (SIMD) domain handles integer operations in vector registers.
- The FP domain handles floating point operations in XMM and x87 registers.
- The load domain handles all memory reads.
- The store domain handles all memory stores.

There is an extra latency of 1 or 2 clock cycles when the output of an operation in one domain is used as input in another domain. These so-called bypass delays are listed in table 8.2.

There is still no extra bypass delay for using load and store instructions on the wrong type of data. For example, it can be convenient to use MOVHPS on integer data for reading or writing the upper half of an XMM register.
The emphasis in the last paragraph is mine, and it is the key part: the bypass delays do not apply to the load and store instructions on Nehalem. Intuitively, this makes sense: the load and store units are dedicated to the entire core and have to make their result available to any execution unit (or store it in the PRF); unlike the ALU case, the same forwarding concerns do not exist.
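To make the two cases concrete, here is a minimal sketch in NASM syntax. It is my own illustration, not taken from Agner or from the tests below: the first pair shows an ALU result crossing domains (which can cost the 1 or 2 extra cycles quoted above), while the second shows a nominally FP load feeding an integer ALU op with no penalty. The register and memory operands are arbitrary.

    ; ALU-to-ALU across domains: can incur a bypass delay
    paddd   xmm0, xmm1        ; integer vector (SIMD) domain
    addps   xmm0, xmm2        ; FP domain consumes the SIMD-domain result:
                              ; 1-2 extra cycles on Nehalem

    ; Load feeding an ALU op: no extra delay, per the quote above
    movhps  xmm3, [rsi]       ; nominally FP load into the upper half of xmm3
    pand    xmm3, xmm1        ; integer-domain consumer, no bypass penalty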
Now, nobody really cares about Nehalem anymore, but in the sections for Sandy Bridge/Ivy Bridge, Haswell and Skylake you will find a note that the domains are as discussed for Nehalem, and that there are fewer delays overall. So it is reasonable to assume that the behavior where loads and stores do not suffer a delay based on the instruction type remains.
We can also test it directly. I wrote a benchmark like this:
    bypass_movdqa_latency:
        sub     rsp, 120
        xor     eax, eax
        pxor    xmm1, xmm1
    .top:
        movdqa  xmm0, [rsp + rax] ; load: ~6 cycles
        pand    xmm0, xmm1        ; integer ALU op: 1 cycle
        movq    rax, xmm0         ; to GP register: 2 cycles
        dec     rdi
        jnz     .top
        add     rsp, 120
        ret
This loads a value using movdqa, performs an integer-domain operation (pand) on it, and then moves it into the general purpose register rax so that it can be used as part of the address for movdqa in the next iteration. I also created three other tests identical to the one above, except with movdqa replaced by movdqu, movups and movupd.
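For reference, here is what one of the three variants would look like; this is a sketch reconstructed from the description above (only the load instruction changes), not a verbatim copy from uarch-bench:

    bypass_movups_latency:
        sub     rsp, 120
        xor     eax, eax
        pxor    xmm1, xmm1
    .top:
        movups  xmm0, [rsp + rax] ; FP-domain load instead of movdqa
        pand    xmm0, xmm1        ; same integer-domain ALU op
        movq    rax, xmm0         ; same trip back to a GP register
        dec     rdi
        jnz     .top
        add     rsp, 120
        ret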
Results on a Skylake client (i7-6700HQ with recent microcode):
    ** Running benchmark group Vector unit bypass latency **

                        Benchmark    Cycles
     movdqa [mem] -> pxor latency      9.00
     movdqu [mem] -> pxor latency      9.00
     movups [mem] -> pxor latency      9.00
     movupd [mem] -> pxor latency      9.00
In each case the round-trip latency was the same: 9 cycles, as expected: 6 + 1 + 2 cycles for the load, pand and movq respectively.
All of these tests have been added to uarch-bench in case you want to run them on any other architecture (I would be interested in the results). I used the command line:
    ./uarch-bench.sh --test-name=vector/* --timer=libpfc