Summary: I am not aware of any recent x86 architecture that has additional delays when using the “wrong” load instruction (that is, a load instruction followed by an ALU instruction from the opposite domain).
Here is what Agner has to say about bypass delays, which are delays that can occur when moving between different execution domains within a CPU (sometimes these are unavoidable, but sometimes they can be caused by using the “wrong” version of an instruction, which is what is being asked about here):
Data bypass delays on Nehalem. On Nehalem, the execution units are divided into five “domains”:

- The integer domain handles all operations in general purpose registers.
- The integer vector (SIMD) domain handles integer operations in vector registers.
- The FP domain handles floating point operations in XMM and x87 registers.
- The load domain handles all memory reads.
- The store domain handles all memory stores.

There is an extra latency of 1 or 2 clock cycles when the output of an operation in one domain is used as input in another domain. These so-called bypass delays are listed in table 8.2.

There is still no extra bypass delay for using load and store instructions on the wrong type of data. For example, it can be convenient to use MOVHPS on integer data for reading or writing the upper half of an XMM register.
The emphasis in the last paragraph is mine, and it is the key part: the bypass delays do not apply to the load and store instructions on Nehalem. Intuitively, this makes sense: the load and store units are dedicated to the entire core and have to make their result available to any execution unit (or store it in the PRF); unlike the ALU case, the same forwarding concerns do not exist.
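To make the two cases concrete, here is a minimal sketch in NASM syntax. It is my own illustration, not taken from Agner or from the tests below: the first pair shows an ALU result crossing domains (which can cost the 1 or 2 extra cycles quoted above), while the second shows a nominally FP load feeding an integer ALU op with no penalty. The register and memory operands are arbitrary.

    ; ALU-to-ALU across domains: can incur a bypass delay
    paddd   xmm0, xmm1        ; integer vector (SIMD) domain
    addps   xmm0, xmm2        ; FP domain consumes the SIMD-domain result:
                              ; 1-2 extra cycles on Nehalem

    ; Load feeding an ALU op: no extra delay, per the quote above
    movhps  xmm3, [rsi]       ; nominally FP load into the upper half of xmm3
    pand    xmm3, xmm1        ; integer-domain consumer, no bypass penalty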
Now, nobody really cares about Nehalem anymore, but in the sections for Sandy Bridge/Ivy Bridge, Haswell and Skylake you will find a note that the domains are as discussed for Nehalem, and that there are fewer delays overall. So it is reasonable to assume that the behavior where loads and stores do not suffer a delay based on the instruction type remains.
We can also test it directly. I wrote a benchmark like this:
    bypass_movdqa_latency:
        sub     rsp, 120
        xor     eax, eax
        pxor    xmm1, xmm1
    .top:
        movdqa  xmm0, [rsp + rax] ; load: ~6 cycles
        pand    xmm0, xmm1        ; integer ALU op: 1 cycle
        movq    rax, xmm0         ; to GP register: 2 cycles
        dec     rdi
        jnz     .top
        add     rsp, 120
        ret
This loads a value using movdqa, performs an integer-domain operation (pand) on it, and then moves it into the general purpose register rax so that it can be used as part of the address for movdqa in the next iteration. I also created three other tests identical to the one above, except with movdqa replaced by movdqu, movups and movupd.
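For reference, here is what one of the three variants would look like; this is a sketch reconstructed from the description above (only the load instruction changes), not a verbatim copy from uarch-bench:

    bypass_movups_latency:
        sub     rsp, 120
        xor     eax, eax
        pxor    xmm1, xmm1
    .top:
        movups  xmm0, [rsp + rax] ; FP-domain load instead of movdqa
        pand    xmm0, xmm1        ; same integer-domain ALU op
        movq    rax, xmm0         ; same trip back to a GP register
        dec     rdi
        jnz     .top
        add     rsp, 120
        ret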
Results on a Skylake client (i7-6700HQ with recent microcode):
    ** Running benchmark group Vector unit bypass latency **

                        Benchmark    Cycles
     movdqa [mem] -> pxor latency      9.00
     movdqu [mem] -> pxor latency      9.00
     movups [mem] -> pxor latency      9.00
     movupd [mem] -> pxor latency      9.00
In each case the round-trip latency was the same: 9 cycles, as expected: 6 + 1 + 2 cycles for the load, pand and movq respectively.
All of these tests have been added to uarch-bench in case you want to run them on any other architecture (I would be interested in the results). I used the command line:
    ./uarch-bench.sh --test-name=vector/* --timer=libpfc