Can there be any penalties when using 64/32-bit registers in Long mode?

Question

This is probably a matter of nano-optimization rather than micro-optimization, but the subject interests me: are there any penalties for using non-native register sizes in long mode?

I learned from various sources that partial register updates (e.g. ax instead of eax) can cause an eflags stall and degrade performance. But I'm not sure about long mode. What register size is considered native for this processor mode? x86-64 is still an extension of the x86 architecture, so I believe 32 bits is still native. Or am I wrong?

For example, instructions like

 sub eax, r14d

or

 sub rax, r14

are the same size, but can there be any penalty for using either of them? Can there be any penalties for mixing register sizes in consecutive instructions like the ones below? (Assume the high dword is zero in all cases.)

    sub ecx, eax
    sub r14, rax

optimization assembly x86 micro-optimization

Alexander Zhak Oct 19 '16 at 21:21

1 answer

Peter Cordes · Accepted Answer · 2016-10-19T22:27:08+0000

Can there be any penalties when mixing 32 and 64-bit registers in sequential instructions?

No. Writing a 32-bit register always zero-extends into the full 64-bit register, so x86-64 avoids any partial-register penalties for 32- and 64-bit instructions.
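To make the zero-extension rule concrete, here is a minimal C sketch that models the architectural effect of writing one register through its 32-, 16- and 8-bit names. The helper names (`write32`, `write16`, `write8`, `check_register_model`) are hypothetical and only illustrate the semantics; they say nothing about how a CPU implements renaming internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit general-purpose register (e.g. RAX).
   Each helper returns the new full 64-bit value after a write. */

/* Write through the 32-bit name (e.g. EAX): the upper 32 bits become zero,
   so the result does not depend on the old contents at all. */
static uint64_t write32(uint64_t old, uint32_t value)
{
    (void)old; /* old value is deliberately ignored */
    return (uint64_t)value;
}

/* Write through the 16-bit name (e.g. AX): the upper 48 bits are preserved,
   so the result depends on the old contents (a merge is required). */
static uint64_t write16(uint64_t old, uint16_t value)
{
    return (old & ~UINT64_C(0xFFFF)) | value;
}

/* Write through the low 8-bit name (e.g. AL): the upper 56 bits are preserved. */
static uint64_t write8(uint64_t old, uint8_t value)
{
    return (old & ~UINT64_C(0xFF)) | value;
}

/* Small self-check of the three cases. */
static void check_register_model(void)
{
    const uint64_t old = UINT64_C(0x1122334455667788);
    assert(write32(old, 0xAABBCCDDu) == UINT64_C(0x00000000AABBCCDD));
    assert(write16(old, 0xBEEFu)     == UINT64_C(0x112233445566BEEF));
    assert(write8(old, 0x01u)        == UINT64_C(0x1122334455667701));
}
```

Only the 16- and 8-bit writes need the old value as an input. That dependence on the previous contents is what gives rise to merging or false dependencies on real hardware, and a 32-bit write has none.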

Thus, I believe that 32 bits are still native.

Yes, the default operand size is 32-bit for most instructions (other than PUSH/POP). 64-bit operand size needs a REX prefix with the W bit set to 1, so prefer 32-bit for code-size reasons. This is why compilers use mov r32, imm32 for addresses of static data (since the default code model requires code and static data addresses to be in the low 2GiB of virtual address space).

That was a design choice by AMD. They could have gone the other way and required a prefix to get 32-bit operand size. Since long mode is a separate mode, x86-64 machine code can differ from x86-32 machine code however it wants to. AMD chose to minimize the differences so they could share as many transistors in the decoders as possible. Your conclusion is correct, but your reasoning is totally bogus.
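As a rough illustration of the code-size point, the sketch below encodes a register-to-register `sub` by hand. `encode_sub_rr` is a hypothetical helper written for this example, not part of any assembler library; it only handles the register-direct form of opcode 0x29 (SUB r/m, r) with 32- or 64-bit operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes `sub dst, src` (register-direct, opcode 0x29).
   dst and src are hardware register numbers 0..15
   (0 = eax/rax, 1 = ecx/rcx, 14 = r14d/r14).
   wide != 0 selects 64-bit operand size (REX.W = 1).
   Writes at most 3 bytes to out and returns the encoded length. */
static size_t encode_sub_rr(uint8_t out[3], unsigned dst, unsigned src, int wide)
{
    assert(dst < 16 && src < 16);

    uint8_t rex = 0x40;          /* REX base pattern 0100WRXB */
    if (wide)    rex |= 0x08;    /* W: 64-bit operand size */
    if (src & 8) rex |= 0x04;    /* R: extends the ModRM.reg field (src) */
    if (dst & 8) rex |= 0x01;    /* B: extends the ModRM.rm field (dst) */

    size_t n = 0;
    if (rex != 0x40)             /* prefix is only emitted when a bit is needed */
        out[n++] = rex;
    out[n++] = 0x29;             /* SUB r/m, r */
    out[n++] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7)); /* mod = 11 */
    return n;
}
```

With this helper, `sub eax, r14d` comes out as 44 29 F0 and `sub rax, r14` as 4C 29 F0: both are 3 bytes, because r14 needs a REX prefix either way. By contrast, `sub ecx, eax` is 29 C1 (2 bytes) while `sub rcx, rax` is 48 29 C1 (3 bytes), which is where the 32-bit form saves space.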

partial register updates (e.g. ax instead of eax) can cause an eflags stall and degrade performance.

Partial-flag stalls are separate from partial-register stalls. They are handled in a similar way internally (the separately renamed parts of EFLAGS have to be merged, just as the modified AX has to be merged with the unmodified upper bytes of EAX). But one doesn't cause the other.

    # partial-reg stall
    setcc al          # leaves the upper 3 (or 7) bytes unmodified
    add   edx, eax    # reads full EAX. Older CPUs stall while merging

Zeroing EAX with xor eax,eax before the flag-setting instruction and the setcc avoids the partial-register penalty entirely. (Core2 / Nehalem stall for fewer cycles than earlier CPUs, but still stall for 2 or 3c while inserting the merging uop. Sandybridge doesn't stall at all while inserting the merging uop.)

(Another summary of partial-register penalties on different CPUs: "Why doesn't GCC use partial registers?", which says basically the same thing.)

AMD doesn't suffer from merging partial registers when the full register is read later; instead, partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knight's Landing are the same.)

Intel Haswell / Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. But setcc al has a false dependency on the old value. They do still rename and merge ah. (Details on HSW / SKL partial-register performance.)

    # partial flag stall when reading a flag that didn't come from
    # the last instruction to write any flags.
    clc
    # edi and esi = one-past-the-end of dst and src
    # ecx = -count
    bigInt_add:
        mov   eax, [esi+ecx*4]
        adc   [edi+ecx*4], eax   # reads CF, partial flag stall on 2nd and later iterations
        inc   ecx                # writes all flags except CF
        jl    bigInt_add         # loop upwards towards zero
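For readers who don't read asm fluently, here is a rough C sketch of what that loop computes, assuming 32-bit limbs stored least-significant first. The function name and signature are made up for this illustration. The C version can't show the stall itself; it only makes visible that the carry is an input to every iteration, which is why CF has to survive the `inc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the adc loop: dst += src over n 32-bit limbs,
   least significant limb first. Returns the carry out of the top limb. */
static uint32_t bigint_add(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    uint32_t carry = 0;                       /* plays the role of CF after clc */
    for (size_t i = 0; i < n; i++) {          /* the asm counts ecx from -n up to 0 */
        uint64_t sum = (uint64_t)dst[i] + src[i] + carry;   /* adc */
        dst[i] = (uint32_t)sum;
        carry  = (uint32_t)(sum >> 32);       /* new CF, consumed by the next adc */
    }
    return carry;
}
```

In the asm, `inc ecx` sits between one `adc` and the next and writes every flag except CF, so the next `adc` reads a CF that did not come from the most recent flag-writing instruction. That is the pattern the comment describes as stalling on older Intel CPUs.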

See this Q&A for a more detailed discussion of partial-flag issues on Intel pre-Sandybridge vs. Sandybridge.

See also Agner Fog's microarch pdf and other links in the x86 tag wiki for more details on all this.
