Two's complement of a long integer

I want to do long-integer math (128 bits) in Intel x86-64 assembly and compute the two's complement. Say my positive value is in RDX:RAX.

The two's complement is formed by "flipping the bits and adding 1". So the most naive implementation is (4 instructions and 14 bytes of code):

  NOT RAX
  NOT RDX
  ADD RAX,1   ; can't use INC, it doesn't set Carry
  ADC RDX,0

If I use the NEG instruction on RAX instead of NOT, the "+1" comes for free, but the Carry is backwards: NEG RAX clears Carry when RAX is zero, yet that is EXACTLY the case where I need Carry set. So the next best approach could be (4 instructions and 11 bytes of code):

  NOT RDX
  NEG RAX
  CMC
  ADC RDX,0   ; fixed, thanks lurker

Still 4 instructions. But instead of adding +1, I can subtract -1, and since SBB adds the Carry bit to the subtrahend, the net effect is a +1 only when Carry is clear. So my next best attempt is 3 instructions and 10 bytes of code:

  NOT RDX
  NEG RAX
  SBB RDX,-1

As you can see from my long-winded text, this is not obvious to follow. Is there a better, more understandable way to compute the two's complement of a multi-register integer in assembly?

+7
2 answers

Shorter instructions or fewer instructions do not necessarily mean faster execution, because latency and throughput differ for each instruction.

For example, legacy instructions like enter, daa, loop ... are very slow and exist only for backward compatibility. Even inc is sometimes slower than add. The same goes for the cmc you used above, on some microarchitectures.

As a result, a longer series of low-latency instructions that can execute in parallel will often run much faster. Some common instruction pairs can even be macro-fused into a single operation. Compiler optimizers know all of this and choose the most suitable instructions to generate.

For this fragment

  __int128 negate(__int128 x) {
      return -x;
  }

ICC 19.0.1 generates the following instructions:

  xor edx, edx    #3.13
  xor eax, eax    #3.13
  sub rax, rdi    #3.13
  sbb rdx, rsi    #3.13

The first two xor instructions cost zero µops, because they are zeroing idioms handled at the register-rename stage. That leaves only 2 instructions to actually execute.

You can switch compilers in the Godbolt link above to see how different compilers negate the value, including MSVC (which unfortunately does not have a 128-bit integer type yet). Below are the results for GCC and Clang.

GCC 8.3:

  mov rax, rdi
  neg rax
  mov rdx, rsi
  adc rdx, 0
  neg rdx

Clang:

  mov rax, rdi
  xor edx, edx
  neg rax
  sbb rdx, rsi

As you can see, Clang also uses only 3 instructions (not counting the first, which moves the input argument to the required destination register). And like xor reg, reg, a mov can also be "free" thanks to mov elimination.

Things may differ if you optimize for size (for example, in cases where cache misses are expensive), because instructions with immediates or longer encodings cost more code bytes.

Whether it is actually faster requires a bit of microbenchmarking. But on Intel processors, the Intel compiler (ICC) often achieves better performance than others, because it has a deeper understanding of the architecture.

Note that this operation is called negation, not two's complement; two's complement is the way of encoding negative numbers.

+2

By the way, negating a number spread across 2 registers works the same in 32-bit or 16-bit mode with EDX:EAX or DX:AX. Use the same instruction sequence.


For copy-and-negate, @phuclv's answer shows the efficient compiler output. Your best bet is xor-zeroing the destination and then using sub / sbb.

That is 4 µops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 µops. The xor-zeroing is off the critical path (it can happen before the data to be negated is ready), so the total latency is 2 or 3 cycles for the high half. The low half is of course ready with 1 cycle of latency.

Clang's mov/neg for the low half might be better on Ryzen, which has mov elimination for general-purpose integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts the mov on the critical latency path. Still, back-end ALU pressure is usually less of a bottleneck than the front-end, for instructions that can run on any ALU port.


To negate in place, use neg, which subtracts from 0:

  neg rdx       ; high half first
  neg rax       ; subtract RDX:RAX from 0
  sbb rdx, 0    ; with carry from low to high half

neg is exactly equivalent to a subtract from 0, in both flag-setting and performance.

An adc / sbb with an immediate 0 is only 1 µop on Intel SnB/IvB/Haswell, as a special case. It's still 2 µops on Nehalem and earlier, though. But working around that with a mov to a different register, an sbb there, and a mov back to RDX would be slower without mov elimination.

The low half (in RAX) is ready in the first cycle after its input is ready for neg. (So out-of-order execution of later code can start using the low half early.)

The high-half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait both for RDX from neg rdx and for CF from neg rax. So it is ready 1 cycle after the low half, or 2 cycles after the high-half input is ready.


The sequence above is better than any of the ones in the question, because it has fewer total µops on very common Intel CPUs. On Broadwell and later (where sbb is a single µop in general, not just with an immediate 0), this one is equally good:

  ;; equally good on Broadwell/Skylake, and AMD. But worse on Intel SnB through HSW
  NOT RDX
  NEG RAX
  SBB RDX,-1    ; can't use the imm=0 special case

None of the 4-instruction sequences is optimal: they have a higher total µop count, and some of them have worse ILP / dependency chains, e.g. 2 critical-path instructions for the low half, or a 3-cycle latency chain for the high half.

0

Source: https://habr.com/ru/post/984669/

