By the way, negating a number held in 2 registers works the same way in 32-bit or 16-bit mode, with EDX:EAX or DX:AX. Use the same instruction sequences.
To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.
That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (it can happen before the data to be negated is ready), so the total latency is 2 or 3 cycles for the high half. The low half is of course ready with 1 cycle of latency.
Clang's mov/neg for the low half may be better on Ryzen, which has mov-elimination for GP integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts a mov on the critical path for latency. Usually, back-end ALU pressure is not as big a deal as front-end bottlenecks, for instructions that can use any ALU port.
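As a rough illustration of what the xor-zero / sub / sbb sequence computes, here is a minimal C sketch that models the two registers as two 64-bit halves and the carry flag as an explicit borrow. The `u128_pair` type and the `copy_negate` name are made up for this example; they are not from @phuclv's answer or from any compiler output.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit value split across two registers. */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Copy-and-negate: start from a zeroed destination, then subtract the
   source, propagating the borrow from the low half into the high half. */
static u128_pair copy_negate(u128_pair src)
{
    u128_pair dst = {0, 0};               /* xor-zeroing both destination registers */
    uint64_t borrow = (dst.lo < src.lo);  /* CF after sub: set whenever src.lo != 0 */
    dst.lo = dst.lo - src.lo;             /* sub  dst_lo, src_lo */
    dst.hi = dst.hi - src.hi - borrow;    /* sbb  dst_hi, src_hi */
    return dst;
}
```

The source is left untouched, which is the point of the copy-and-negate form: only the zeroed destination is modified.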
To negate in place, use neg to subtract from 0:

    neg  rdx        ; high half first
    neg  rax        ; subtract RDX:RAX from 0
    sbb  rdx, 0     ; with carry from low to high half
neg is exactly equivalent to sub from 0, as far as setting flags and performance are concerned.
ADC / SBB with an immediate 0 is only 1 uop on Intel SnB / IvB / Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, a mov to another register and then sbb back into RDX would be slower.
The low half (in RAX) is ready in the first cycle after it becomes ready as an input to neg. (So out-of-order execution of later code can get started using the low half.)
The high-half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for rdx from neg rdx and CF from neg rax. So it is ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.
The above sequence is better than any of those in the question, since it is fewer uops on very common Intel CPUs. On Broadwell and later (where SBB is single-uop in general, not just for an immediate 0), this alternative is equally good:
    ;; equally good on Broadwell/Skylake, and AMD. But worse on Intel SnB through HSW
    not  rdx
    neg  rax
    sbb  rdx, -1    ; can't use the imm=0 special case
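To see why both in-place sequences give the same result, here is a small C model of each, with the carry flag written out as an explicit variable. This is only a sketch: the `u128_pair` type and both function names are invented for illustration, and the model says nothing about uop counts or latency.

```c
#include <stdint.h>

/* Hypothetical stand-in for RDX:RAX (hi:lo). */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Models: neg rdx / neg rax / sbb rdx, 0 */
static u128_pair negate_neg_sbb0(u128_pair v)
{
    v.hi = 0 - v.hi;                /* neg rdx */
    uint64_t cf = (v.lo != 0);      /* neg sets CF unless its operand was 0 */
    v.lo = 0 - v.lo;                /* neg rax */
    v.hi = v.hi - 0 - cf;           /* sbb rdx, 0 */
    return v;
}

/* Models: not rdx / neg rax / sbb rdx, -1 */
static u128_pair negate_not_sbbm1(u128_pair v)
{
    v.hi = ~v.hi;                   /* not rdx (does not write flags) */
    uint64_t cf = (v.lo != 0);      /* CF from neg rax */
    v.lo = 0 - v.lo;                /* neg rax */
    v.hi = v.hi - UINT64_MAX - cf;  /* sbb rdx, -1  ==  ~hi + 1 - CF */
    return v;
}
```

The second version works because ~x equals -x - 1 in two's complement, so subtracting -1 turns the NOT into a negation, and the borrow from the low half is applied in the same instruction.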
Any of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, like 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.