It looks like you can't use imul without a bunch of extra code, since CF and OF are both set the same way. As the "Operation" section of the manual describes, they are set if the full 128-bit result does not match sign_extend(low_half_result). So you're right, even the multi-operand forms of imul still have some signed behaviour. It would be nice if they were like add / sub and set OF and CF independently, so you could look at CF for unsigned data or OF for signed data.
One of the best ways to find a good asm sequence for something is to ask a compiler. C doesn't have convenient integer-overflow detection, but Rust does.
I compiled this function to return the value and the unsigned-wraparound detection bool. Apparently Rust's ABI returns them by passing a pointer as a hidden first arg, rather than in rdx:rax as I think the C ABI would for such a small struct. :(
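To make the flag rule above concrete, here is a minimal Rust sketch that models it in software. The function names `imul_flag` and `unsigned_overflow` are made up for illustration. The model assumes the rule exactly as quoted: the flag is set when the full product differs from the sign-extended low half.

```rust
/// Models the CF/OF result of a 64-bit `imul` (hypothetical helper):
/// set when the full signed product does not fit in 64 bits.
fn imul_flag(a: u64, b: u64) -> bool {
    // imul interprets the register bits as signed values.
    let full = (a as i64 as i128) * (b as i64 as i128);
    // Truncate to the low 64 bits, then sign-extend back to 128 bits.
    let low_sign_extended = full as i64 as i128;
    full != low_sign_extended
}

/// True unsigned overflow (hypothetical helper):
/// the full unsigned product needs more than 64 bits.
fn unsigned_overflow(a: u64, b: u64) -> bool {
    (a as u128) * (b as u128) > u64::MAX as u128
}
```

The two predicates disagree in both directions. With inputs 2^32 and 2^31 the product 2^63 fits in an unsigned 64-bit register but the imul flag is set. With inputs u64::MAX and 2 the unsigned product overflows, but as signed values this is -1 * 2 = -2 and the flag stays clear.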
    pub fn overflowing_mul(a: u64, b: u64) -> (u64, bool) {
        a.overflowing_mul(b)
    }

    # frame-pointer boilerplate elided
    mov     rax, rsi
    mul     rdx
    mov     qword ptr [rdi], rax
    seto    byte ptr [rdi + 8]

    mov     rax, rdi                 # return the pointer to the return-value
    ret
Asm output from the Godbolt compiler explorer (Rust 1.7.0). This more or less confirms that the mov instruction and the extra uop for a one-operand full multiply are more efficient than anything we could do with extra checks after a two-operand imul.
The documentation for mul says:
"The OF and CF flags are set to 0 if the upper half of the result is 0, otherwise they are set to 1."
So in summary, use mul and check OF or CF to see whether the high half is non-zero.
mul vs. imul trivia:
Only the upper half of the full-multiply (N x N => 2N) result differs between imul and mul. I think Intel picked imul as the one that would have multiple explicit operands so that imul r32, r32, sign-extended-imm8 would make more sense, because sign extension is probably more useful than zero extension.
I only just realized that the flag results from imul are signed-only, though. Interesting point.
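As a rough software model of what the mul + seto sequence computes, the sketch below does the widening multiply in u128 and reports "high half is non-zero" as the overflow flag. The name `mul_with_overflow` is hypothetical. This only illustrates the flag rule quoted above and is not the code the compiler emits.

```rust
/// Models one-operand `mul` followed by `seto`/`setc` (hypothetical helper):
/// returns the low 64 bits and whether the high 64 bits are non-zero.
fn mul_with_overflow(a: u64, b: u64) -> (u64, bool) {
    let full = (a as u128) * (b as u128); // N x N => 2N full multiply
    let low = full as u64;                // what ends up in rax
    let high = (full >> 64) as u64;       // what ends up in rdx
    (low, high != 0)                      // CF = OF = (high half != 0)
}
```

For any inputs this should agree with the standard library's `u64::overflowing_mul`, which is the function compiled above.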
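The point that only the upper half of the full product differs between the signed and unsigned multiply can be checked with a small Rust sketch. The helper names `full_mul_unsigned` and `full_mul_signed` are made up for illustration, and 128-bit integers stand in for the rdx:rax register pair.

```rust
/// Unsigned full multiply, like `mul r64` (hypothetical helper).
/// Returns (high, low), i.e. the contents of (rdx, rax).
fn full_mul_unsigned(a: u64, b: u64) -> (u64, u64) {
    let full = (a as u128) * (b as u128);
    ((full >> 64) as u64, full as u64)
}

/// Signed full multiply, like one-operand `imul r64` (hypothetical helper).
/// The same input bits are reinterpreted as signed values.
fn full_mul_signed(a: u64, b: u64) -> (u64, u64) {
    let full = (a as i64 as i128) * (b as i64 as i128);
    // `>>` on i128 is an arithmetic shift; the cast keeps the raw bits.
    ((full >> 64) as u64, full as u64)
}
```

For inputs u64::MAX and 2, both functions return the same low half (u64::MAX - 1), but the high half is 1 for the unsigned multiply and all-ones for the signed one, since the signed product is -2.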
why doesn't gcc use mul for unsigned multiplication?
Because one-operand mul / imul are slower (2 uops instead of 1 on Intel CPUs, according to Agner Fog's insn tables; see also the x86 tag wiki). They also use more registers: they require one of their inputs in rax and produce their outputs in rdx:rax, so extra mov instructions are usually needed to move data into and out of those registers.
So imul r64, r64 is a better choice than mul r64 if you don't care about the flag result.
On Intel CPUs, imul r64,r64 is actually faster than mul r32. That is not the case on some other CPUs, including the AMD Bulldozer family, where 64-bit multiplies are somewhat slower. But since mul r32 puts its result in edx:eax instead of a single destination register, they are not direct replacements for each other in most cases anyway.