C unsigned long long and imulq

As someone new to the build, I use gcc for reverse engineering. But now I am faced with some kind of ridiculous problem: I am trying to multiply two 64-bit integers by x86-64. C code is as follows:

unsigned long long val(unsigned long long a, unsigned long long b){ return a*b; } 

and compiled with gcc:

 val: movq %rdi, %rax imulq %rsi, %rax ret 

It may be inconsistent to use signed multiplication for unsigned integers, but it works for C.

However, I would like to check the multiplication for overflow. Now the overflow flag is set if the result is greater than 2^63-1 (I think, because it means that this is multiplication at the end). But for a 64-bit unsigned, it would still be OK if the result is no more than 2^64-1 .

What is the correct way to do the multiplication (in assembly) in this case?

+4
source share
2 answers

When you multiply two values, the least significant bits of the result will be exactly the same, regardless of whether you perform unsigned or signed multiplication. Thus, if you multiply two 32-bit values, you get a 64-bit result, the low 32-bit ones are the same, regardless of whether the multiplication sign is signed or not. The same for 64-bit multiplication, which gives a 128-bit result, the lower 64-bit ones are identical in both cases.

As such, compilers often use the IMUL instruction (whose mnemonics offer signed multiplication) for both types of multiplication, because it is more flexible than MUL , and is usually faster. While MUL comes in only one form (allowing you to multiply an arbitrary general-purpose register or memory by an implied destination register AL / AX / EAX / RAX), IMUL has many forms, including a single-operation form (same as MUL ), two operands (register or memory, time, register or memory, or immediate) and a three-operation form (register or memory and time; immediate storage of the result in the third destination register). More information is available in the Intel documentation (see x86 tag wiki for links) or a short link for MUL and IMUL .

The reason the compiler can avoid using IMUL all the time is because you throw away the high bits of the result. When you do 32-bit and one time; 32-bit multiplication and storage of the result in a 32-bit variable, the upper 32 bits of the entire 64-bit result were discarded. Again, the same for 64-bit times; A 64-bit multiplication that discards the upper 64-bit data of a 128-bit result, leaving only the lower 64-bit ones that are the same, whether signed or unsigned.

Quote from Intel manual:

The two- and three-operation forms [IMUL] can also be used with unsigned operands because the bottom half of the product is the same regardless of whether the operands are signed or unsigned. However, the CF and OF flags cannot be used to determine if the upper half of the result is different.

Peter Cordes also explained this very well in the section of his larger answer to a very general question about arithmetic operations with two additions .

In any case, when you write the assembly code yourself, you need to decide whether you want to do the same thing the compiler does and discard the top bits of the products, or if you want to save them. If you do not need high bits and it is assumed that the operation will not overflow, write the same code as the compiler.

If you care about the high bits, just use the MUL instruction, which sets the CF and OF flags if the product of the multiplication is greater than can fit into the type of its operands.

 mov rax, QWORD PTR [a] ; put 64-bit operand 'a' into RAX mov rbx, QWORD PTR [b] ; put 64-bit operand 'b' into RBX mul rbx ; multiply 'a' * 'b' ; 128-bit result is returned in RDX:RAX (upper-bits:lower-bits) jo ProductOverflowed 

Using MUL here is almost certainly more efficient than trying to find a way to use IMUL and testing high 64-bit sequences to see if they are non-zero (which indicates overflow). Just having an unpredictable branch will put you behind in performance, compared to 1 or 2 microphones that you would save using IMUL .

+7
source

It looks like you can't use imul without a bunch of extra code, since CF and OF are both set the same. As stated in the operation section of the manual , they are set if the full result of 128b does not match sign_extend(low_half_result) . So you're right, even multi- imul forms still have some signed behavior. It would be nice if they were like add / sub and set OF and CF independently, so you can see CF for unsigned data or OF for signed data.

One of the best ways to find a good asm sequence for something is to ask the compiler. C does not have convenient integer detection, but Rust does .

I compiled this function to return a value and define an unsigned bool. Apparently, Rust ABI returns their passing pointer as a hidden first arg, and not in rdx: rax, as I think C ABI will be for such a small structure .:(

 pub fn overflowing_mul(a: u64, b: u64) -> (u64, bool) { a.overflowing_mul(b) } 
  # frame-pointer boilerplate elided mov rax, rsi mul rdx mov qword ptr [rdi], rax seto byte ptr [rdi + 8] mov rax, rdi # return the pointer to the return-value ret 

Asm is derived from the Godbolt compiler explorer (Rust 1.7.0) . This more or less confirms that the mov command and the additional uop for one-time full multiplication are more efficient than all that we could do with additional checks after two imul operands.

Documentation for mul says

"The OF and CF flags are set to 0 if the upper half of the result is 0, otherwise they are set to 1."

So use mul and check OF or CF to see that the top half is not zero.


mul vs. imul little things:

Only the upper half of the total multiplication result (N x N => 2N) differs from imul and mul . I think Intel chose imul as the one that would have several explicit operands, so imul r32, r32, sign-extended-imm8 would make more sense, since a character extension is probably more useful than a null extension.

I only realized that the flag from imul was only signed. An interesting point.


why doesn't gcc use mul for unsigned multiplication?

Since the mul / imul single operand is slower (instead of 2 Intel processors instead of 2, according to the Agner Fog insn tables . tag wiki). They also use more registers: they require one of their entries in rax and produce their outputs in rdx:rax , so additional mov instructions are usually required to move data to / from these regs.

So imul r64, r64 is a better choice than mul r64 if you don't need the result of the flag.

On Intel imul r64,r64 is faster than mul r32 . This does not apply to some other processors, including the AMD Bulldozer family, where 64-bit multiplication is somewhat slower. But since mul r32 puts its results in edx:eax instead of a single destination register, in any case they are not direct replacements for each other.

+8
source

Source: https://habr.com/ru/post/1436185/


All Articles