Why does idiv use a 64-bit dividend in x86 assembly?

Why does the x86 idiv instruction divide EDX:EAX (64 bits) by a given register, while other arithmetic operations, including multiplication, just work on single input and output registers?

Multiplication:

    mov eax, 3
    imul eax, 5

Division:

    mov edx, 0
    mov eax, 15
    mov ebx, 5
    idiv ebx

I know that EDX is used to store the remainder, but why is there no separate instruction for this behavior? It seems inconsistent to me.

+4
4 answers

The instruction set contains the instructions needed to implement arbitrary-precision arithmetic efficiently. For addition and subtraction, all you need beyond the fixed-width operation is to know whether it produced a carry (for addition) or a borrow (for subtraction). That is why there is a carry flag. For multiplication, you need to be able to multiply two words and get a double-word result. This is why one-operand imul gives its result in edx:eax. For division, you need to be able to divide a double-width number by a word and get the quotient and the remainder.
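
As a rough illustration of why the 2N / N => N form matters for multi-word arithmetic, here is a minimal C sketch of schoolbook short division. The function name `bignum_div_small` and the little-endian limb layout are assumptions made for this example, not something from the answer. Each step divides a 64-bit value `rem:limb` by a 32-bit divisor, which is the same shape as `div` with EDX:EAX as the dividend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divide a multi-word unsigned number by one 32-bit word, in place.
 * limbs[0] is the least significant word (assumed layout).
 * Returns the remainder. Hypothetical helper for illustration only. */
uint32_t bignum_div_small(uint32_t *limbs, size_t n, uint32_t divisor)
{
    assert(divisor != 0);
    uint32_t rem = 0;
    for (size_t i = n; i-- > 0; ) {
        /* rem:limbs[i] plays the role of EDX:EAX before a div. */
        uint64_t dividend = ((uint64_t)rem << 32) | limbs[i];
        /* The quotient fits in 32 bits because rem < divisor. */
        limbs[i] = (uint32_t)(dividend / divisor);
        rem = (uint32_t)(dividend % divisor);
    }
    return rem;
}
```

The remainder of one step becomes the high half of the next dividend, so the remainder is already where the next division needs it.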

To understand why you need these specific operations, see Knuth's The Art of Computer Programming, Volume 2, which details the algorithms for implementing arbitrary-precision arithmetic.

As to why the x86 instruction set does not have more forms of the multiplication and division instructions: multiplication and division by values that are not powers of two are much less common than other instructions, so Intel probably did not want to spend opcodes on them that could go to more frequently used instructions. Most multiplications and divisions in general-purpose programs are by powers of two; for those you can use bit shifts or the lea instruction.
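
A small C sketch of the two cases mentioned above, unsigned division by a power of two as a shift and multiplication by a small constant in the form that `lea` can compute. The function names are made up for this example, and what a particular compiler actually emits is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned division by 2^k is a plain right shift. */
uint32_t div_pow2(uint32_t x, unsigned k)
{
    assert(k < 32); /* shifting a 32-bit value by 32 or more is undefined in C */
    return x >> k;
}

/* x * 5 written as x + x*4, the base + index*scale shape lea supports. */
uint32_t mul_by_5(uint32_t x)
{
    return x + (x << 2);
}
```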

+6

Multiplication also has a double-width form (the one-operand mul or imul).

If you are asking "Why is there no two-operand idiv that gives only the quotient?", then I really don't know (I have a theory, but I do not work for Intel), and I wish it existed.

The existing form works out well when you want to do modular multiplication with a modulus that is not a power of two: you can do a mul and follow it directly with a div, and everything is already in the right place. That is a consequence, not the reason, though; for the reason we would have to ask Intel. But here is the theory. Back on the 8086 there was only the widening multiply (and it was a slow, iterative kind of multiplication with early-out, as in software). Later, with the 80286, they added some more flexible multiplications, but they never did the same for division. Perhaps the need was less pressing: after all, divisions are relatively rare, while you often need multiplications by small constants, for example to index arrays of structs.
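
The mul-then-div pattern for modular multiplication can be sketched in C as follows. The name `mulmod32` is an assumption for this example; the 64-bit intermediate stands in for EDX:EAX.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod m without losing the high half of the product.
 * On x86 this maps naturally to `mul` (product in EDX:EAX) followed
 * by `div m` (remainder in EDX). The div would fault if the quotient
 * did not fit in 32 bits, which cannot happen when a < m or b < m. */
uint32_t mulmod32(uint32_t a, uint32_t b, uint32_t m)
{
    assert(m != 0);
    uint64_t product = (uint64_t)a * b; /* 32 x 32 => 64 bits */
    return (uint32_t)(product % m);     /* 64 / 32 => 32-bit remainder */
}
```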

+4

For addition and subtraction, the overflow is a single bit, handled by the carry flag. If you take two arbitrary N-bit operands and multiply them, you need 2 * N bits to hold the result. It is easy to try for yourself: 0xFF * 0xFF = 0xFE01. If only N-bit registers were used, the multiply instruction would be extremely limited. Division is the reverse of multiplication: 2 * N bits go in and N bits come out. If you provide N bits * N bits = 2 * N bits, then you should also implement 2 * N bits / N bits = N bits. That is the why. Unfortunately, the hardware does more here than languages do. Languages ought to know about this and handle it: if I multiply two bytes, the compiler should complain about precision if my result variable is narrower than 16 bits. Likewise, any programmer who uses addition, subtraction, multiplication or division should be aware of overflow and, in those languages, use variables that are twice the width of the operands so that they do not overflow.
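
The 8-bit case can be checked in a few lines of C. This is only a sketch with made-up function names; it shows the widening multiply, what is lost when the result is kept in N bits, and the matching 2N / N => N division.

```c
#include <assert.h>
#include <stdint.h>

/* 8 x 8 => 16 bits: the full product always fits. */
uint16_t mul8_wide(uint8_t a, uint8_t b)
{
    return (uint16_t)((unsigned)a * b);
}

/* 8 x 8 => 8 bits: the high half of the product is discarded. */
uint8_t mul8_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned)a * b);
}

/* 16 / 8 => 8 bits: the inverse of the widening multiply. */
uint8_t div16_by_8(uint16_t dividend, uint8_t divisor)
{
    assert(divisor != 0);
    assert((dividend >> 8) < divisor); /* otherwise the quotient needs more than 8 bits */
    return (uint8_t)(dividend / divisor);
}
```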

+2

There are two questions here. First, there is the question of double-width input or output, and you are overlooking the one-operand MUL / IMUL forms, which do a full widening multiply including the high half of the result: N * N => 2N bits, producing EDX:EAX = EAX * src. See the other answers for why this is useful.

BMI2 even introduced a more flexible full-multiply instruction, MULX, which has three explicit operands (two outputs and one input) and only one implicit operand (the second source, EDX).


Second, your example uses an immediate operand, which is also not available for DIV / IDIV, and nobody has mentioned that.

There is one obscure instruction that is effectively a divide-by-immediate: it does 8 bits / imm8 => 8-bit quotient and remainder, not 16 / 8 => 8. It is called AAM, and it is not available in 64-bit mode. Assemblers default to dividing by 10 (for the intended BCD use case), but it is the same opcode with any imm8. There is a write-up on how to use DIV or AAM to turn an integer 0-99 into two ASCII digits, which also points out the many subtle differences between AAM and DIV r/m8.
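
To make the AAM behavior concrete, here is a small C model of it: AH receives the quotient and AL the remainder of AL divided by the immediate. The struct and function names are invented for this sketch, and the digit conversion is just the 0-99 use case mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Result registers of the modeled instruction (hypothetical type). */
typedef struct {
    uint8_t ah; /* quotient  */
    uint8_t al; /* remainder */
} aam_result;

/* Model of AAM imm8: AH = AL / imm8, AL = AL % imm8. */
aam_result aam(uint8_t al, uint8_t imm8)
{
    assert(imm8 != 0); /* the real instruction raises #DE for imm8 = 0 */
    aam_result r = { (uint8_t)(al / imm8), (uint8_t)(al % imm8) };
    return r;
}

/* Turn an integer 0-99 into two ASCII digits plus a terminator. */
void two_digits(uint8_t value, char out[3])
{
    assert(value <= 99);
    aam_result r = aam(value, 10);
    out[0] = (char)('0' + r.ah);
    out[1] = (char)('0' + r.al);
    out[2] = '\0';
}
```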

Intel could have added immediate versions of DIV / IDIV at any time, but never did. I assume that DIV / IDIV is slow enough (and rare enough) that the extra overhead of a mov reg, imm32 is negligible, and that the cost in opcode space (and decoder transistors) for such an instruction was never considered worth it.


More importantly, actually using the hardware divider for a compile-time constant is usually only useful for code size, not performance. Modular multiplicative inverses have been well known (to compiler writers) since the 90s. Since compilers would not even use a divide-by-constant instruction, Intel was unlikely to add one for processors designed after this technique became known. For example, clang compiles unsigned int div10(unsigned int a) { return a/10; } to:

    mov ecx, edi          # just to zero-extend to 64-bit
    mov eax, 3435973837   # a sign-extended imm32 can't represent this constant, I guess. clang uses imul r,r,imm for other cases.
    imul rax, rcx         # 64-bit multiply instead of 32x32 => 64 in two separate regs
    shr rax, 35           # extract part of the high-half result.
    ret

A few more instructions are needed for signed division, and sometimes an add / subtract on the result for less convenient divisors. See some examples on Godbolt. Still, this is faster than the hardware division instructions, which are very slow: for example, 22-29 cycles of latency for Haswell DIV r64, with poor throughput.
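
The multiply-and-shift sequence for unsigned division by 10 can be written directly in C. This is a sketch of the same arithmetic, using the constant 3435973837 (0xCCCCCCCD) and the shift count 35 from the compiler output quoted in this answer; the function name is made up here.

```c
#include <stdint.h>

/* a / 10 for any 32-bit unsigned a, without a division instruction:
 * multiply by 0xCCCCCCCD (2^35 / 10 rounded up) in 64 bits,
 * then keep the bits above bit 35. */
uint32_t div10_mul(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * 3435973837u) >> 35);
}
```

The 64-bit product is the reason the widening multiply matters here: the useful part of the result lives in the high half.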


If they were going to spend opcodes (and decoder transistors / power) on additional instructions, a two-operand form of IDIV with a single-width dividend might be useful for compilers.

I don't know much about how hardware dividers are implemented internally, so I don't know whether there are savings to be had from doing only N / N => N-bit division instead of the usual 2N / N => N. In compiler output, almost all divisions are performed right after CDQ or xor edx,edx. The divider has variable latency on many x86 microarchitectures, so if there were any speedup available when the dividend is really only N bits, presumably the hardware already looks for it. However, Skylake DIV / IDIV r32 has a constant 26-cycle latency (but the 64-bit divide is much slower and still has highly variable latency).

Presumably a DIV r32, r32 would still produce 2 outputs (quotient and remainder), I guess in the two input registers? In that case you would often need extra MOV instructions to preserve your inputs. Or perhaps you would want an immediate that selects the quotient or the remainder to go into a single destination, or two separate opcodes for quotient and remainder?

At that point, they could add a VEX-coded version that looks a bit like MULX, with three explicit operands. However, the intended use case for MULX is extended-precision multiplication interleaved with extended-precision add-with-carry, so a DIVX r64(quotient), r64(remainder), r/m64(divisor) (with an implicit dividend in RDX?) would be significantly different (less useful for extended precision). They would probably make the implicit dividend RDX:RAX anyway. Or maybe they would not even call it DIVX, since that is already a trademark of a video codec / company.

0

Source: https://habr.com/ru/post/1436182/

