There are two parts to the question. First, you're asking about double-width inputs or outputs, and ignoring the one-operand forms of MUL / IMUL that do a full widening multiply, including the high half of the result: N * N => 2N bits, making EDX:EAX = EAX * src. See the other answers for why that's useful.
BMI2 even introduced a more flexible full-multiply instruction, MULX, which has three explicit operands (two outputs and one input) and only one implicit operand (second source = EDX).
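As a rough illustration of the N * N => 2N widening multiply described above, here is a minimal C sketch. The function name `widening_mul32` is made up for this example; the two output pointers loosely mirror the two output registers (EDX:EAX for MUL, or the two explicit destinations of MULX).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models a 32 x 32 => 64-bit widening multiply.
   *hi corresponds to what one-operand MUL leaves in EDX,
   *lo to what it leaves in EAX. */
void widening_mul32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    assert(hi != NULL && lo != NULL);
    uint64_t product = (uint64_t)a * b;   /* widen before multiplying so no bits are lost */
    *hi = (uint32_t)(product >> 32);      /* high half of the 64-bit result */
    *lo = (uint32_t)product;              /* low half of the 64-bit result */
}
```

Compilers targeting x86 typically turn the `(uint64_t)a * b` pattern into a single widening multiply, so the high half is available without any extra work.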
Second, you give an example that uses an immediate operand, which is also unavailable for DIV / IDIV, and nobody has mentioned that.
There is one obscure instruction that is effectively a divide-by-immediate, doing 8-bit / imm8 => 8-bit quotient and remainder, not 16 / 8 => 8. It's called AAM, and it isn't available in 64-bit mode. Assemblers default to dividing by 10 (for the intended BCD use case), but it's the same opcode with any imm8. Here's how to use DIV or AAM to turn a 0-99 integer into two ASCII digits, also pointing out the many subtle differences between AAM and DIV r/m8.
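To make the AAM semantics concrete, here is a small C model of it, used for the 0-99 to two ASCII digits conversion. This is only a sketch: the names `aam_result`, `aam_model` and `two_ascii_digits` are invented for illustration, and flag updates are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: the two 8-bit results of AAM. */
typedef struct {
    uint8_t ah;   /* quotient  (AAM puts it in AH) */
    uint8_t al;   /* remainder (AAM puts it in AL) */
} aam_result;

/* Model of AAM imm8: AH = AL / imm8, AL = AL % imm8.
   Note this is the opposite register assignment from DIV r/m8,
   which leaves the quotient in AL and the remainder in AH. */
aam_result aam_model(uint8_t al, uint8_t imm8)
{
    assert(imm8 != 0);   /* the real instruction raises #DE for imm8 == 0 */
    aam_result r = { (uint8_t)(al / imm8), (uint8_t)(al % imm8) };
    return r;
}

/* Turn an integer 0..99 into two ASCII digits (not NUL-terminated). */
void two_ascii_digits(uint8_t n, char out[2])
{
    assert(n <= 99);
    aam_result r = aam_model(n, 10);   /* 10 is the assembler's default imm8 */
    out[0] = (char)('0' + r.ah);       /* tens digit */
    out[1] = (char)('0' + r.al);       /* ones digit */
}
```

For example, `two_ascii_digits(42, buf)` stores `'4'` and `'2'` in `buf`.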
Intel could have added immediate versions of IDIV at any point, but never did. I assume that DIV / IDIV are slow enough (and rare enough) that the extra overhead of a mov reg, imm32 is negligible, and that spending opcode space (and decoder transistors) on such an instruction was never deemed worth it.
More importantly, actually doing a hardware division by a compile-time constant is usually only useful for code size, not performance. Modular multiplicative inverses have been well known (by compiler writers) since the 90s. Since compilers wouldn't even use a divide-by-immediate instruction, Intel was unlikely to add one to CPUs designed after this technique became well known. For example, clang compiles unsigned int div10(unsigned int a) { return a/10; } into
    mov     ecx, edi
    mov     eax, 3435973837
    imul    rax, rcx
    shr     rax, 35
    ret
A few more instructions are needed for signed division, and sometimes an add / sub on the results for less simple divisors. See some examples on Godbolt. Still, it's faster than the hardware division instructions, which are very slow: for example, 22-29 cycle latency for DIV r32 on Haswell, with poor throughput.
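The multiply-and-shift trick for unsigned division by 10 can be written out in C roughly like this. It's a sketch of the general technique for this one divisor: the function name `div10_mul` is made up, and the constant 0xCCCCCCCD (3435973837) is ceil(2^35 / 10).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned a / 10 without a divide instruction.
   Multiply by ceil(2^35 / 10) in 64-bit arithmetic, then shift the
   product right by 35 to keep only the quotient bits. */
uint32_t div10_mul(uint32_t a)
{
    const uint64_t magic = 0xCCCCCCCDu;    /* 3435973837 = ceil(2^35 / 10) */
    uint64_t product = (uint64_t)a * magic; /* fits in 64 bits: both factors < 2^32 */
    return (uint32_t)(product >> 35);
}
```

The rounding error from using the rounded-up constant stays small enough that the result matches `a / 10` for every 32-bit `a`. Other divisors need their own constant and shift count, and some need an extra fixup step.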
If they were going to spend opcodes (and decoder transistors / power) on extra instructions, a two-operand form of IDIV with a single-width dividend could have been useful for compilers.
I don't know much about how hardware dividers are implemented internally, so IDK if there are savings to be had from only doing N / N => N bit division instead of the usual 2N / N => N. In compiler output, almost all divisions are done after CDQ or xor edx,edx. The divide unit has variable latency on many x86 microarchitectures, so if there were any speedup to be had when the dividend is actually only N bits, presumably the hardware already looks for that. However, Skylake DIV / IDIV r32 have a fixed 26c latency (but 64-bit division is much slower and still has highly variable latency).
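To spell out what 2N / N => N means for the existing instruction, here is a rough C model of unsigned DIV r/m32. The names `divmod32` and `div32_model` are invented for this sketch, and a `false` return stands in for the #DE exception the real instruction raises.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: the two outputs of DIV r/m32. */
typedef struct {
    uint32_t quotient;    /* left in EAX */
    uint32_t remainder;   /* left in EDX */
} divmod32;

/* Model of unsigned DIV r/m32: the dividend is the 64-bit value EDX:EAX.
   Returns false where the real instruction would fault (#DE):
   divisor == 0, or a quotient that does not fit in 32 bits. */
bool div32_model(uint32_t edx, uint32_t eax, uint32_t divisor, divmod32 *out)
{
    if (divisor == 0)
        return false;
    uint64_t dividend = ((uint64_t)edx << 32) | eax;
    uint64_t q = dividend / divisor;
    if (q > UINT32_MAX)
        return false;                      /* quotient overflow */
    out->quotient  = (uint32_t)q;
    out->remainder = (uint32_t)(dividend % divisor);
    return true;
}
```

Zeroing EDX first (the `xor edx,edx` case) means the dividend is really only 32 bits wide, so the quotient always fits and only a zero divisor can fault.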
Presumably a DIV r32, r32 would still produce two outputs (quotient and remainder), I guess in the two input registers? So you'd often need extra MOV instructions to preserve your inputs. Or maybe it would have an immediate that selects the quotient or the remainder to go into the one destination, or use two separate opcodes for quotient / remainder?
At that point they might add a VEX-coded version that's a bit like MULX, with three explicit operands. However, the intended use case for MULX is to allow interleaving extended-precision multiplication with extended-precision add-with-carry, so a DIVX r64(quotient), r64(remainder), r/m64(divisor) (with the implicit dividend in RDX?) would be significantly different (and less useful for extended precision). They'd probably just make the implicit dividend RDX:RAX anyway. Or maybe they wouldn't even call it DIVX, since that's already a trademark for a video codec / company.