As usual, ask the compiler how to do something efficiently: GNU C supports __int128_t and __uint128_t on 64-bit platforms.
__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }
compiles to (gcc 6.2 -O3 on Godbolt):
imul rsi, rdx
Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI and b is in RCX:RDX. The result is returned in RDX:RAX.
It is pretty neat that this takes only one MOV instruction, since gcc does not need the high half of the result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL, since they are only used once.
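To make the "only the low half of the cross products" point concrete, here is a minimal C sketch of the same decomposition, assuming GNU C on a 64-bit target. The names u128_parts, mul128_by_halves and matches_builtin are made up for illustration and do not come from gcc or the code above; this is not meant as a replacement for simply writing a*b.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* Hypothetical sketch: low 128 bits of a*b, computed from 64-bit halves.
 * Only a_lo * b_lo needs a full 64x64 => 128 widening multiply.
 * The cross products a_hi*b_lo and a_lo*b_hi are scaled by 2^64, so only
 * their low 64 bits can land in the 128-bit result; a truncating 64-bit
 * multiply (2-operand IMUL on x86-64) is enough for them.
 * a_hi * b_hi is scaled by 2^128 and drops out entirely. */
static u128_parts mul128_by_halves(uint64_t a_lo, uint64_t a_hi,
                                   uint64_t b_lo, uint64_t b_hi)
{
    __uint128_t low_product = (__uint128_t)a_lo * b_lo;  /* widening multiply */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;          /* wraps mod 2^64 */

    u128_parts r;
    r.lo = (uint64_t)low_product;
    r.hi = (uint64_t)(low_product >> 64) + cross;        /* wraps mod 2^64 */
    return r;
}

/* Returns 1 if the sketch agrees with the compiler's built-in 128-bit multiply. */
static int matches_builtin(__uint128_t a, __uint128_t b)
{
    u128_parts r = mul128_by_halves((uint64_t)a, (uint64_t)(a >> 64),
                                    (uint64_t)b, (uint64_t)(b >> 64));
    __uint128_t expected = a * b;
    return r.lo == (uint64_t)expected && r.hi == (uint64_t)(expected >> 64);
}
```

Calling matches_builtin on a few inputs is a quick way to check the sketch against the compiler's own a*b.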
With -march=haswell to enable BMI2, gcc uses MULX to avoid even that one MOV.
Sometimes the compiler’s output is not perfect, but very often the overall strategy is a good starting point for manual optimization.
Of course, if what you really wanted in the first place was 128-bit multiplication in C, just use the compiler's built-in support. This lets the optimizer do its job, often yielding better results than if you had written a few parts in inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm).