How can I multiply two hexadecimal 128-bit numbers in an assembly

Question

How can I multiply two hexadecimal 128-bit numbers in an assembly

I have two 128-bit numbers in memory in hexadecimal, for example (a bit endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I have to do an unsigned multiplication between these two numbers so that my new number is:

 z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Now I know that I can move half x and y to the rax and rbx and, for example, perform the mul operation and do the same with the other half. The problem is that by doing this, I lose the transfer, and I have no idea how I can avoid this. After about 4 hours, I ran into this problem, and the only solution I can see is binary conversion ( and ↔ shl,1 ).

Can you give me some information about this problem?
I believe that the best solution is to switch to one byte in time.

+5

assembly algorithm x86-64 byte multiplication

Davide Nov 24 '16 at 10:55

source share

2 answers

Let & mu; = 2 ⁶⁴ then we can split your 128-bit numbers a and b into a = a ₁ & mu; + a ₂ and b = b ₁ ? + b ₂ . Then we can calculate c = ab with 64 · 64 → 128 bits of multiplication, first computing partial products:

d <sub> 1sub> & mu; + q ₂ = a ₂ b ₂
g <sub> 1sub> & mu; + r ₂ = a ₁ b ₂
s <sub> 1sub> & mu; + s ₂ = a ₂ b ₁
t <sub> 1sub> & mu; + t ₂ = a ₁ b ₁

and then accumulating them into a 256-bit result (observe overflow when performing add-ons!):

c = t ₁ ? ³ + (t ₂ + s ₁ + r ₁ )? Thinsp; & mu; ² + (s ₂ + r ₂ + q ₁ )? Thinsp; & mu; + q ₂

+8

fuz Nov 24 '16 at 23:11

source share

Peter Cordes · Accepted Answer · 2016-11-25T04:36:22+0000

As usual, ask the compiler how to do something efficiently : GNU C supports __int128_t and __uint128_t on 64-bit platforms.

 __uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles ( gcc6.2 -O3 in Godbolt )

  imul rsi, rdx # tmp94, b mov rax, rdi # tmp93, a imul rcx, rdi # tmp95, a mul rdx # b add rcx, rsi # tmp96, tmp94 add rdx, rcx #, tmp96 ret

Since this is targeting the System 86 x86-64 calling convention, a is in RSI: RDI, and b is in RCX: RDX. The result is returned to RDX: RAX .

It is pretty elegant that this requires only one MOV command, since gcc does not need the high end of a_upper * b_lower or vice versa. It can destroy high halves of inputs with a faster 2-operand form IMUL, since they are used only once.

Using -march=haswell to enable BMI2, gcc uses MULX to avoid even a single MOV.

Sometimes the compiler’s output is not perfect, but very often the overall strategy is a good starting point for manual optimization.

Of course, if what you really wanted in the first place was 128-bit multiplication by C, just use the built-in compiler support. This allows the optimizer to do its job, often yielding better results than if you wrote a couple of inline-asm parts. ( https://gcc.gnu.org/wiki/DontUseInlineAsm ).

How can I multiply two hexadecimal 128-bit numbers in an assembly

More articles: