How can I multiply two hexadecimal 128-bit numbers in an assembly

I have two 128-bit numbers in memory in hexadecimal, for example (a bit endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 

I have to do an unsigned multiplication between these two numbers so that my new number is:

 z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 

Now I know that I can move half x and y to the rax and rbx and, for example, perform the mul operation and do the same with the other half. The problem is that by doing this, I lose the transfer, and I have no idea how I can avoid this. After about 4 hours, I ran into this problem, and the only solution I can see is binary conversion ( andshl,1 ).

Can you give me some information about this problem?
I believe that the best solution is to switch to one byte in time.

+5
source share
2 answers

As usual, ask the compiler how to do something efficiently : GNU C supports __int128_t and __uint128_t on 64-bit platforms.

 __uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; } 

compiles ( gcc6.2 -O3 in Godbolt )

  imul rsi, rdx # tmp94, b mov rax, rdi # tmp93, a imul rcx, rdi # tmp95, a mul rdx # b add rcx, rsi # tmp96, tmp94 add rdx, rcx #, tmp96 ret 

Since this is targeting the System 86 x86-64 calling convention, a is in RSI: RDI, and b is in RCX: RDX. The result is returned to RDX: RAX .

It is pretty elegant that this requires only one MOV command, since gcc does not need the high end of a_upper * b_lower or vice versa. It can destroy high halves of inputs with a faster 2-operand form IMUL, since they are used only once.

Using -march=haswell to enable BMI2, gcc uses MULX to avoid even a single MOV.


Sometimes the compiler’s output is not perfect, but very often the overall strategy is a good starting point for manual optimization.


Of course, if what you really wanted in the first place was 128-bit multiplication by C, just use the built-in compiler support. This allows the optimizer to do its job, often yielding better results than if you wrote a couple of inline-asm parts. ( https://gcc.gnu.org/wiki/DontUseInlineAsm ).

+6
source

Let & mu; = 2 64 then we can split your 128-bit numbers a and b into a = a 1 & mu; + a 2 and b = b 1 ? + b 2 . Then we can calculate c = ab with 64 · 64 → 128 bits of multiplication, first computing partial products:

d <sub> 1sub> & mu; + q 2 = a 2 b 2
g <sub> 1sub> & mu; + r 2 = a 1 b 2
s <sub> 1sub> & mu; + s 2 = a 2 b 1
t <sub> 1sub> & mu; + t 2 = a 1 b 1

and then accumulating them into a 256-bit result (observe overflow when performing add-ons!):

c = t 1 ? 3 + (t 2 + s 1 + r 1 )? Thinsp; & mu; 2 + (s 2 + r 2 + q 1 )? Thinsp; & mu; + q 2

+8
source

Source: https://habr.com/ru/post/1260312/


All Articles