Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

Question

On Intel and AMD x86_64 processors, the vector (SIMD) registers have dedicated fused multiply-add instructions, but the general-purpose (scalar, integer) registers don't: you basically need to multiply and then add (unless you can fit things into an lea).

Why is that? Is it so useless that it isn't worth the overhead?

x86-64 intel cpu-architecture amd instruction-set

einpoklum Mar 13 '18 at 10:35

1 answer

Peter Cordes · Answer 1 · 2018-03-13T11:28:41+0000

Integer multiply is common, but it is not one of the most common things to do with integers. With floating-point numbers, though, multiply and add are used all the time, and FMA gives major speedups for a lot of ALU-bound FP code.

Also, floating point actually avoids precision loss with FMA (the internal temporary x*y is not rounded at all before the add). This is why the ISO C99 / C++ math library function fma() exists, and why it is slow to implement without hardware FMA support.

Integer FMA (or multiply-accumulate, a.k.a. MAC) has no precision advantage over a separate multiply and add.
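As a minimal sketch of why fusing buys nothing for integers: unsigned arithmetic in C wraps modulo 2^64, so the two-step version and the single-expression version below always agree. The function names are hypothetical, and the instruction choices mentioned in the comments are what compilers typically emit on x86-64, not a guarantee.

```c
#include <stdint.h>

/* Two explicit steps: typically an imul followed by an add on x86-64. */
uint64_t mul_add_two_steps(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t product = a * b;  /* wraps modulo 2^64, nothing is rounded */
    return product + c;        /* wraps modulo 2^64 as well */
}

/* One expression: same result, because there is no intermediate rounding
   that a fused instruction could skip. */
uint64_t mul_add_one_expr(uint64_t a, uint64_t b, uint64_t c) {
    return a * b + c;
}
```

For example, both functions return 17 for (3, 4, 5), and both return 1 for (UINT64_MAX, 2, 3) after wrapping.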

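Some non-x86 ISAs do provide integer FMA. It is not useless, but Intel and AMD did not bother to include it until AVX512-IFMA (and that is still SIMD only, basically exposing the 52-bit mantissa multiplier needed for double-precision FMA/vmulpd to integer instructions).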

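Examples outside x86 include:

MIPS32, madd/maddu (unsigned), which multiply-accumulate into the hi/lo registers (the special registers that ordinary multiply and divide instructions use as their destination).
ARM smlal and friends (32x32 => 64-bit MAC, or 16x16 => 32-bit), also available in unsigned form. The operands are the regular R0..R15 general-purpose registers.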
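In C, the operation these MAC instructions perform looks like the sketch below. The names mac_s32 and dot_s32 are hypothetical. A compiler targeting ARM or MIPS may select smlal or madd for this pattern, but that depends on the compiler and its flags.

```c
#include <stdint.h>

/* Widening multiply-accumulate: 32x32 => 64-bit product, added to a
   64-bit accumulator. Casting before the multiply avoids 32-bit overflow. */
int64_t mac_s32(int64_t acc, int32_t a, int32_t b) {
    return acc + (int64_t)a * (int64_t)b;
}

/* Typical use: a dot product that accumulates in 64 bits.
   The accumulator itself can still overflow for very long inputs. */
int64_t dot_s32(const int32_t *x, const int32_t *y, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc = mac_s32(acc, x[i], y[i]);
    }
    return acc;
}
```

On x86-64 the same source typically compiles to a sign-extending load or move, an imul, and an add, which is the multiply-then-add sequence the question describes.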

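Integer FMA would be useful on x86, but uops with 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of them is flags, not a full integer register. Before Intel Broadwell, those instructions decoded to 2 uops, even though 3-input uop support had already been added for FP FMA in Haswell.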

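Since Haswell, micro-fused uops with 3 inputs can stay micro-fused (in some cases). Sandybridge/Ivybridge un-laminate indexed addressing modes such as add eax, [rdx+rcx]. (Nehalem and earlier kept them fused; SnB simplified the internal uop format.) That is still a load plus an ALU operation sharing one fused-domain entry, not a single ALU uop with three register inputs. The 3-input single-uop instructions in Broadwell/Skylake take 2 integer inputs plus flags, not 3 integer inputs.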

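Intel does use a "unified" scheduler that handles both FP and integer uops, and it already has to track 3 inputs for FP FMA. So, IDK, maybe there is no fundamental technical obstacle. If there isn't, IDK why Intel didn't include an integer FMA as part of BMI2, which did add things like mulx (a 2-input, 2-output mul with mostly explicit operands, unlike the usual mul, which uses rdx:rax).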

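SSE2/SSSE3 do have integer mul-add instructions for vector registers, but only with a horizontal add after widening: 16x16 => 32-bit (SSE2 pmaddwd), or (unsigned) 8 x (signed) 8 => 16-bit (SSSE3 pmaddubsw).

But those are only 2-input instructions, even though they do two multiplies and an add, so they are very different from FMA.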
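To make the difference from FMA concrete, here is a portable scalar model of what pmaddwd computes on one 128-bit register. The function name pmaddwd_model is made up. The real instruction is reached from C through the _mm_madd_epi16 intrinsic, which this sketch avoids so that it builds on any target.

```c
#include <stdint.h>

/* Scalar model of SSE2 pmaddwd on a 128-bit register: 8 signed 16-bit
   lanes per input, 4 signed 32-bit lanes out. There are only two inputs
   and no separate addend, so this is not a three-operand FMA. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t even = (int32_t)a[2 * i] * b[2 * i];          /* 16x16 => 32 */
        int32_t odd  = (int32_t)a[2 * i + 1] * b[2 * i + 1];  /* 16x16 => 32 */
        /* Horizontal add of the adjacent pair. The sum is done in unsigned
           so the single overflow case (all four inputs -32768) wraps instead
           of being undefined. Converting back to int32_t is
           implementation-defined in C; GCC and Clang wrap. */
        out[i] = (int32_t)((uint32_t)even + (uint32_t)odd);
    }
}
```

With a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {1, 1, 1, 1, 2, 2, 2, 2}, the four results are 3, 7, 22 and 30: each output lane is the sum of two adjacent products.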

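Footnote: on the "scalar" wording in the question, scalar FMA does exist for FP. It is part of the same FMA3 extension that added the packed versions: VFMADD231SD and friends work on scalar double, and the same flavors of vfmaddXXXss work on scalar float in XMM registers.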
