Does using a mix of pxor and xorps affect performance?

Question

Does using a mix of pxor and xorps affect performance?

I came across a fast CRC computation using a PCLMULQDQ implementation. I see that the authors mix the pxor and xorps instructions heavily, as in the following fragment:

movdqa  xmm10, [rk9]
movdqa  xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm0

movdqa  xmm10, [rk11]
movdqa  xmm8, xmm1
pclmulqdq xmm1, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm1

Is there any practical reason for this? A performance boost? If so, what lies beneath it? Or is it just a kind of coding style, for fun?

+4

assembly x86 sse simd

Alexander Zhak 01 Oct '16 at 21:21

1 answer

Peter Cordes · Accepted Answer · 2016-10-03T09:59:24+0000

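TL:DR: it looks like this might be some microarchitecture-specific tuning for this particular code sequence. There is nothing "generally recommended" about it that will help in other cases.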

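On further consideration, I think @Iwillnotexist Idonotexist's theory is the most likely: this was written by a non-expert who thought it might help. The register allocation is a big clue: many REX prefixes could have been avoided by keeping all the repeatedly used registers in the low 8.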
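To make the REX point concrete, here is a minimal sketch that encodes the register-to-register forms of the two instructions by hand. The helper name `encode_xor` is hypothetical and only covers the `xmm, xmm` form; the opcode bytes (66 0F EF for PXOR, 0F 57 for XORPS) are the standard SSE encodings.

```python
def encode_xor(mnemonic, dst, src):
    """Encode `mnemonic xmm<dst>, xmm<src>` (register-register form only).

    Hypothetical helper for illustration; dst and src are register
    numbers in the range 0..15.
    """
    opcode = {"pxor": 0xEF, "xorps": 0x57}[mnemonic]
    out = []
    if mnemonic == "pxor":
        out.append(0x66)  # mandatory prefix of the SSE2 integer form
    # REX.R (bit 2) extends the ModRM.reg field (dst),
    # REX.B (bit 0) extends the ModRM.rm field (src).
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)
    if rex != 0x40:
        out.append(rex)  # only needed when xmm8..xmm15 is involved
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)
    out += [0x0F, opcode, modrm]
    return bytes(out)


for mnemonic, dst, src in [("pxor", 7, 8), ("pxor", 7, 0), ("xorps", 7, 0)]:
    code = encode_xor(mnemonic, dst, src)
    print(f"{mnemonic} xmm{dst}, xmm{src}: {code.hex()} ({len(code)} bytes)")
```

With these inputs, `pxor xmm7, xmm8` from the question takes 5 bytes because xmm8 forces a REX prefix, the same PXOR on low registers takes 4, and `xorps xmm7, xmm0` takes 3. That one-byte difference between XORPS and PXOR is also what makes a code-size motive plausible.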

XORPS runs in the "float" domain on some Intel CPUs (Nehalem and later), while PXOR always runs in the "ivec" domain.

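Since wiring every ALU output to every ALU input for forwarding results directly would be expensive, CPU designers split them into domains. (Forwarding saves the latency of writing back to the register file and re-reading it.) A domain crossing can cost an extra 1 cycle of latency (Intel SnB family) or 2 cycles (Nehalem).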
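As a rough illustration of what a bypass delay does to a dependency chain, here is a toy model. It is an assumption-laden sketch, not a simulator: it charges 1 cycle per XOR, adds the full penalty on every domain change, and ignores the cases where real hardware forwards without a penalty.

```python
# Toy model: which bypass domain each instruction executes in.
DOMAIN = {"pxor": "ivec", "xorps": "float"}
BASE_LATENCY = 1  # cycles; both XOR flavours are single-cycle ops


def chain_latency(ops, bypass_penalty):
    """Latency of a serial dependency chain of XORs.

    Adds `bypass_penalty` cycles whenever a result is consumed in a
    different domain from the one that produced it (worst case).
    """
    total = 0
    prev_domain = None
    for op in ops:
        domain = DOMAIN[op]
        if prev_domain is not None and domain != prev_domain:
            total += bypass_penalty
        total += BASE_LATENCY
        prev_domain = domain
    return total


# The xmm7 accumulator chain across two blocks of the question's code.
mixed = ["pxor", "xorps", "pxor", "xorps"]
uniform = ["pxor", "pxor", "pxor", "pxor"]

for name, penalty in [("SnB-family", 1), ("Nehalem", 2)]:
    print(name, chain_latency(mixed, penalty), chain_latency(uniform, penalty))
```

Under these assumptions the mixed chain costs 7 cycles with a 1-cycle penalty and 10 cycles with a 2-cycle penalty, against 4 cycles for the all-PXOR chain, so alternating domains on the accumulator can only add latency.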

Further reading: my answer on "What's the difference between logical SSE intrinsics?"

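Two theories occur to me:

Whoever wrote this thought that PXOR and XORPS would give more parallelism, because they do not compete with each other. (This is wrong: PXOR can run on all vector ALU ports, but XORPS cannot.)
This is very cleverly tuned code that creates a bypass delay on purpose, to avoid resource conflicts that could delay execution of the next PCLMULQDQ. (Or, as EOF suggests, code size / alignment might have something to do with it.)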

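The copyright notice on the code says "2011-2015 Intel", so it is worth considering the possibility that this somehow helps on some recent Intel CPU, and is not just based on a misunderstanding of how Intel CPUs work. Nehalem was the first CPU to include PCLMULQDQ at all, and this is Intel, so if anything it would be tuned to do badly on AMD CPUs. The code history is not in the git repo, only the single commit (dated the 6th) that added the current version.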

The Intel whitepaper (from 2009) that it is based on used only PXOR, not XORPS, in its version of the 2x pclmul / 2x xor block.

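Agner Fog's tables do not even list a number of uops for PCLMULQDQ on Nehalem, or which ports they need. It has 12c latency and one-per-8c throughput, so it may be similar to the 18 uop implementation on Sandy/Ivybridge. Haswell brings it down to an impressive 3 uops (2p0 p5), and it is only 1 uop on Broadwell (p0) and Skylake (p5).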

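XORPS can only run on port5 (until Skylake, where it also runs on all three vector ALU ports). On Nehalem it has a 2c bypass delay when one of its inputs comes from PXOR. For SnB-family CPUs, Agner Fog says: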

In some cases, there is no bypass delay when using the wrong type of shuffle or Boolean instruction.

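So I think there is actually no extra bypass delay for forwarding from PXOR → XORPS on SnB, and the only effect would be that it can run only on port 5. On Nehalem, it might actually delay the XORPS until after the PSHUFBs are done.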

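In the main unrolled loop there is a PSHUFB after the XORs, to set up the inputs for the next PCLMUL. SnB/IvB can run integer shuffles on p1/p5 (unlike Haswell and later, where there is only one shuffle unit, on p5, but it is 256b wide for AVX2).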

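Since competing for the ports needed to set up the input of the next PCLMUL does not seem useful, my best guess is code size / alignment, if this change was made while tuning for SnB.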

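On CPUs where PCLMULQDQ is more than 4 uops, it is microcoded. This means that each PCLMULQDQ needs an entire uop cache line to itself. Since only 3 uop cache lines can map to the same 32B block of x86 instructions, much of the code will not fit in the uop cache at all on SnB/IvB. Each uop cache line can only cache contiguous instructions. From Intel's optimization manual: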

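All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.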
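A toy packing model shows how quickly microcoded instructions exhaust those three lines. This is a sketch under stated assumptions, not a description of the real allocator: it assumes up to 6 uops per line (the figure Intel documents for SnB-family), a whole line for each microcoded instruction, no splitting of an instruction across lines, and hypothetical instruction lists that each sit inside one 32-byte block.

```python
MAX_UOPS_PER_LINE = 6      # assumption: documented SnB-family line capacity
MAX_LINES_PER_BLOCK = 3    # at most 3 lines per aligned 32-byte block


def lines_needed(block):
    """Count uop cache lines used by one 32-byte block of code.

    `block` is a list of (uops, microcoded) tuples in program order.
    """
    lines = 0
    room = 0  # free uop slots in the currently open line
    for uops, microcoded in block:
        if microcoded:
            lines += 1   # a microcoded instruction takes a line by itself
            room = 0     # and nothing else may share that line
        elif uops <= room:
            room -= uops
        else:
            lines += 1   # open a new line for this instruction
            room = MAX_UOPS_PER_LINE - uops
    return lines


def fits_in_uop_cache(block):
    return lines_needed(block) <= MAX_LINES_PER_BLOCK


PCLMUL_SNB = (18, True)  # 18 uops, microcoded on Sandy/Ivybridge
SIMPLE = (1, False)      # movdqa / pxor / xorps

# Hypothetical blocks: two PCLMULQDQs vs. one within the same 32 bytes.
two_pclmul = [SIMPLE, SIMPLE, PCLMUL_SNB, PCLMUL_SNB, SIMPLE, SIMPLE]
one_pclmul = [SIMPLE, PCLMUL_SNB, SIMPLE, SIMPLE]

print(lines_needed(two_pclmul), fits_in_uop_cache(two_pclmul))
print(lines_needed(one_pclmul), fits_in_uop_cache(one_pclmul))
```

In this model a block holding two PCLMULQDQs with ordinary instructions on both sides needs 4 lines and cannot be cached, while a block holding one needs 3 and just fits, which is why shaving a byte here and there could change where the 32-byte boundaries fall.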

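This sounds very similar to the issue of having integer DIV in a loop: "Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs". With the right alignment, you can get it to run from the uop cache (the DSB in Intel's performance-counter terminology). @Iwillnotexist Idonotexist did some useful testing of microcoded instructions on a Haswell CPU, showing that they prevent running from the loopback buffer (the LSD in Intel terminology).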

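On Haswell and later, PCLMULQDQ is not microcoded, so it can share a uop cache line with the instructions before or after it.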

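For earlier CPUs, it might be worth trying to tweak the code so that it busts the uop cache in fewer places. OTOH, switching between the uop cache and the legacy decoders might be worse than simply always running from the decoders.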

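Also, IDK if such a big unroll is really helpful. It probably varies a lot between SnB and Skylake, since microcoded instructions behave very differently in the pipeline, and SKL may not even bottleneck on PCLMUL throughput.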
