Why is vpclmulqdq with a memory operand slower than movdqa + pclmulqdq?

Question

The vpclmulqdq instruction has four operands and pclmulqdq has three, so I thought vpclmulqdq could replace movdqa + pclmulqdq. In my experiments, however, the result was slower.

But when I use vpaddd instead of movdqa + paddd, I get a faster result, so I am confused. The code that uses paddd:

movdqa %xmm0, %xmm8          # slower
movdqa %xmm0, %xmm9
movdqa %xmm0, %xmm10
movdqa %xmm0, %xmm11
paddd (ONE),  %xmm8
paddd (TWO),  %xmm9
paddd (THREE),  %xmm10
paddd (FOUR),  %xmm11

vpaddd (ONE), %xmm0, %xmm8   # faster
vpaddd (TWO), %xmm0, %xmm9
vpaddd (THREE), %xmm0, %xmm10
vpaddd (FOUR), %xmm0, %xmm11

The code that uses pclmulqdq, for example:

movdqa %xmm15, %xmm1               # faster
pclmulqdq $0x00, (%rbp), %xmm1
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
movdqa %xmm14, %xmm3
pclmulqdq $0x00, 16(%rbp), %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11

vpclmulqdq $0x00, (%rbp), %xmm15, %xmm1   # slower
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
vpclmulqdq $0x00, 16(%rbp), %xmm14, %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11
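When swapping one encoding for the other, it can help to confirm that both versions still produce the same product. The following is a minimal sketch of a portable reference model, assuming the immediate is $0x00 as in the code above, so that the low 64-bit halves of both source operands are multiplied without carries into a 128-bit result. The names `u128` and `clmul64` are hypothetical helpers for illustration, not part of any library.

```c
#include <stdint.h>

/* 128-bit result split into two 64-bit halves (hypothetical helper type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* Carry-less (GF(2) polynomial) multiply of two 64-bit values.
 * Models what pclmulqdq / vpclmulqdq compute with immediate $0x00,
 * where a and b are the low quadwords of the two source operands. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;              /* low part of a shifted left by i */
            if (i != 0)
                r.hi ^= a >> (64 - i);   /* bits shifted out of the low half */
        }
    }
    return r;
}
```

For example, `clmul64(3, 3)` gives `lo == 5`, because (x + 1)^2 = x^2 + 1 when coefficients are reduced modulo 2. This model is far too slow for production use; it is only meant for checking results.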

Another question: when my data is unaligned, how should I write code such as pxor (%rdi), %xmm0? (Editor's note: removed from the title because this is a separate question and because there is no better answer than aligning the pointers for the main part of the loop.)

My data is only 16-bit (2-byte) aligned. With a memory operand, the xor looks like this:

pxor (%rdi), %xmm8     # would segfault from misaligned %rdi
pxor 16(%rdi), %xmm9
pxor 32(%rdi), %xmm10
pxor 48(%rdi), %xmm11

Is there a better way than loading the data first, like this?

movdqu (%rdi), %xmm0
movdqu 16(%rdi), %xmm13
movdqu 32(%rdi), %xmm14
movdqu 48(%rdi), %xmm15

pxor %xmm0, %xmm8
pxor %xmm13, %xmm9
pxor %xmm14, %xmm10
pxor %xmm15, %xmm11
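The same load-then-XOR pattern can be sketched in C with SSE2 intrinsics, where `_mm_loadu_si128` corresponds to movdqu and `_mm_xor_si128` to pxor. This is only an illustration under stated assumptions: the function name `xor64_unaligned` and the 64-byte state layout are hypothetical, and a scalar fallback is included so the sketch also builds on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* XOR 64 bytes from src into state, 16 bytes at a time.
 * Neither pointer needs to be 16-byte aligned. */
static void xor64_unaligned(uint8_t state[64], const uint8_t *src)
{
    assert(state != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < 64; i += 16) {
        /* Unaligned loads (movdqu); these never fault on misalignment. */
        __m128i s = _mm_loadu_si128((const __m128i *)(state + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(src + i));
        /* Register-register XOR (pxor), then an unaligned store. */
        _mm_storeu_si128((__m128i *)(state + i), _mm_xor_si128(s, d));
    }
#else
    /* Portable fallback with the same result, one byte at a time. */
    for (size_t i = 0; i < 64; i++)
        state[i] ^= src[i];
#endif
}
```

When a compiler is allowed to emit AVX, it will typically fold the unaligned load into vpxor, because VEX-encoded instructions do not require alignment for their memory operands (apart from the explicitly aligned moves such as vmovdqa).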

assembly x86 sse avx micro-optimization

Bai 25 . '17 8:30
