The vpclmulqdq instruction has four operands and pclmulqdq has three, so I thought vpclmulqdq could be used instead of movdqa + pclmulqdq, but in my experiments the code became slower.
But when I use vpaddd instead of movdqa + paddd, I get a faster result. So I am confused by this. The code using paddd:
movdqa %xmm0, %xmm8
movdqa %xmm0, %xmm9
movdqa %xmm0, %xmm10
movdqa %xmm0, %xmm11
paddd (ONE), %xmm8
paddd (TWO), %xmm9
paddd (THREE), %xmm10
paddd (FOUR), %xmm11
The same code using vpaddd:
vpaddd (ONE), %xmm0, %xmm8
vpaddd (TWO), %xmm0, %xmm9
vpaddd (THREE), %xmm0, %xmm10
vpaddd (FOUR), %xmm0, %xmm11
The code using pclmulqdq, for example:
movdqa %xmm15, %xmm1
pclmulqdq $0x00, (%rbp), %xmm1
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
movdqa %xmm14, %xmm3
pclmulqdq $0x00, 16(%rbp), %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11
The same code using vpclmulqdq:
vpclmulqdq $0x00, (%rbp), %xmm15, %xmm1
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
vpclmulqdq $0x00, 16(%rbp), %xmm14, %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11
Another question: when I use unaligned data, how should I write code such as pxor (%rdi), %xmm0? (Editor's note: removed from the title because this is a separate question and because there is no better answer than aligning the pointers for the main part of the loop.)
The data is not 16-byte aligned (it is only 2-byte aligned). The xor code:
pxor (%rdi), %xmm8
pxor 16(%rdi), %xmm9
pxor 32(%rdi), %xmm10
pxor 48(%rdi), %xmm11
Should it instead be written like this?
movdqu (%rdi), %xmm0
movdqu 16(%rdi), %xmm13
movdqu 32(%rdi), %xmm14
movdqu 48(%rdi), %xmm15
pxor %xmm0, %xmm8
pxor %xmm13, %xmm9
pxor %xmm14, %xmm10
pxor %xmm15, %xmm11
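The editor's note above mentions aligning the pointers for the main part of the loop. As a minimal sketch of that idea in C (the helper name bytes_to_align16 is hypothetical and not from the original code), the number of leading bytes to handle separately before the pointer reaches a 16-byte boundary could be computed like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the original code): returns how many
 * bytes lie between p and the next 16-byte boundary, or 0 if p is already
 * 16-byte aligned. Those leading bytes would be processed with unaligned
 * loads (movdqu) before the main loop switches to aligned memory operands. */
static size_t bytes_to_align16(const void *p)
{
    /* Unsigned negation wraps modulo 2^N, so (-addr) & 15 equals
     * (16 - addr % 16) % 16. */
    return (size_t)(-(uintptr_t)p & 15u);
}
```

In assembly the same value comes from a neg followed by an and with 15 on a copy of the pointer. This only helps if the input pointer is the only one that has to be aligned; it is a sketch under that assumption.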