Why does AES in SSE not provide a complete function?

Question

The Rijndael key schedule includes RotWord, SubWord and XOR steps, all of which are supported by _mm_aeskeygenassist_si128:

    X3[31:0] ← SRC[127:96];
    X2[31:0] ← SRC[95:64];
    X1[31:0] ← SRC[63:32];
    X0[31:0] ← SRC[31:0];
    RCON[31:0] ← ZeroExtend(Imm8[7:0]);
    DEST[31:0] ← SubWord(X1);
    DEST[63:32] ← RotWord(SubWord(X1)) XOR RCON;
    DEST[95:64] ← SubWord(X3);
    DEST[127:96] ← RotWord(SubWord(X3)) XOR RCON;
    DEST[VLMAX-1:128] (Unmodified)

However, it does not return the full round key. For example, instead of just doing

    DEST[31:0] ← SubWord(X1),

I think it should really compute

    DEST[31:0] ← RotWord(SubWord(X3)) XOR RCON XOR X0.

As a result, after _mm_aeskeygenassist_si128 developers must do additional work before the round key is fully generated.
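A rough, portable sketch of that additional work for a 128-bit key. This is not the intrinsic itself: `keygenassist_model` is a hypothetical software stand-in that follows the quoted pseudocode, with words held in FIPS-197 byte order rather than the register's little-endian lane layout. `block128`, `expand_key_128` and the on-the-fly S-box construction are likewise illustrative choices.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t sbox[256];

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int s) {
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Build the AES S-box: multiplicative inverse, then the affine map. */
static void init_sbox(void) {
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0; /* 0 has no inverse and maps to 0 */
        for (int y = 1; y < 256; y++)
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
    assert(sbox[0x00] == 0x63 && sbox[0x01] == 0x7C); /* known entries */
}

typedef struct { uint32_t w[4]; } block128; /* w[0] = X0 ... w[3] = X3 */

static uint32_t sub_word(uint32_t w) {
    return ((uint32_t)sbox[w >> 24] << 24) |
           ((uint32_t)sbox[(w >> 16) & 0xFF] << 16) |
           ((uint32_t)sbox[(w >> 8) & 0xFF] << 8) |
           sbox[w & 0xFF];
}

static uint32_t rot_word(uint32_t w) { return (w << 8) | (w >> 24); }

/* Assumed software model of the quoted AESKEYGENASSIST pseudocode. */
static block128 keygenassist_model(block128 src, uint8_t imm8) {
    uint32_t rc = (uint32_t)imm8 << 24; /* RCON lands in the word's first byte */
    block128 d;
    d.w[0] = sub_word(src.w[1]);
    d.w[1] = rot_word(sub_word(src.w[1])) ^ rc;
    d.w[2] = sub_word(src.w[3]);
    d.w[3] = rot_word(sub_word(src.w[3])) ^ rc;
    return d;
}

/* AES-128: one assist per round key, plus the XOR chain left to the caller. */
void expand_key_128(const uint8_t key[16], block128 rk[11]) {
    static const uint8_t rcon[10] = {0x01, 0x02, 0x04, 0x08, 0x10,
                                     0x20, 0x40, 0x80, 0x1B, 0x36};
    init_sbox();
    for (int i = 0; i < 4; i++)
        rk[0].w[i] = ((uint32_t)key[4 * i] << 24) |
                     ((uint32_t)key[4 * i + 1] << 16) |
                     ((uint32_t)key[4 * i + 2] << 8) | key[4 * i + 3];
    for (int r = 1; r <= 10; r++) {
        /* Only lane 3 of the assist result is needed for 128-bit keys. */
        uint32_t t = keygenassist_model(rk[r - 1], rcon[r - 1]).w[3];
        rk[r].w[0] = rk[r - 1].w[0] ^ t;
        for (int i = 1; i < 4; i++)
            rk[r].w[i] = rk[r - 1].w[i] ^ rk[r].w[i - 1];
    }
}
```

With the FIPS-197 example key 2b7e1516 28aed2a6 abf71588 09cf4f3c, the first word of round key 1 should come out as a0fafe17. The chained XORs in the inner loop are the part the instruction leaves to the caller.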

Why doesn't SSE provide a complete AES key generation procedure?

assembly x86 sse aes instruction-set

Jason 01 Oct '17 at 13:30

1 answer

Peter Cordes · Accepted Answer · 2017-10-01T19:25:08+0000

See "Key Expansion Using AESKEYGENASSIST" (page 23) in the Intel AES-NI technical documentation. It points out that the instruction can be used as a building block for different key sizes: 128/192/256 bits. It only shows an example for 128-bit keys, doing the extra work you describe in a function called after every aeskeygenassist instruction.
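To illustrate the building-block point with a different key size, here is a hedged sketch of AES-256 expansion on top of the same kind of software model. It does not reproduce Intel's code. `keygenassist_model`, `block128` and `expand_key_256` are hypothetical names, and words are kept in FIPS-197 byte order rather than the register's lane layout. The thing to notice is that the 256-bit schedule alternates between two different lanes of the assist result, which a single "complete" 128-bit instruction could not serve.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t sbox[256];

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int s) {
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Build the AES S-box: multiplicative inverse, then the affine map. */
static void init_sbox(void) {
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0; /* 0 has no inverse and maps to 0 */
        for (int y = 1; y < 256; y++)
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
    assert(sbox[0x00] == 0x63 && sbox[0x01] == 0x7C); /* known entries */
}

typedef struct { uint32_t w[4]; } block128; /* w[0] = X0 ... w[3] = X3 */

static uint32_t sub_word(uint32_t w) {
    return ((uint32_t)sbox[w >> 24] << 24) |
           ((uint32_t)sbox[(w >> 16) & 0xFF] << 16) |
           ((uint32_t)sbox[(w >> 8) & 0xFF] << 8) |
           sbox[w & 0xFF];
}

static uint32_t rot_word(uint32_t w) { return (w << 8) | (w >> 24); }

/* Assumed software model of the AESKEYGENASSIST pseudocode. */
static block128 keygenassist_model(block128 src, uint8_t imm8) {
    uint32_t rc = (uint32_t)imm8 << 24; /* RCON lands in the word's first byte */
    block128 d;
    d.w[0] = sub_word(src.w[1]);
    d.w[1] = rot_word(sub_word(src.w[1])) ^ rc;
    d.w[2] = sub_word(src.w[3]);
    d.w[3] = rot_word(sub_word(src.w[3])) ^ rc;
    return d;
}

/* AES-256: 15 round keys; even and odd steps use different assist lanes. */
void expand_key_256(const uint8_t key[32], block128 rk[15]) {
    static const uint8_t rcon[7] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40};
    init_sbox();
    for (int i = 0; i < 8; i++)
        rk[i / 4].w[i % 4] = ((uint32_t)key[4 * i] << 24) |
                             ((uint32_t)key[4 * i + 1] << 16) |
                             ((uint32_t)key[4 * i + 2] << 8) | key[4 * i + 3];
    for (int r = 2; r < 15; r++) {
        uint32_t t = (r % 2 == 0)
            /* even step: RotWord(SubWord(X3)) XOR RCON, lane 3 */
            ? keygenassist_model(rk[r - 1], rcon[r / 2 - 1]).w[3]
            /* odd step: plain SubWord(X3), lane 2, no rotation or RCON */
            : keygenassist_model(rk[r - 1], 0).w[2];
        rk[r].w[0] = rk[r - 2].w[0] ^ t;
        for (int i = 1; i < 4; i++)
            rk[r].w[i] = rk[r - 2].w[i] ^ rk[r].w[i - 1];
    }
}
```

With the FIPS-197 256-bit example key (603deb10 ... 0914dff4), the first words of round keys 2 and 3 should be 9ba35411 and a8b09c1a. Compared with the 128-bit case, only the lane selection and the distance of the XOR chain (two round keys back instead of one) change.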

AESKEYGENASSIST is already microcoded (for example, 13 uops on Skylake versus only 1 for AESDEC / AESENC, per http://agner.org/optimize/), so having separate instructions that also performed the last few steps, which differ between key sizes, would not make key expansion much faster the way it is currently implemented.

Skylake runs aeskeygenassist at a throughput of one per 12 cycles, but Nehalem manages one per 2 cycles, the same as AESENC. So on Nehalem, I think, it was implemented mostly in dedicated hardware. That is probably another part of the explanation: more steps would have required more hardware in the first-generation implementation, or would have made the instruction microcoded (which it probably wasn't on Nehalem) so that the additional steps could be done with more uops.

Intel evidently does not consider key setup performance-critical because, as I said, they reduced the performance of aeskeygenassist after Nehalem. (Even on Sandybridge it was microcoded, with a throughput of one per 8 cycles.)

Having different instructions for different key sizes would also take more opcodes. At that point Intel had not yet introduced VEX prefixes, so spending more opcodes on AES instructions would have reduced the room for future extensions. (VEX has a ton of coding space, using only a couple of multi-bit codes for the existing combinations of mandatory prefixes used by current instructions.)
