I am working on an implementation of SHA-256 using Power8 built-ins. The performance is off a bit. I estimate that it is off by about 2 cycles per byte (cpb).
The C/C++ code for performing SHA on a block looks like this:
I compile the program with GCC using -O3 and -mcpu=power8 on a ppc64-le machine. When I look at the disassembly, I see several of these:
    ...
    10000b0c: a6 03 09 7d   mtctr   r8
    10000b10: 57 02 00 f0   xxswapd vs32,vs32
    10000b14: 6b 04 00 10   vperm   v0,v0,v0,v17
    10000b18: 57 02 00 f0   xxswapd vs32,vs32
    10000b1c: 99 57 00 7c   stxvd2x vs32,0,r10
    10000b20: 99 26 0c 7c   lxvd2x  vs32,r12,r4
    10000b24: 57 02 00 f0   xxswapd vs32,vs32
    10000b28: 6b 04 00 10   vperm   v0,v0,v0,v17
    10000b2c: 57 02 00 f0   xxswapd vs32,vs32
    10000b30: 99 67 0a 7c   stxvd2x vs32,r10,r12
    10000b34: 99 26 0b 7c   lxvd2x  vs32,r11,r4
    10000b38: 57 02 00 f0   xxswapd vs32,vs32
    10000b3c: 6b 04 00 10   vperm   v0,v0,v0,v17
    10000b40: 57 02 00 f0   xxswapd vs32,vs32
    10000b44: 99 5f 0a 7c   stxvd2x vs32,r10,r11
    10000b48: 99 26 05 7c   lxvd2x  vs32,r5,r4
    10000b4c: 57 02 00 f0   xxswapd vs32,vs32
    10000b50: 6b 04 00 10   vperm   v0,v0,v0,v17
    10000b54: 57 02 00 f0   xxswapd vs32,vs32
    10000b58: 99 2f 0a 7c   stxvd2x vs32,r10,r5
    ...
The vperm v0,v0,v0,v17 looks like a dead instruction because v0 does not appear to be used after the permute.
What does vperm v0,v0,v0,v17 do?
The C++ source code is available at sha256-p8.cxx.
The source file was compiled using g++ -g3 -O3 -Wall -DTEST_MAIN -mcpu=power8 sha256-2-p8.cxx -o sha256-2-p8.exe.
The full disassembly is available at PPC64 SHA-256 Disassembly.
I think the snippet above is being generated for SHA256_SCHEDULE. I see the run of VectorShiftLeft (vsldoi) instructions after the block in question.
To narrow it down further, I'm sure this is the code that handles the first 16 words:
    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    for (unsigned int i=0; i<16; i+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(data, i*4), mask), W, i*4);
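For readers who want to see what that mask is meant to do, here is a minimal scalar sketch of the permute. It assumes the element-order reading of the operation, where byte i of the result is byte mask[i] of the input. The names Bytes16, kByteSwapMask and permute_bytes are hypothetical stand-ins for illustration, not the wrappers from my source file, and the sketch says nothing about how GCC schedules the real vperm.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte value standing in for one vector register.
using Bytes16 = std::array<std::uint8_t, 16>;

// Same byte indices as the mask in the loop above.
constexpr Bytes16 kByteSwapMask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};

// Scalar stand-in for a permute whose two source operands are the same
// register: byte i of the result is byte mask[i] of the input.
// Only the low 4 bits of each index are used, since both sources are equal.
Bytes16 permute_bytes(const Bytes16& in, const Bytes16& mask) {
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = in[mask[i] & 0x0F];
    return out;
}
```

Applied to the bytes 00 01 02 ... 0f, this yields 03 02 01 00 07 06 05 04 0b 0a 09 08 0f 0e 0d 0c, i.e. each 32-bit word is byte-reversed, which is the big-endian to little-endian conversion SHA-256 needs when loading message words.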
SHA256_SCHEDULE looks like this:
Here is an image of the section in question with v0 highlighted.
