If you want to speed up your code, then you need to understand exactly how your processor works and where the bottleneck is.
Here is my speed-optimized version to show how this can be done.
On my PC it is about 5 times faster than yours (clearing a 1 MB block of memory); test it and ask if something is unclear:
//edx = memory pointer must be 16 bytes aligned
//ecx = memory count must be multiple of 16
  xorps xmm0, xmm0              //Clear xmm0
  mov eax, ecx                  //Save ecx to eax
  and ecx, 0FFFFFF80h           //Clear only 128 byte pages
  jz @ClearRest                 //Less than 128 bytes to clear
@Aligned128BMove:
  movdqa [edx], xmm0            //Clear first 16 bytes of 128 bytes
  movdqa [edx + 10h], xmm0      //Clear second 16 bytes of 128 bytes
  movdqa [edx + 20h], xmm0      //...
  movdqa [edx + 30h], xmm0
  movdqa [edx + 40h], xmm0
  movdqa [edx + 50h], xmm0
  movdqa [edx + 60h], xmm0
  movdqa [edx + 70h], xmm0
  add edx, 128                  //inc mem pointer
  sub ecx, 128                  //dec counter
  jnz @Aligned128BMove
@ClearRest:
  and eax, 07Fh                 //Clear the rest
  jz @Exit
@LoopRest:
  movdqa [edx], xmm0
  add edx, 16
  sub eax, 16
  jnz @LoopRest
@Exit:
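If you would rather not maintain inline assembly, roughly the same idea can be sketched in C with SSE2 intrinsics. This is only an illustration, not the routine I measured: the function name clear_aligned16 is made up here, and whether the compiler keeps the unrolled stores as written depends on the compiler and its flags. The same preconditions apply: the pointer must be 16-byte aligned and the count must be a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_store_si128 */
#include <stddef.h>

/* Hypothetical helper (name is an assumption, not from the answer above).
   p must be 16-byte aligned; count must be a multiple of 16. */
static void clear_aligned16(void *p, size_t count)
{
    __m128i zero = _mm_setzero_si128();   /* same role as xorps xmm0, xmm0 */
    __m128i *dst = (__m128i *)p;
    size_t blocks = count / 128;          /* number of full 128-byte blocks */
    size_t rest = (count % 128) / 16;     /* leftover 16-byte stores */

    while (blocks--) {
        /* Eight aligned 16-byte stores, unrolled like the asm loop. */
        _mm_store_si128(dst + 0, zero);
        _mm_store_si128(dst + 1, zero);
        _mm_store_si128(dst + 2, zero);
        _mm_store_si128(dst + 3, zero);
        _mm_store_si128(dst + 4, zero);
        _mm_store_si128(dst + 5, zero);
        _mm_store_si128(dst + 6, zero);
        _mm_store_si128(dst + 7, zero);
        dst += 8;                         /* advance 128 bytes */
    }
    while (rest--) {
        _mm_store_si128(dst++, zero);     /* clear the remaining 16-byte chunks */
    }
}
```

Call it with a buffer that really is 16-byte aligned (for example one declared with _Alignas(16)), because the aligned store faults on a misaligned address just like movdqa does.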
Gj. Oct 9 2018-12-12T00: 00Z