ZeroMemory in SSE

I need a simple ZeroMemory implementation using SSE (preferably SSE2). Can someone help with this? I searched through SO and the net, but did not find a direct answer.

+2
optimization assembly x86 sse
Oct 08 '12 at 17:50
3 answers

Is ZeroMemory() or memset() not good enough?

Disclaimer: Some of the following may be SSE3.

  • Fill in any unaligned leading bytes, looping until the address is a multiple of 16.
  • push to save an xmm register.
  • pxor to zero the xmm register.
  • While the remaining length is >= 16,
    • store 16 bytes with movdqa or movntdq.
  • pop to restore the xmm register.
  • Fill in any unaligned trailing bytes.
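The steps above can be sketched in C with SSE2 intrinsics. This is only an illustration, not a tuned implementation: the function name zero_sse2 is made up for this example, and the explicit push/pop steps do not appear because the compiler takes care of saving and restoring xmm registers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name is an assumption): zero n bytes at dst. */
void zero_sse2(void *dst, size_t n)
{
    unsigned char *p = dst;

    /* Head: byte stores until p is 16-byte aligned or n runs out. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = 0;
        n--;
    }
    assert(n == 0 || ((uintptr_t)p & 15) == 0);

    /* Body: one aligned 16-byte store (movdqa) per iteration. */
    const __m128i zero = _mm_setzero_si128();   /* pxor xmm, xmm */
    while (n >= 16) {
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}
```

Calling zero_sse2(buf + 3, 200) clears exactly bytes 3 through 202 of buf, whatever the alignment of buf.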

movntdq may seem faster because it tells the processor not to bring the data into the cache, but that can hurt performance later if the data is used soon afterwards. It may be more appropriate if you are clearing memory before freeing it (as you might do with SecureZeroMemory()).
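For comparison, here is a minimal sketch of the non-temporal variant using the _mm_stream_si128 intrinsic, which corresponds to movntdq. The name zero_stream_sse2 and its preconditions (16-byte aligned pointer, length a multiple of 16) are assumptions made to keep the example short. It does not try to provide the guarantee SecureZeroMemory() gives that the stores will not be optimized away.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name is an assumption).
   dst must be 16-byte aligned and n must be a multiple of 16. */
void zero_stream_sse2(void *dst, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);

    __m128i *p = dst;
    const __m128i zero = _mm_setzero_si128();

    /* Non-temporal 16-byte stores (movntdq): written around the cache. */
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, zero);

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

Whether this beats the movdqa version depends on the block size and on whether the memory is touched again soon, so it is worth measuring both.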

+4
Oct 08 '12 at 18:15

If you want to speed up your code, then you need to understand exactly how your processor works and where the bottleneck is.

Here is my speed-optimized version to show how to do this.

On my PC it is about 5 times faster than yours (clearing a 1 MB block of memory). Test it and ask if anything is unclear:

    //edx = memory pointer, must be 16-byte aligned
    //ecx = byte count, must be a multiple of 16
      xorps xmm0, xmm0            //Clear xmm0
      mov eax, ecx                //Save ecx to eax
      and ecx, 0FFFFFF80h         //Round down to a multiple of 128 bytes
      jz @ClearRest               //Less than 128 bytes to clear
    @Aligned128BMove:
      movdqa [edx], xmm0          //Clear first 16 bytes of 128 bytes
      movdqa [edx + 10h], xmm0    //Clear second 16 bytes of 128 bytes
      movdqa [edx + 20h], xmm0    //...
      movdqa [edx + 30h], xmm0
      movdqa [edx + 40h], xmm0
      movdqa [edx + 50h], xmm0
      movdqa [edx + 60h], xmm0
      movdqa [edx + 70h], xmm0
      add edx, 128                //Advance memory pointer
      sub ecx, 128                //Decrement counter
      jnz @Aligned128BMove
    @ClearRest:
      and eax, 07Fh               //Bytes left over after the 128-byte blocks
      jz @Exit
    @LoopRest:
      movdqa [edx], xmm0          //Clear the rest, 16 bytes at a time
      add edx, 16
      sub eax, 16
      jnz @LoopRest
    @Exit:
+1
Oct 9

A large share of the transistors in your CPU exist to make memory access as fast as possible. The processor already does an amazing job with memory accesses, and instructions execute much faster than memory can be read or written.

Thus, trying to beat memset is futile in most cases, because it is already limited by the speed of your memory (as others have mentioned).
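The simplest way to check this on your own machine is to time memset against any candidate routine. Below is a rough sketch of such a harness. The names clear_fn, clear_memset and time_clear are invented for this example, and clock() is a coarse timer, so treat the numbers as indicative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the routines being compared. */
typedef void (*clear_fn)(void *dst, size_t n);

/* Baseline: the C library's memset. */
void clear_memset(void *dst, size_t n)
{
    memset(dst, 0, n);
}

/* Returns the CPU seconds spent calling clear(buf, n) reps times. */
double time_clear(clear_fn clear, void *buf, size_t n, int reps)
{
    assert(reps > 0 && n > 0);

    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        clear(buf, n);
        /* Volatile read so the compiler cannot drop a repeated clear. */
        (void)*(volatile unsigned char *)buf;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Pass your own SSE routine as the first argument in place of clear_memset, use a buffer of the size you care about (for example 1 MB), and compare the two results over enough repetitions to get a stable figure.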

0
12 Oct