Clang vs GCC when copying 3 bytes on x86_64 - the number of movs

What should optimized compiled code for copying 3 bytes from one place to another look like, say using memcpy(,,3), in terms of assembly instructions?

Consider the following program:

    #include <string.h>

    int main()
    {
        int* p = (int*) 0x10;
        int x = 0;
        memcpy(&x, p, 4);
        x = x * (x > 1 ? 2 : 3);
        memcpy(p, &x, 4);
        return 0;
    }

It is a little contrived and will cause a segmentation fault, but I need those accesses so that compiling with -O3 doesn't optimize it all away. When I compile this (GodBolt, GCC 6.3, -O3), I get:

    main:
            mov     edx, DWORD PTR ds:16
            xor     eax, eax
            cmp     edx, 1
            setle   al
            add     eax, 2
            imul    eax, edx
            mov     DWORD PTR ds:16, eax
            xor     eax, eax
            ret

Great - a single mov DWORD (= 4 bytes) from memory to register. Nice and optimized. Now, what if I change memcpy(&x, p, 4) to memcpy(&x, p, 3)? The compilation result becomes:

    main:
            mov     DWORD PTR [rsp-4], 0
            movzx   eax, WORD PTR ds:16
            mov     WORD PTR [rsp-4], ax
            movzx   eax, BYTE PTR ds:18
            mov     BYTE PTR [rsp-2], al
            mov     edx, DWORD PTR [rsp-4]
            xor     eax, eax
            cmp     edx, 1
            setle   al
            add     eax, 2
            imul    eax, edx
            mov     DWORD PTR ds:16, eax
            xor     eax, eax
            ret

I am not very proficient with Intel x86_64 assembly (read: I can barely read it when it gets at all complicated), so I didn't quite get this. I mean, I see what happens in the first 6 instructions, more or less, but not why so many of them are needed. Why aren't two movs enough? A mov WORD PTR into al and a mov BYTE PTR into ah?

... So, I came here to ask. While writing this question, I noticed that GodBolt also offers clang as an option. Well, clang (3.9.0, -O3) does this:

    main:                                   # @main
            movzx   eax, byte ptr [18]
            shl     eax, 16
            movzx   ecx, word ptr [16]
            or      ecx, eax
            cmp     ecx, 2
            sbb     eax, eax
            and     eax, 1
            or      eax, 2
            imul    eax, ecx
            mov     dword ptr [16], eax
            xor     eax, eax
            ret

which is more like what I expected. What explains the difference?

Notes:

  • The behavior is the same if I don't initialize x = 0.
  • Other versions of GCC do roughly the same thing as GCC 6.3, but GCC 7 uses 5 movs instead of 6.
  • Other clang versions (starting with 3.4) do roughly the same thing.
  • The behavior is similar if we replace the memcpy'ing with the following:

     #include <string.h>

     typedef struct {
         unsigned char data[3];
     } uint24_t;

     int main()
     {
         uint24_t* p = (uint24_t*) 0x30;
         int x = 0;
         *((uint24_t*) &x) = *p;
         x = x * (x > 1 ? 2 : 3);
         *p = *((uint24_t*) &x);
         return 0;
     }
  • If you want to compare with what happens when the corresponding code is in a function, see this version, or the uint24_t struct version (GodBolt). Then compare with what happens for 4-byte values.

+5
3 answers

Size three is an ugly size, and compilers are not perfect.

The compiler cannot invent an access to a memory address you didn't ask for, so it needs two movs (a word and a byte).

While it may seem trivial to you, remember that you asked for memcpy(&x, p, 3), which is a copy from memory to memory.
Evidently, GCC and older clang versions are not smart enough to realize that there is no need to go through a temporary memory location.

What GCC does in the first six instructions is basically build a DWORD at [rsp-4] out of the three bytes, as you requested:

    mov DWORD PTR [rsp-4], 0        ; DWORD is 0
    movzx eax, WORD PTR ds:16       ; EAX = byte 0 and byte 1
    mov WORD PTR [rsp-4], ax        ; DWORD has byte 0 and byte 1
    movzx eax, BYTE PTR ds:18       ; EAX = byte 2
    mov BYTE PTR [rsp-2], al        ; DWORD has byte 0, byte 1 and byte 2
    mov edx, DWORD PTR [rsp-4]      ; As previously, from here on

movzx eax, ... is used to prevent a partial register stall.
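At the C level, those six instructions amount to roughly the sketch below. This is my own paraphrase of what GCC emits, not code taken from GCC; the names load3_via_stack_temp and src are made up, with src standing in for the address 0x10 and the return value for x.

    #include <string.h>

    /* Sketch only: what GCC's six instructions effectively do. */
    static int load3_via_stack_temp(const unsigned char *src)
    {
        unsigned char tmp[4];        /* the DWORD at [rsp-4]                */
        memset(tmp, 0, sizeof tmp);  /* mov DWORD PTR [rsp-4], 0            */
        memcpy(tmp, src, 2);         /* word load + word store (bytes 0-1)  */
        tmp[2] = src[2];             /* byte load + byte store (byte 2)     */
        int n;
        memcpy(&n, tmp, sizeof n);   /* mov edx, DWORD PTR [rsp-4]          */
        return n;
    }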

The compilers did a great job eliminating the memcpy call and, as you said, this example is "a little contrived" to follow, even for a human. The memcpy optimization must work for any size, including ones that cannot be special-cased. It is not that simple to get everything right.

Given that L1 access latencies have gone down considerably in recent architectures and that [rsp-4] is very likely to be in cache, I'm not sure it's worth fiddling with the optimization code in GCC's source.
It is certainly worth filing a missed-optimization bug report, though, and seeing what the developers have to say.

+4

You should get much better code from copying 4 bytes and masking off the top, e.g. with x & 0x00ffffff. That lets the compiler know it is allowed to read 4 bytes, not just the 3 that the C source reads.

Yes, that helps a ton: it saves gcc and clang from storing a zeroed 4B, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that is still in a register. Part of the difference may come from not knowing whether *p aliases *q.

    int foo(int *p, int *q) {
        //*p = 0;
        //memcpy(p, q, 3);
        *p = (*q) & 0x00ffffff;
        return *p;
    }

        mov     eax, DWORD PTR [rsi]    # load
        and     eax, 16777215           # mask
        mov     DWORD PTR [rdi], eax    # store
        ret                             # and leave it in eax as the return value

Why aren't two movs enough? A mov WORD PTR into al and a mov BYTE PTR into ah?

al and ah are 8-bit registers. You cannot put a 16-bit word into al. That's why your last clang output block loads into two separate registers and combines them with a shift + or, in the case where it knows it is allowed to operate on all 4 bytes of x.
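As a C-level sketch of that shift + or combine (my own paraphrase, not taken from clang; the names load3_shift_or and src are made up):

    /* Sketch: load the low word and the high byte separately, then merge.
       This mirrors clang's movzx / movzx / shl / or sequence. */
    static unsigned load3_shift_or(const unsigned char *src)
    {
        unsigned lo = (unsigned)src[0] | ((unsigned)src[1] << 8); /* movzx ecx, word ptr [16] */
        unsigned hi = (unsigned)src[2];                           /* movzx eax, byte ptr [18] */
        return (hi << 16) | lo;                                   /* shl eax, 16 ; or ecx, eax */
    }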

If you were combining two separate single-byte values, you could load them into al and ah and then use ax, but that causes a partial-register stall on Intel pre-Haswell.

You could do a word load into ax (or, preferably, a movzx into eax, for various reasons including correctness and avoiding a false dependency on the old value of eax), shift eax left, and then do a byte load into al.

But compilers are not keen to do this, because partial-register stuff has been really bad juju for many years and is only efficient on the most recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for "partial register", or look it up in the index. See also the other links in the x86 tag wiki.) Maybe in a few years, -mtune=haswell will enable partial-register tricks to save the OR instruction that clang uses to combine.


Instead of writing such a contrived function:

Write functions that take args and return a value, so you don't have to make them super-weird to stop things from optimizing away. For example, a function that takes two int* args and does a 3-byte memcpy between them.

Here it is on Godbolt (with gcc and clang), with colour highlighting:

    void copy3(int *p, int *q) {
        memcpy(p, q, 3);
    }

clang3.9 -O3 does exactly what you expected: a byte copy and a word copy.

        mov     al, byte ptr [rsi + 2]
        mov     byte ptr [rdi + 2], al
        movzx   eax, word ptr [rsi]
        mov     word ptr [rdi], ax
        ret

To get the silly thing you managed to create, zero the destination first and then read it back after the three-byte copy:

    int foo(int *p, int *q) {
        *p = 0;
        memcpy(p, q, 3);
        return *p;
    }

clang3.9 -O3:

        mov     dword ptr [rdi], 0        # *p = 0
        mov     al, byte ptr [rsi + 2]
        mov     byte ptr [rdi + 2], al    # byte copy
        movzx   eax, word ptr [rsi]
        mov     word ptr [rdi], ax        # word copy
        mov     eax, dword ptr [rdi]      # read the whole thing, causing a store-forwarding stall
        ret

gcc doesn't do any better (except perhaps on CPUs that don't rename partial registers, since it avoids the false dependency on the old value of eax by using movzx for the byte copy).

+7

(Not a real answer, since I can't add anything to what the others have already answered; this is just an example of how I would write this code by hand... mostly for my own curiosity.)

If the function is:

f(24-bit unsigned n):

  • f(0) → 0
  • f(1) → 3
  • f(n) → n*2, for n > 1

(which looks to me like what your question computes),
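In C, that spec would be roughly the sketch below (my own illustration; the answer gives the spec only in prose, and the name f24 is made up):

    /* f(0) = 0, f(1) = 3, f(n) = 2*n for n > 1; n is assumed to fit in 24 bits */
    static unsigned f24(unsigned n)
    {
        return n * (n > 1 ? 2u : 3u);
    }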

then I would hand-compile it as (NASM syntax):

        mov     eax, [16]            ; reads 4 bytes from address 16
    ; f(n) starts here, n = low 24b of eax, modifies edx
        xor     edx, edx
        and     eax, 0x00FFFFFF
        dec     eax
        setz    dl
        lea     eax, [edx+2*eax+2]
    ; output = low 24b of eax, b24..b31 undefined
    ; writes 3 bytes back to address 16
        mov     [16], ax
        shr     eax, 16
        mov     [18], al
0

Source: https://habr.com/ru/post/1262119/

