What should optimized compiled code for copying 3 bytes from one place to another, say using memcpy(,,3), look like in terms of assembly instructions?
Consider the following program:
    #include <string.h>

    int main() {
        int* p = (int*) 0x10;
        int x = 0;
        memcpy(&x, p, 4);
        x = x * (x > 1 ? 2 : 3);
        memcpy(p, &x, 4);
        return 0;
    }
It is a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3 does not make it all go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:
    main:
            mov     edx, DWORD PTR ds:16
            xor     eax, eax
            cmp     edx, 1
            setle   al
            add     eax, 2
            imul    eax, edx
            mov     DWORD PTR ds:16, eax
            xor     eax, eax
            ret
Great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change memcpy(&x, p, 4) into memcpy(&x, p, 3). The compilation result becomes:
    main:
            mov     DWORD PTR [rsp-4], 0
            movzx   eax, WORD PTR ds:16
            mov     WORD PTR [rsp-4], ax
            movzx   eax, BYTE PTR ds:18
            mov     BYTE PTR [rsp-2], al
            mov     edx, DWORD PTR [rsp-4]
            xor     eax, eax
            cmp     edx, 1
            setle   al
            add     eax, 2
            imul    eax, edx
            mov     DWORD PTR ds:16, eax
            xor     eax, eax
            ret
I'm not much of an expert on Intel x86_64 assembly (read: I can't even read it properly when it gets complicated), so I don't quite get this. I mean, I get what is happening in the first 6 instructions and why so many of them are needed. Why aren't two moves enough? A mov WORD PTR into al and a mov BYTE PTR into ah?
... So, I came here to ask. While writing this question, I noticed that GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:
    main:                                   # @main
            movzx   eax, byte ptr [18]
            shl     eax, 16
            movzx   ecx, word ptr [16]
            or      ecx, eax
            cmp     ecx, 2
            sbb     eax, eax
            and     eax, 1
            or      eax, 2
            imul    eax, ecx
            mov     dword ptr [16], eax
            xor     eax, eax
            ret
which is more like what I expected. What explains the difference?
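To make the comparison concrete, here is how I read the two 3-byte load sequences, written back as C. This is only a sketch under stated assumptions: the function names are made up for illustration, uint32_t stands in for the int in the program above, and the register version assumes a little-endian target such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Reference: what the source asks for, a 3-byte memcpy into a zeroed value. */
static uint32_t load3_memcpy(const unsigned char *src) {
    uint32_t x = 0;
    memcpy(&x, src, 3);
    return x;
}

/* Roughly the GCC 6.3 sequence: zero a stack slot, store a 16-bit piece and
   an 8-bit piece into it through memory, then reload all 32 bits. */
static uint32_t load3_via_stack(const unsigned char *src) {
    unsigned char slot[4] = {0, 0, 0, 0};  /* mov DWORD PTR [rsp-4], 0   */
    uint16_t lo;
    memcpy(&lo, src, 2);                   /* movzx eax, WORD PTR ds:16  */
    memcpy(slot, &lo, 2);                  /* mov WORD PTR [rsp-4], ax   */
    slot[2] = src[2];                      /* movzx eax, BYTE PTR ds:18
                                              mov BYTE PTR [rsp-2], al   */
    uint32_t x;
    memcpy(&x, slot, 4);                   /* mov edx, DWORD PTR [rsp-4] */
    return x;
}

/* Roughly the clang 3.9 sequence: combine the two pieces in registers.
   Assumes little-endian byte order. */
static uint32_t load3_in_registers(const unsigned char *src) {
    uint16_t lo;
    memcpy(&lo, src, 2);                   /* movzx ecx, word ptr [16]   */
    uint32_t hi = src[2];                  /* movzx eax, byte ptr [18]   */
    return (hi << 16) | lo;                /* shl eax, 16 ; or ecx, eax  */
}
```

On a little-endian machine all three functions should return the same value for the same input bytes. The only difference is whether the pieces are merged through a stack slot or with a shift and an or.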
Notes:
- The behavior is the same if I do not initialize x = 0.
- Other versions of GCC do roughly the same thing as GCC 6.3, but GCC 7 is down to 5 mov instructions instead of 6.
- Other clang versions (starting with 3.4) do roughly the same thing.
The behavior is similar if we forgo memcpy'ing in favor of the following:
    #include <string.h>

    typedef struct {
        unsigned char data[3];
    } uint24_t;

    int main() {
        uint24_t* p = (uint24_t*) 0x30;
        int x = 0;
        *((uint24_t*) &x) = *p;
        x = x * (x > 1 ? 2 : 3);
        *p = *((uint24_t*) &x);
        return 0;
    }
If you want to contrast this with what happens when the relevant code is in a function, have a look at this, or at the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.
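For readers who want to try the function variant without the absolute addresses, a minimal sketch might look like the following. The name scale3 and its signature are hypothetical and are not taken from the linked code. The body just mirrors main() above, reading 3 bytes and writing 4 back.

```c
#include <string.h>

/* Hypothetical helper: same computation as main() above, but operating on a
   caller-supplied buffer of at least 4 bytes instead of a fixed address. */
void scale3(void *p) {
    int x = 0;
    memcpy(&x, p, 3);            /* copy 3 bytes; the remaining byte of x stays 0 */
    x = x * (x > 1 ? 2 : 3);
    memcpy(p, &x, 4);            /* write back all 4 bytes, as in the original */
}
```

Since the pointer is no longer a compile-time constant, this version can also be run on a real buffer instead of faulting.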