Slow XOR operator

EDIT: Indeed, I had a subtle error in the timing code that produced these results. Once I fixed it, the smart version was faster, as expected. My timing code looked like this:

 bool x = false;
 before = now();
 for (int i = 0; i < N; ++i) {
     x ^= smart_xor(A[i], B[i]);
 }
 after = now();

I used ^= to keep the compiler from optimizing the for-loop away. But the ^= apparently interacts oddly with the two XOR functions. I changed the timing code to simply populate an array of XOR results and do the accumulation with that array outside the timed code, and that fixed the discrepancy.
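
For reference, a minimal sketch of that fixed approach (the names N, A, B, the data setup, and the use of std::chrono are my own assumptions, not the original harness):

 #include <chrono>
 #include <cstdio>
 #include <vector>

 bool smart_xor(bool a, bool b) { return a ^ b; }

 int main() {
     const int N = 10000000;                      // hypothetical problem size
     std::vector<char> A(N), B(N), results(N);
     for (int i = 0; i < N; ++i) { A[i] = i & 1; B[i] = (i >> 1) & 1; }

     // Time only the XOR calls; store each result instead of folding it
     // into an accumulator inside the timed loop.
     auto before = std::chrono::steady_clock::now();
     for (int i = 0; i < N; ++i)
         results[i] = smart_xor(A[i], B[i]);
     auto after = std::chrono::steady_clock::now();

     // Consume the results outside the timed region so the compiler
     // cannot drop the loop as dead code.
     long sum = 0;
     for (int i = 0; i < N; ++i) sum += results[i];

     double ns = std::chrono::duration<double, std::nano>(after - before).count();
     std::printf("sum=%ld  %.3f ns per call\n", sum, ns / N);
 }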

Delete this question?

End edit

I defined two C++ functions as follows:

 bool smart_xor(bool a, bool b) { return a ^ b; }
 bool dumb_xor(bool a, bool b) { return a ? !b : b; }

My timing tests show that dumb_xor() is slightly faster (1.31 ns versus 1.90 ns when inlined, 1.92 ns versus 2.21 ns when not inlined). This puzzles me, since the ^ operator should be a single machine instruction. I am wondering if anyone has an explanation.

The generated assembly (for the non-inlined case) is as follows:

     .file   "xor.cpp"
     .text
     .p2align 4,,15
 .globl _Z9smart_xorbb
     .type   _Z9smart_xorbb, @function
 _Z9smart_xorbb:
 .LFB0:
     .cfi_startproc
     .cfi_personality 0x3,__gxx_personality_v0
     movl    %esi, %eax
     xorl    %edi, %eax
     ret
     .cfi_endproc
 .LFE0:
     .size   _Z9smart_xorbb, .-_Z9smart_xorbb
     .p2align 4,,15
 .globl _Z8dumb_xorbb
     .type   _Z8dumb_xorbb, @function
 _Z8dumb_xorbb:
 .LFB1:
     .cfi_startproc
     .cfi_personality 0x3,__gxx_personality_v0
     movl    %esi, %edx
     movl    %esi, %eax
     xorl    $1, %edx
     testb   %dil, %dil
     cmovne  %edx, %eax
     ret
     .cfi_endproc
 .LFE1:
     .size   _Z8dumb_xorbb, .-_Z8dumb_xorbb
     .ident  "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
     .section .note.GNU-stack,"",@progbits

I am using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.

2 answers

I don't think you benchmarked your code correctly.

In the generated assembly, we can see that your smart_xor() function is:

  movl   %esi, %eax
  xorl   %edi, %eax

while your dumb_xor() function is:

  movl   %esi, %edx
  movl   %esi, %eax
  xorl   $1, %edx
  testb  %dil, %dil
  cmovne %edx, %eax

So obviously the former will be faster; if it isn't, then there is a problem with your benchmark.

You may want to fix your benchmarking code. And remember that you need to run a very large number of calls to get a good, meaningful average.
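
For illustration, a minimal sketch of such a harness (the data sizes, std::chrono timing, and repeat count are my assumptions, not from the answer):

 #include <algorithm>
 #include <chrono>
 #include <cstddef>
 #include <cstdio>
 #include <vector>

 bool smart_xor(bool a, bool b) { return a ^ b; }
 bool dumb_xor(bool a, bool b) { return a ? !b : b; }

 // One timed pass over the data; returns nanoseconds per call.
 template <typename F>
 double time_pass(F f, const std::vector<char>& A, const std::vector<char>& B,
                  std::vector<char>& out) {
     auto t0 = std::chrono::steady_clock::now();
     for (std::size_t i = 0; i < A.size(); ++i)
         out[i] = f(A[i], B[i]);
     auto t1 = std::chrono::steady_clock::now();
     return std::chrono::duration<double, std::nano>(t1 - t0).count() / A.size();
 }

 int main() {
     const std::size_t N = 10000000;
     std::vector<char> A(N), B(N), out(N);
     for (std::size_t i = 0; i < N; ++i) { A[i] = i & 1; B[i] = (i >> 2) & 1; }

     // Repeat each measurement several times and keep the best pass;
     // the first pass also warms up the caches.
     double smart = 1e9, dumb = 1e9;
     for (int run = 0; run < 10; ++run) {
         smart = std::min(smart, time_pass(smart_xor, A, B, out));
         dumb  = std::min(dumb,  time_pass(dumb_xor,  A, B, out));
     }
     std::printf("smart_xor: %.3f ns/call  dumb_xor: %.3f ns/call\n", smart, dumb);
 }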

Given that your "dumb XOR" code is much longer (and most of its instructions depend on the previous one, so they won't execute in parallel), I suspect some kind of measurement error in your results.

The compiler has to produce two instructions for the out-of-line version of the smart XOR, because the registers the arguments arrive in are not the register used for the return value, so the data has to be moved from EDI and ESI into EAX. In the inlined version, the code should be able to use whatever register the data is already in, and if the surrounding code allows it, the result can stay in the register it ends up in.

Calling the function out of line probably takes at least as long as the actual code inside the function.
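
As an illustration (not from the answer, and the exact code generation depends on compiler and flags): when smart_xor() is visible at the call site and gets inlined, a caller like the one below can typically keep everything in registers, so each call reduces to a single xor with no argument shuffling into EDI/ESI and no call/ret overhead.

 bool smart_xor(bool a, bool b) { return a ^ b; }

 // With the definition visible, the compiler can inline smart_xor here and
 // keep the accumulator in a register; each iteration then needs only a
 // load and an xor rather than a call, argument moves, and a return.
 bool xor_all(const bool* v, int n) {
     bool acc = false;
     for (int i = 0; i < n; ++i)
         acc = smart_xor(acc, v[i]);
     return acc;
 }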

It would help if you showed the test harness you use for benchmarking...

Source: https://habr.com/ru/post/943730/

