Why is the assembly not the same for returning by reference and returning by copy when the helper is inlined?

I have a small structure:

    pub struct Foo {
        pub a: i32,
        pub b: i32,
        pub c: i32,
    }

I use the field pairs (a, b), (b, c) and (c, a). To avoid duplicating code, I wrote a helper function that lets me iterate over the pairs:

    impl Foo {
        fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
            [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
        }
    }

I had to decide whether to return the values by reference or to copy the i32s. Later I plan to switch to a non-Copy type instead of i32, so I decided to use references. I expected the resulting code to be equivalent, since everything would be inlined.
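The by-copy helper `get_foo` and the `testing` function that uses it are mentioned later but not shown. A plausible reconstruction, assuming they simply mirror `get_foo_ref` and `testing_ref` with values instead of references (the bodies below are an assumption, not the original code):

```rust
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}

impl Foo {
    // Assumed by-copy counterpart of get_foo_ref: the pairs hold i32 values, not references.
    fn get_foo(&self) -> [(i32, i32); 3] {
        [(self.a, self.b), (self.b, self.c), (self.c, self.a)]
    }
}

// Assumed shape of `testing`: identical to testing_ref, minus the dereferences.
pub fn testing(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo()[i];
        sum += l + r;
    }
    sum
}
```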

I am generally optimistic about the optimizer, so I expected the code using this function to be equivalent to the hand-written version.

First, the version using the helper function:

    pub fn testing_ref(f: Foo) -> i32 {
        let mut sum = 0;
        for i in 0..3 {
            let (l, r) = f.get_foo_ref()[i];
            sum += *l + *r;
        }
        sum
    }

Then the handwritten version:

    pub fn testing_direct(f: Foo) -> i32 {
        let mut sum = 0;
        sum += f.a + f.b;
        sum += f.b + f.c;
        sum += f.c + f.a;
        sum
    }

To my disappointment, all three versions produce different assembly. The worst code is generated for the by-reference version, and the best for the version that doesn't use my helper function at all. Why is that? Shouldn't the compiler generate equivalent code here?

You can view the resulting assembly on Godbolt; I also have the "equivalent" C++ code and its assembly.

In C++, the compiler generates equivalent code for get_foo and get_foo_ref, although I don't understand why the code is not equivalent across all three cases.

Why didn't the compiler create equivalent code for all three cases?

Update:

I modified the code a bit to use arrays and added another direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time, the generated C++ code is exactly the same for every case. The Rust assembly, however, still differs, and returning by reference still produces worse code.

Well, I think this is another example that nothing can be taken for granted.

1 answer

TL;DR: Microbenchmarks are deceptive; the number of instructions does not translate directly into higher or lower performance.


> In the future, I plan to switch to a non-Copy type instead of i32, so I decided to use references.

Then you should check the generated assembly for your new type.
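As a starting point for that check, here is a minimal sketch of the same helper over a non-Copy field type. `Num` is a hypothetical stand-in for the future type, and its `Add`/`AddAssign` impls are assumptions about what that type would provide. The loop also iterates over the returned array directly instead of indexing it; whether either form changes the generated assembly has to be checked with the real type.

```rust
use std::ops::{Add, AddAssign};

// Hypothetical non-Copy stand-in for the future field type (no Copy derive).
#[derive(Debug, PartialEq)]
pub struct Num(i64);

// Adding two borrowed values produces a new owned value, so no clone is needed.
impl<'a> Add<&'a Num> for &'a Num {
    type Output = Num;
    fn add(self, rhs: &'a Num) -> Num {
        Num(self.0 + rhs.0)
    }
}

// Lets the accumulator absorb each owned pair sum.
impl AddAssign<Num> for Num {
    fn add_assign(&mut self, rhs: Num) {
        self.0 += rhs.0;
    }
}

pub struct Foo {
    pub a: Num,
    pub b: Num,
    pub c: Num,
}

impl Foo {
    // Same helper as in the question, now with a non-Copy element type.
    fn get_foo_ref(&self) -> [(&Num, &Num); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}

pub fn testing_ref(f: Foo) -> Num {
    let mut sum = Num(0);
    for (l, r) in f.get_foo_ref() {
        sum += l + r;
    }
    sum
}
```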

In your optimized example, the compiler is being very sneaky:

    pub fn testing_direct(f: Foo) -> i32 {
        let mut sum = 0;
        sum += f.a + f.b;
        sum += f.b + f.c;
        sum += f.c + f.a;
        sum
    }

Which yields:

    example::testing_direct:
            push rbp
            mov rbp, rsp
            mov eax, dword ptr [rdi + 4]
            add eax, dword ptr [rdi]
            add eax, dword ptr [rdi + 8]
            add eax, eax
            pop rbp
            ret

That is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.

That is, the compiler realized that:

  • each f.X is added twice
  • f.X * 2 is equivalent to adding f.X twice
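The rewrite can be checked in plain Rust. Below, the hypothetical `pairwise_sum` follows the shape of testing_direct, and the hypothetical `rewritten_sum` follows the shape of the emitted assembly (load b, add a, add c, then `add eax, eax`). Both use explicit wrapping arithmetic, which matches what release builds do on i32 overflow by default, so the two agree for every input, not just small ones.

```rust
// Shape of testing_direct: three pair sums, accumulated left to right.
pub fn pairwise_sum(a: i32, b: i32, c: i32) -> i32 {
    a.wrapping_add(b)
        .wrapping_add(b.wrapping_add(c))
        .wrapping_add(c.wrapping_add(a))
}

// Shape of the emitted assembly: sum the three fields once, then double
// the result by adding it to itself.
pub fn rewritten_sum(a: i32, b: i32, c: i32) -> i32 {
    let s = b.wrapping_add(a).wrapping_add(c);
    s.wrapping_add(s)
}
```

The equality holds because both functions compute 2 * (a + b + c) modulo 2^32, which is only true since wrapping integer addition is associative and commutative.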

While the former may be prevented in other cases by the indirection, the latter is VERY specific to i32, because regrouping the sum relies on integer addition being both associative and commutative.

For example, if I switch your code to f32 (still Copy, but floating-point addition is no longer associative), I get the same assembly for testing_direct and testing (and slightly different assembly for testing_ref):

    example::testing:
            push rbp
            mov rbp, rsp
            movss xmm1, dword ptr [rdi]
            movss xmm2, dword ptr [rdi + 4]
            movss xmm0, dword ptr [rdi + 8]
            movaps xmm3, xmm1
            addss xmm3, xmm2
            xorps xmm4, xmm4
            addss xmm4, xmm3
            addss xmm2, xmm0
            addss xmm2, xmm4
            addss xmm0, xmm1
            addss xmm0, xmm2
            pop rbp
            ret

And the trick is gone.
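Why the compiler may not apply the integer trick to floats can be shown with a small sketch. The hypothetical `pairwise_sum_f32` mirrors testing_direct over f32, and `rewritten_sum_f32` applies the "sum once, then double" rewrite. With values of very different magnitudes, the two give different answers, so the rewrite would change the program's result.

```rust
// Shape of testing_direct over f32: the additions happen in source order.
pub fn pairwise_sum_f32(a: f32, b: f32, c: f32) -> f32 {
    let mut sum = 0.0;
    sum += a + b;
    sum += b + c;
    sum += c + a;
    sum
}

// The integer-style rewrite: (a + b + c), then doubled by adding it to itself.
pub fn rewritten_sum_f32(a: f32, b: f32, c: f32) -> f32 {
    let s = a + b + c;
    s + s
}
```

With a = 1e20, b = -1e20, c = 1.0, the pair sums b + c and c + a round back to -1e20 and 1e20 (the 1.0 is far below f32 precision at that magnitude), so pairwise_sum_f32 returns 0.0, while rewritten_sum_f32 cancels a and b first and returns 2.0. Swapping the two operands of a single addition, by contrast, never changes the result.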

So you really cannot draw conclusions from your example; check with the real type.


Source: https://habr.com/ru/post/1014745/

