Is it OK to use char[] as parameters, return values, etc.? Any performance issues?

First, some context to make the C++ code more readable: I am writing a compiler (its output is C++), and I gave it the following input:

    var swap = ( int x, y ) =>
    { // Assign a method that takes two ints and returns two ints to a variable named swap.
        var NewX = y
        var NewY = x
    }

    var increment = ( int x ) =>
    {
        var Result = x + 1
    }

NOTE: functions return every variable whose name starts with a capital letter. swap is used as ... = swap( x, y ).NewX , while increment can be used simply as ... = increment( x ).

After some optimization, it generated the following (it turned swap and increment into plain functions instead of variables and optimized swap's stack usage):

    template<int BytesCount>
    struct rawdata { // struct from some header
        char _[ BytesCount ];
        inline char &operator[]( int index ) { return _[ index ]; }
    };

    //...

    rawdata<8> generatedfunction0( rawdata<8> p ) { // var swap = ( int x, y ) => {
        return { p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
    }

    rawdata<4> generatedfunction1( rawdata<4> p ) { // var increment = ( int x ) => {
        rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
        ++*( ( int* )&r[ 0 ] );
        return r;
    }

I'm pretty sure ++*( ( int* )&r[ 0 ] ); won't introduce useless indirection, but what about return { p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] }; ? Is there any source that guarantees it will be optimized as if two ints placed in an array were swapped, rather than 8 or more instructions that move the bytes one by one? I am not asking only about this particular case, but about this kind of code in general.
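To make the intent concrete, here is a hypothetical sketch of what "two ints placed in an array" would mean (the name swap_as_two_ints and the memcpy approach are mine, purely for illustration; they are not part of the generated code):

    #include <string.h>

    // Hypothetical illustration only: the optimization I hope for is that the
    // compiler treats the 8 bytes as two ints and just swaps them.
    rawdata<8> swap_as_two_ints( rawdata<8> p ) {
        int halves[ 2 ];
        memcpy( halves, &p._[ 0 ], sizeof halves );  // load both ints at once
        int tmp = halves[ 0 ];
        halves[ 0 ] = halves[ 1 ];
        halves[ 1 ] = tmp;
        rawdata<8> r;
        memcpy( &r._[ 0 ], halves, sizeof halves );  // store them back
        return r;
    }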

If it matters, I use GCC to compile the generated code.

1 answer

Yes, it can hurt performance, but not always. The problem is the explicit access to individual bytes.

A smart compiler would recognize that you are accessing contiguous memory and optimize it accordingly. However, for some reason this does not happen with gcc, clang or icc at all (I did not test msvc). There is still room for improvement in compiler optimizers, and IIRC the standard does not require any optimization at all.

Swap

So, let's go through each function, starting with swap. I added 2 more functions for comparison; see the descriptions after the code snippet:

    #include <stdint.h>

    rawdata<8> genSWAP( rawdata<8> p ) {
        return { p[ 4 ], p[ 5 ], p[ 6 ], p[ 7 ], p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
    }

    rawdata<8> genSWAPvar( rawdata<8> p ) {
        return { p._[ 4 ], p._[ 5 ], p._[ 6 ], p._[ 7 ], p._[ 0 ], p._[ 1 ], p._[ 2 ], p._[ 3 ] };
    }

    rawdata<8> genSWAP32( rawdata<8> p ) {
        rawdata<8> res = p;
        uint32_t* a = (uint32_t*)&res[ 0 ];
        uint32_t* b = (uint32_t*)&res[ 4 ];
        uint32_t tmp = *a;
        *a = *b;
        *b = tmp;
        return res;
    }
  • genSWAP : your function
  • genSWAPvar : the same as yours, but without using the operator[] that you defined
  • genSWAP32 : explicit handling of your bytes 32 bits at a time

You can view the generated asm here.

genSWAP and genSWAPvar are no different, which means the overloaded operator[] is simply optimized away. However, every byte is loaded from memory individually and processed individually. This is bad, since on a 32-bit architecture the processor can load 4 bytes from memory at once (8 on a 64-bit architecture). So, in short, gcc/clang/icc emit instructions that work against what 32-bit architectures can actually do...

genSWAP32 is more efficient because it performs the minimal number of loads (for 32 bits) and uses registers properly (note that on a 64-bit architecture it should be possible to do only one load instead of 2).
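As a rough sketch of that 64-bit idea (genSWAP64 is my own name, not one of the measured functions; it uses memcpy instead of the pointer casts to sidestep strict-aliasing concerns):

    #include <stdint.h>
    #include <string.h>

    // Hypothetical 64-bit variant: one 8-byte load, a rotate, one 8-byte store.
    rawdata<8> genSWAP64( rawdata<8> p ) {
        uint64_t v;
        memcpy( &v, &p._[ 0 ], sizeof v );   // load all 8 bytes at once
        v = ( v << 32 ) | ( v >> 32 );       // swap the two 32-bit halves
        rawdata<8> res;
        memcpy( &res._[ 0 ], &v, sizeof v ); // store them back
        return res;
    }

A fixed-size memcpy like this is usually compiled down to a single move rather than an actual call, so it should not add overhead.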

And finally, some real measurements: on Ideone, genSWAP32 is almost 4 times faster (which makes sense, since it does 2 loads instead of 8 and fewer computation instructions).
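For reference, a minimal timing harness of the kind that could produce such numbers (my own sketch, not the exact Ideone code; it assumes the rawdata<8> struct and the genSWAP* functions above):

    #include <chrono>
    #include <cstdio>

    // Micro-benchmark sketch: run one swap variant many times and time it.
    // Feeding the result back into the next call keeps the compiler from
    // deleting the loop as dead code.
    template<class F>
    double bench( F f, rawdata<8> input, int iterations ) {
        auto start = std::chrono::steady_clock::now();
        rawdata<8> sink = input;
        for ( int i = 0; i < iterations; ++i )
            sink = f( sink );
        auto stop = std::chrono::steady_clock::now();
        std::printf( "%d\n", (int)sink[ 0 ] ); // use the result so it is not optimized away
        return std::chrono::duration<double>( stop - start ).count();
    }

Comparing bench( genSWAP, input, N ) against bench( genSWAP32, input, N ) for a large N gives the kind of ratio quoted above.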

Increment

Same approach: your function versus an "optimized" version:

    rawdata<4> genINC( rawdata<4> p ) {
        rawdata<4> r = { p[ 0 ], p[ 1 ], p[ 2 ], p[ 3 ] };
        ++*( ( int* )&r[ 0 ] );
        return r;
    }

    rawdata<4> genINC32( rawdata<4> p ) {
        rawdata<4> res = p;
        uint32_t* a = (uint32_t*)&res[ 0 ];
        ++*a;
        return res;
    }

The generated asm is here.

For gcc and icc, the killer is not the increment but the initialization, where you access each byte individually. They probably keep it byte by byte because the byte order could differ from 0 1 2 3. Surprisingly, clang recognizes the byte order and optimizes it correctly: there is no difference.
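If the byte-by-byte initialization is indeed the problem, one possible alternative is to copy the whole payload at once (genINCmemcpy is my own name, not part of the original comparison; it avoids both the per-byte initialization and the aliasing cast):

    #include <stdint.h>
    #include <string.h>

    // Hypothetical variant: copy the 4-byte payload in one go, increment, copy back.
    rawdata<4> genINCmemcpy( rawdata<4> p ) {
        uint32_t v;
        memcpy( &v, &p._[ 0 ], sizeof v );
        ++v;
        rawdata<4> res;
        memcpy( &res._[ 0 ], &v, sizeof v );
        return res;
    }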

Then something interesting happens: genINC32 is slower on gcc but faster on msvc (*I do not see a permalink button on rise4fun, so go there yourself and paste the code tested on Ideone). Without seeing the generated msvc assembly and comparing, I have no explanation for this.

In conclusion, although compilers may eventually optimize all of this code correctly, do not rely on it for now, and avoid accessing each byte individually when you do not need to.

