C++: Why does this speed up my code?

I have the following function

 double single_channel_add(int patch_top_left_row, int patch_top_left_col, int image_hash_key, Mat* preloaded_images, int* random_values){
     int first_pixel_row = patch_top_left_row + random_values[0];
     int first_pixel_col = patch_top_left_col + random_values[1];
     int second_pixel_row = patch_top_left_row + random_values[2];
     int second_pixel_col = patch_top_left_col + random_values[3];
     int channel = random_values[4];
     Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
     Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
     return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
 }

This is called approximately one and a half million times with different values for patch_top_left_row and patch_top_left_col. It takes about 2 seconds to run. But when I change the calculation of first_pixel_row etc. to use hard-coded numbers instead of the arguments (shown below), it runs in under a second, and I don't know why. Is the compiler doing something smart here (I'm using a gcc cross-compiler)?

 double single_channel_add(int patch_top_left_row, int patch_top_left_col, int image_hash_key, Mat* preloaded_images, int* random_values){
     int first_pixel_row = 5 + random_values[0];
     int first_pixel_col = 6 + random_values[1];
     int second_pixel_row = 8 + random_values[2];
     int second_pixel_col = 10 + random_values[3];
     int channel = random_values[4];
     Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
     Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
     return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
 }

EDIT:

I've pasted the assembly of the two versions of the function:

using the arguments: http://pastebin.com/tpCi8c0F
using the constants: http://pastebin.com/bV0d7QH7

EDIT:

After compiling with -O3, I get the following tick counts and times:

using arguments: 1,990,000 ticks and 1.99 seconds
using constants: 330,000 ticks and 0.33 seconds

EDIT:

using arguments with -O3 compilation: http://pastebin.com/fW2HCnHc
using constants with -O3 compilation: http://pastebin.com/FHs68Agi

+4
2 answers

There are instructions on the x86 platform that add small numbers to a register very quickly. These are the lea (aka 'load effective address') instructions, and they are designed to calculate address offsets for structures and the like. The small integer being added is actually encoded as part of the instruction itself. Smart compilers know that these instructions are very fast and use them for additions even when no addresses are involved.

I bet that if you changed the constants to some random value that needed at least 24 bits, you would see most of the speedup disappear.

Secondly, these constants are known values. The compiler can do a lot to arrange for those values to appear in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used), the compiler has no choice but to fetch the number from memory using a stack-offset load instruction. That isn't an especially slow instruction or anything, but with constants the compiler is free to do something much faster than a memory fetch, such as reading the number straight out of the instruction itself. The lea instructions are the most extreme example of this.
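The contrast is easy to reproduce in isolation. A minimal sketch (the function names are mine, not from the question) of the two forms of the addition, with the instruction each typically compiles to noted in comments:

```cpp
#include <cassert>  // used by the checks below

// With a runtime argument, the compiler must load the offset from the
// stack (or a register) before it can add:
int add_offset_arg(int patch_top_left_row, const int* random_values) {
    return patch_top_left_row + random_values[0];  // e.g. addl -68(%rbp), %eax
}

// With a constant, the offset is encoded as an immediate operand inside
// the add instruction itself; no extra load is needed:
int add_offset_const(const int* random_values) {
    return 5 + random_values[0];                   // e.g. addl $5, %eax
}
```

Compiling each with `gcc -S -O0` and diffing the output shows exactly the two `addl` forms discussed below.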

Edit: now that you've pasted the assembly, things are much clearer.

In the non-constant code, here is how the add is done:

 addl -68(%rbp), %eax 

This fetches the value at offset -68(%rbp) on the stack and adds it to the %eax register.

In the constant code, here is how the add is done:

 addl $5, %eax 

and if you look at the actual encoded bytes, you will see this:

 0138 83C005 

It's quite clear that the constant being added is encoded directly into the instruction as a small value. This will be much faster to fetch than a load from a stack offset, for a number of reasons. First, it's smaller. Secondly, it's part of the instruction stream, so no separate memory access is needed: it will be prefetched and pipelined with no possibility of any kind of stall.

So while my speculation about lea instructions was wrong, I was still on the right track. The constant version uses a small instruction specifically geared toward adding a small integer to a register. The non-constant version has to fetch an integer of unknown size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds an extra addition to compute the actual address from the offset and the stack base address).

Edit 2: Now that you've posted the -O3 results

Well, now it's much more confusing. Apparently the function has been inlined, and the code jumps around a lot between the inlined function and the calling function. I would need to see the source for the whole file to do a proper analysis.

But I now suspect that the unpredictability of the values fetched from get_random_number_in_range severely limits the optimization options available to the compiler. In fact, in the constant version it seems that it doesn't even bother to call get_random_number_in_range, because the value is fetched and never used.


I assume that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would move that loop into this function. If the compiler can see that the values are generated as part of a loop, a huge number of optimization options open up to it. In the extreme case it could use some of the SIMD instructions from the various SSE or 3dnow! instruction sets to make things a whole ton faster than even the constant version you're using.
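As a sketch of that suggestion (the raw-pointer image layout and all names here are my assumptions, not the asker's actual code), moving the loop inside the function gives the optimizer a visible loop nest: the offsets become loop-invariant and the inner body becomes a candidate for vectorization:

```cpp
#include <cassert>

// Sketch: assumes a 3-channel 8-bit image stored row-major with `stride`
// bytes per row, similar to OpenCV's default Mat layout.
double sum_single_channel_adds(const unsigned char* image, int stride,
                               int rows, int cols, const int* random_values) {
    // These are loop-invariant, so the compiler can hoist them once...
    const int r0 = random_values[0], c0 = random_values[1];
    const int r1 = random_values[2], c1 = random_values[3];
    const int channel = random_values[4];
    double total = 0.0;
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            const unsigned char* p0 = image + (row + r0) * stride + (col + c0) * 3;
            const unsigned char* p1 = image + (row + r1) * stride + (col + c1) * 3;
            total += p0[channel] + p1[channel];  // ...and vectorize this body
        }
    }
    return total;
}
```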

The other option is to make this function inline, which tells the compiler that it should try to insert it into the loop it's called from. If the compiler takes the hint (the function is somewhat large, so it might not), this has much the same effect as moving the loop into the function.
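A hedged sketch of the inline hint (illustrative names, not the asker's real code): with the definition visible in the same translation unit, the compiler can fold the call into the hot loop and then optimize across the combined body:

```cpp
#include <cassert>

// Marking the function inline (and keeping its definition visible to
// callers, e.g. in a header) lets the compiler merge it into the loop:
inline int channel_add(int row, int col, const int* rv) {
    return (row + rv[0]) + (col + rv[1]);
}

int hot_loop(const int* rv, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)          // after inlining, rv[0] + rv[1]
        total += channel_add(i, i, rv);  // can be hoisted out of the loop
    return total;
}
```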

+5

Well, it is to be expected that binary arithmetic operations of the immediate constant vs. memory form generate faster code than memory vs. memory ones, but the observed timing effect seems too extreme, especially considering that there are other operations inside that function.

Could it be that the compiler decided to inline your function? Inlining would let the compiler easily eliminate everything related to the unused parameters patch_top_left_row and patch_top_left_col in the second version, including any steps that prepare/calculate these parameters in the calling code.
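A small illustration of that point (hypothetical stand-in functions, not the asker's code): once the call is inlined and the parameters are dead, the caller-side work that produced the arguments becomes dead too and can be deleted:

```cpp
#include <cassert>

// Stand-in for the constant version: the first two parameters are no
// longer used in the body after the constant rewrite.
static inline int pixel_sum(int patch_row, int patch_col, const int* rv) {
    (void)patch_row; (void)patch_col;  // dead parameters
    return (5 + rv[0]) + (8 + rv[2]);
}

int run_once(const int* rv) {
    int patch_row = rv[0] * 31 % 97;   // once pixel_sum is inlined, this
    int patch_col = rv[1] * 17 % 89;   // argument preparation is dead code
    return pixel_sum(patch_row, patch_col, rv);  // and can be removed
}
```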

Technically, this can be done even if the function is not inlined, but it is generally more complicated.

+2

Source: https://habr.com/ru/post/1443712/

