Optimization headache - removing ifs from a look-up table

I am trying to optimize the following piece of code, which is the bottleneck in my application. What it does: it takes the doubles value1 and value2 and computes the maximum, plus a correction factor. If the difference between the two values is greater than 5.0 (the LUT is scaled by 10), I can simply take the maximum of the two. If the difference is less than 5.0, I have to apply the correction factor from the LUT.

Does anyone have an idea what a better way to write this piece of code would be? I don't know where I am losing time: is it the large number of ifs, or the multiplication by 10?

    double value1, value2;

    // Lookup table scaled by 10 for ln(1 + exp(-abs(x))), which is almost 0 for x > 5
    // and symmetrical around 0. LUT[0] is x = 0.0, LUT[40] is x = 4.0.
    const logValue LUT[50] = { ... };

    if (value1 > value2)
    {
        if (value1 - value2 >= 5.0)
            return value1;
        else
            return value1 + LUT[(uint8)((value1 - value2) * 10)];
    }
    else
    {
        if (value2 - value1 >= 5.0)
            return value2;
        else
            return value2 + LUT[(uint8)((value2 - value1) * 10)];
    }
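(For reference, the quantity this max-plus-LUT construction approximates is log(exp(value1) + exp(value2)), as the brute-force comparison later in the thread makes explicit. A minimal exact version, written in the numerically stable max-plus-correction form, might look like the sketch below; it is slow, but handy as a ground truth for testing. The function name is ours, not from the original code.)

    #include <algorithm>
    #include <cmath>

    // Exact reference: log(exp(value1) + exp(value2)), computed as
    // max + ln(1 + exp(-|value1 - value2|)) to avoid overflow.
    double log_sum_exp_exact(double value1, double value2)
    {
        double maxv = std::max(value1, value2);
        double diff = std::fabs(value1 - value2);
        return maxv + std::log1p(std::exp(-diff));  // log1p(x) = ln(1 + x)
    }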
+6
7 answers

A few minutes of playing around in Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the inputs to the equation stay within the range it was optimized for.

    double diff = fabs(value1 - value2);
    double dmax = (value1 + value2 + diff) * 0.5;  // same as (min + max + (max - min)) / 2
    if (diff > 5.0)
        return dmax;
    return dmax + 4.473865638 / (2.611112371 + diff) + 0.088190879 * diff - 1.015046114;

P.S. I'm not guaranteeing this is faster, only that it's a close enough approximation to be worth benchmarking.

P.P.S. You can change the constraints to come up with slightly different constants; there are many possible variations. Here's another set I made, where the difference between your table and the formula is always less than 0.008, and where each value is also smaller than the one before it.

    return dmax + 3.441318133 / (2.296924445 + diff) + 0.065529678 * diff - 0.797081529;
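(A quick way to sanity-check that error claim is to sweep the fitted correction term against the exact ln(1 + exp(-x)) at the 50 points the original LUT covers, assuming the table entries are exactly ln(1 + exp(-x)) at x = i/10 as the question describes. The sketch below just prints the worst absolute error; the constants are copied verbatim from the formula above.)

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main()
    {
        double worst = 0.0;
        for (int i = 0; i < 50; ++i)
        {
            double x      = i / 10.0;                  // LUT index i corresponds to x = i/10
            double exact  = std::log1p(std::exp(-x));  // ln(1 + exp(-x))
            double fitted = 3.441318133 / (2.296924445 + x)
                          + 0.065529678 * x
                          - 0.797081529;
            worst = std::max(worst, std::fabs(fitted - exact));
        }
        std::printf("worst absolute error: %f\n", worst);  // the answer's claimed bound is 0.008
    }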

Edit: I timed this code (the second formula) over 100 passes against a million random numbers from 0 to 10, alongside the original code from the question, MSalters' currently accepted answer, and a brute-force max(value1, value2) + log(1.0 + exp(-abs(value1 - value2))). I tried it on a dual-core AMD Athlon and a quad-core Intel i7, and the results were roughly consistent. Here's a typical run (a sketch of a comparable harness follows the list):

  • Original: 1.32 seconds.
  • MSalters: 1.13 seconds.
  • Mine: 0.67 seconds.
  • Brute force: 4.50 seconds.
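(For anyone wanting to reproduce numbers of this kind, a minimal harness along the lines described above, 100 passes over a million random pairs in [0, 10), might look like the sketch below. The candidate function is a placeholder to be replaced by whichever variant is being measured; it is not part of the original code.)

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Placeholder: plug in the variant under test (original LUT code,
    // MSalters' version, the fitted formula, or the brute-force expression).
    double candidate(double value1, double value2);

    int main()
    {
        std::mt19937 rng(12345);
        std::uniform_real_distribution<double> dist(0.0, 10.0);

        std::vector<double> a(1000000), b(1000000);
        for (std::size_t i = 0; i < a.size(); ++i)
        {
            a[i] = dist(rng);
            b[i] = dist(rng);
        }

        double sink = 0.0;  // accumulated so the optimizer cannot discard the calls
        auto start = std::chrono::steady_clock::now();
        for (int pass = 0; pass < 100; ++pass)
            for (std::size_t i = 0; i < a.size(); ++i)
                sink += candidate(a[i], b[i]);
        auto stop = std::chrono::steady_clock::now();

        std::chrono::duration<double> elapsed = stop - start;
        std::printf("%.2f seconds (checksum %f)\n", elapsed.count(), sink);
    }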

Processors have become incredibly fast over the years, so fast that they can do several floating point multiplications and a division faster than they can fetch a value from memory. On modern x86 this approach is not only faster, it is also more accurate: the approximation error of the equation is much smaller than the quantization error introduced by truncating the input for the table lookup.

Naturally, the results may still vary depending on your processor and compiler; benchmarking is still needed for your specific purpose.

+2

If it goes down both paths about equally often, you are probably causing a lot of branch mispredictions and pipeline stalls for your processor.

Have you tried profiling?

I would also suggest trying the standard library functions to see if that helps (for example, whether they can map to dedicated processor instructions):

    double diff = std::fabs(value1 - value2);
    double maxv = std::max(value1, value2);
    return (diff >= 5.0) ? maxv : maxv + LUT[(uint8)(diff * 10)];
+2

I would probably write the code a bit differently, to handle the value2 < value1 case:

    if (value2 < value1)
        std::swap(value1, value2);
    assert(value1 <= value2);  // assertion corrected
    int diff = int((value2 - value1) * 10.0);
    if (diff >= 50)
        diff = 49;  // integer comparison instead of floating point
    return value2 + LUT[diff];
+2

I'm going to assume that when the function is called, you are more likely to hit the case where you have to use the lookup table than the >= 5.0 cases. In that case it is better to steer the compiler toward that path.

    double maxval = value1;
    double difference_scaled = (value1 - value2) * 10;
    if (difference_scaled < 0)
    {
        difference_scaled = -difference_scaled;
        maxval = value2;
    }
    if (difference_scaled < 50)
        return maxval + LUT[(int)difference_scaled];
    else
        return maxval;

Try this and let me know whether it improves your program's performance.
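(If you want to make that bias explicit rather than relying on code layout alone, GCC and Clang expose __builtin_expect, and C++20 adds the [[likely]]/[[unlikely]] attributes. Below is a sketch of the same logic with a hint that the table path is the common case; the function name and the plain double* table parameter are assumptions, not from the original code.)

    double lut_max_hinted(double value1, double value2, const double *LUT)
    {
        double maxval = value1;
        double difference_scaled = (value1 - value2) * 10.0;
        if (difference_scaled < 0)
        {
            difference_scaled = -difference_scaled;
            maxval = value2;
        }
        // Tell the compiler the table branch is the expected one (GCC/Clang extension).
        if (__builtin_expect(difference_scaled < 50.0, 1))
            return maxval + LUT[(int)difference_scaled];
        return maxval;
    }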

+1

The only reason this code is a bottleneck in your application is that you call it many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed so it needs this comparison less often?

0

You compute value1 - value2 several times in your function. Just do it once.

The cast to uint8_t can also be problematic. Performance-wise, the best integral type to convert a double to is usually int, and the best integral type to use as an array index is also int.

    max_value = value1;
    diff = value1 - value2;
    if (diff < 0.0)
    {
        max_value = value2;
        diff = -diff;
    }
    if (diff >= 5.0)
    {
        return max_value;
    }
    else
    {
        return max_value + LUT[(int)(diff * 10.0)];
    }

Note that the above guarantees the LUT index is between 0 (inclusive) and 50 (exclusive), so there is no need for uint8_t.

Edit: After playing with several variations, this is a fairly fast LUT-based approximation of log(exp(value1) + exp(value2)):

    #include <stdint.h>

    // intptr_t *happens* to be fastest on my machine. YMMV.
    typedef intptr_t IndexType;

    double log_sum_exp (double value1, double value2, double *LUT)
    {
        double diff = value1 - value2;
        if (diff < 0.0)
        {
            value1 = value2;
            diff = -diff;
        }
        IndexType idx = diff * 10.0;
        if (idx < 50)
        {
            value1 += LUT[idx];
        }
        return value1;
    }

The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both showed that casting to intptr_t (a long on my machine) and using an intptr_t as the index into the LUT is faster than other integral types, significantly faster than some. For example, unsigned long long and uint8_t are incredibly bad choices on my machine.

The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code says when converting from the floating point type to the integral type, regardless of the optimization level.

Another speedup comes from comparing an integral type against 50 rather than comparing a floating point value against 5.0.

One final note on speed: not all compilers are created equal. On my machine (YMMV), g++ -O3 generates noticeably slower code (25% slower on this problem!) than clang -O3, which in turn generates code slightly slower than clang -O4.

I also played with a rational function approximation (similar to Mark Ransom's answer), but the above obviously does not use this approach.

0

I only did some very quick tests, but please benchmark the code yourself to verify the effect.

Changing LUT[] to a static variable sped it up by roughly a factor of six (from 3.5 s to 0.6 s), which is close to the absolute floor of the test harness I used (0.4 s). See if that works for you, then re-profile to determine whether any further optimization is needed.
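(For clarity, this is roughly what that change amounts to. The sketch assumes the table had been a plain function-local array that was set up on every call; since the actual 50 table values were elided in the question, they are recomputed here from the ln(1 + exp(-x)) definition the question gives.)

    #include <array>
    #include <cmath>

    double lut_max_static(double value1, double value2)
    {
        // static: built once on first use instead of on every call.
        static const std::array<double, 50> LUT = []
        {
            std::array<double, 50> t{};
            for (int i = 0; i < 50; ++i)
                t[i] = std::log1p(std::exp(-i / 10.0));  // ln(1 + exp(-x)) at x = i/10
            return t;
        }();

        double maxval = value1;
        double diff = value1 - value2;
        if (diff < 0.0)
        {
            maxval = value2;
            diff = -diff;
        }
        return (diff >= 5.0) ? maxval : maxval + LUT[(int)(diff * 10.0)];
    }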

For reference, I simply timed this loop (100 million iterations of the inner loop) in VC++ 2010:

    int Counter = 0;
    for (double j = 0; j < 10; j += 0.001)
    {
        for (double i = 0; i < 10; i += 0.001)
        {
            ++Counter;
            Value1 += TestFunc1(i, j);
        }
    }
0

Source: https://habr.com/ru/post/897145/

