Double Subtraction Optimization in C++

I have the following code that I use to calculate the distance between two vectors:

    double dist(vector<double>& vecA, vector<double>& vecB) {
        double curDist = 0.0;
        for (size_t i = 0; i < vecA.size(); i++) {
            double dif = vecA[i] - vecB[i];
            curDist += dif * dif;
        }
        return curDist;
    }

This function is the main bottleneck in my application, because the application performs a very large number of distance calculations: it consumes more than 60% of the processor time on a typical input. Moreover, the following line:

 double dif = vecA[i] - vecB[i]; 

is responsible for more than 77% of the processor time spent in this function. My question is: is it possible to optimize this function somehow?

Notes:

  • To profile my application, I used Intel Amplifier XE;
  • Reducing the number of distance calculations is not an option for me.
+4
4 answers

There are two possible problems here:

  • The computation is memory-bound.
  • There is a loop-carried dependency on curDist.

The computation is memory-bound.

Your dataset is larger than your processor's cache. In that case, no optimization of this function will help unless you can restructure your algorithm.


There is a loop-carried dependency on curDist.

Each iteration depends on the value of curDist from the previous one. This blocks vectorization by the compiler. (Also, do not blindly trust line-level profiler numbers; they can be inaccurate, especially after compiler optimizations.)

Typically, a compiler's vectorizer can split curDist into several partial sums and unroll/vectorize the loop. But it cannot do this under strict floating-point semantics, because that changes the order of the additions. You can try relaxing the floating-point model (for example, -ffast-math on GCC/Clang or /fp:fast on MSVC) if you haven't already. Or you can split the sum and unroll the loop yourself.

For example, this is an optimization the compiler can do with integers, but not necessarily with floating point:

    double curDist0 = 0.0;
    double curDist1 = 0.0;
    double curDist2 = 0.0;
    double curDist3 = 0.0;
    // Note: the condition is written as (i + 3 < vecA.size()) rather than
    // (i < vecA.size() - 3) to avoid size_t underflow when the vector
    // holds fewer than 3 elements.
    for (size_t i = 0; i + 3 < vecA.size(); i += 4) {
        double dif0 = vecA[i + 0] - vecB[i + 0];
        double dif1 = vecA[i + 1] - vecB[i + 1];
        double dif2 = vecA[i + 2] - vecB[i + 2];
        double dif3 = vecA[i + 3] - vecB[i + 3];
        curDist0 += dif0 * dif0;
        curDist1 += dif1 * dif1;
        curDist2 += dif2 * dif2;
        curDist3 += dif3 * dif3;
    }
    // Do some sort of cleanup in case (vecA.size() % 4 != 0)
    double curDist = curDist0 + curDist1 + curDist2 + curDist3;
+5

You can eliminate the call to vecA.size() on every iteration of the loop by calling it once before the loop. You could also unroll the loop to get more computation per iteration. Which compiler are you using, and with what optimization settings? The compiler will often unroll the loop for you, but you can also do it manually.
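A minimal sketch of the hoisting idea (the function name `dist_hoisted` is mine, not from the question; with optimizations enabled, many compilers perform this hoisting on their own):

```cpp
#include <cstddef>
#include <vector>

// Same distance computation, but vecA.size() is read once before the
// loop instead of being re-evaluated in every iteration's condition.
double dist_hoisted(const std::vector<double>& vecA,
                    const std::vector<double>& vecB) {
    double curDist = 0.0;
    const std::size_t n = vecA.size();  // hoisted out of the loop
    for (std::size_t i = 0; i < n; i++) {
        const double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }
    return curDist;
}
```

Taking the vectors by const reference also makes it easier for the compiler to prove that the size cannot change inside the loop.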

+3

If it is possible (i.e. if the range of the numbers is not huge), you could look into storing these numbers in fixed point rather than double.

Fixed point turns these into integer operations rather than double operations.

Another interesting point: if your profile is correct, the memory lookup appears to be a significant factor (otherwise the multiplication would likely be more expensive than the subtraction).

I would try iterating with const iterators rather than random-access indexing. This can help in two ways: 1) the constness, and 2) the sequential nature of the iterator, which lets the processor cache and prefetch better.
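A sketch of the iterator-based variant this answer describes (the name `dist_iter` is mine; the access pattern is the same sequential walk, so any speedup depends on the compiler and hardware):

```cpp
#include <vector>

// Walks both vectors with const iterators instead of indexed access.
// The vectors are visibly unmodified, and the traversal is explicitly
// sequential, which is friendly to hardware prefetching.
double dist_iter(const std::vector<double>& vecA,
                 const std::vector<double>& vecB) {
    double curDist = 0.0;
    auto itB = vecB.cbegin();
    for (auto itA = vecA.cbegin(); itA != vecA.cend(); ++itA, ++itB) {
        const double dif = *itA - *itB;
        curDist += dif * dif;
    }
    return curDist;
}
```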

+2

If your platform does not have (or does not use) an FPU for floating-point math, floating-point libraries are inherently slow and consume additional non-volatile memory. Instead, I suggest using 32-bit ( long ) or 64-bit ( long long ) fixed-point arithmetic, and converting the final result to floating point at the end of the algorithm. I did this in a project a couple of years ago to improve the performance of the I2T algorithm, and it worked great.
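A minimal sketch of this fixed-point approach (the names `to_fixed`, `dist_fixed`, and the Q16.16 scale are my choices, not from the answer; it assumes the values and the accumulated sum fit in int64_t):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Q16.16 fixed point: a value x is stored as round-toward-zero of x * 65536.
const std::int64_t SCALE = 1 << 16;

std::int64_t to_fixed(double x) {
    return static_cast<std::int64_t>(x * SCALE);
}

// Squared distance computed entirely in integer arithmetic; only the
// final result is converted back to double, as the answer suggests.
double dist_fixed(const std::vector<std::int64_t>& a,
                  const std::vector<std::int64_t>& b) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < a.size(); i++) {
        const std::int64_t dif = a[i] - b[i];
        sum += dif * dif;  // squared difference, now in Q32.32
    }
    // Single floating-point conversion at the very end.
    return static_cast<double>(sum) / (static_cast<double>(SCALE) * SCALE);
}
```

Whether this pays off depends on the platform: on hardware with a fast FPU it can easily be slower than plain double arithmetic.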

0

Source: https://habr.com/ru/post/1396479/

