Now there are two possible problems:
- The computation is memory-bound.
- There is an iteration-to-iteration dependency on curDist.
The computation is memory-bound.
Your dataset is larger than your processor's cache. If that is the bottleneck, no amount of tuning the loop will help; you would have to restructure the algorithm.
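As a rough way to check whether this applies, you can compare the size of the data against the cache size. The sketch below assumes the dataset is `numVectors` vectors of `dim` doubles each; the function names and the `cacheBytes` parameter are hypothetical, and you would supply the cache size for your own CPU.

```cpp
#include <cstddef>

// Total bytes occupied by numVectors vectors of dim doubles each.
// (Hypothetical helper; assumes a dense layout with no per-vector overhead.)
std::size_t datasetBytes(std::size_t numVectors, std::size_t dim) {
    return numVectors * dim * sizeof(double);
}

// True if the whole dataset fits in a cache of cacheBytes bytes.
// cacheBytes is an assumption supplied by the caller, for example the
// last-level cache size reported for your processor.
bool fitsInCache(std::size_t numVectors, std::size_t dim,
                 std::size_t cacheBytes) {
    return datasetBytes(numVectors, dim) <= cacheBytes;
}
```

If `fitsInCache` returns false for your last-level cache, the loop is probably waiting on memory most of the time, and unrolling or vectorizing it is unlikely to change much.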
There is an iteration-to-iteration dependency on curDist.
Every iteration adds to curDist, so each iteration depends on the previous one. This prevents the compiler from vectorizing the loop. (Also, do not trust the line profiler's numbers blindly. They can be inaccurate, especially after compiler optimizations.)
Normally the compiler's vectorizer could split curDist into several partial sums and unroll/vectorize the loop. But it cannot do that in strict floating-point mode, because floating-point addition is not associative and reordering the additions can change the result. You can try relaxing the floating-point mode if you haven't already. Or you can split the sum and unroll the loop yourself.
For example, this is the kind of optimization the compiler can do with integers, but not necessarily with floating point:
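    double curDist0 = 0.0;
    double curDist1 = 0.0;
    double curDist2 = 0.0;
    double curDist3 = 0.0;
    size_t i = 0;
    // Four independent accumulators break the dependency chain.
    // "i + 3 < size()" avoids unsigned underflow when size() < 3.
    for (; i + 3 < vecA.size(); i += 4){
        double dif0 = vecA[i + 0] - vecB[i + 0];
        double dif1 = vecA[i + 1] - vecB[i + 1];
        double dif2 = vecA[i + 2] - vecB[i + 2];
        double dif3 = vecA[i + 3] - vecB[i + 3];
        curDist0 += dif0 * dif0;
        curDist1 += dif1 * dif1;
        curDist2 += dif2 * dif2;
        curDist3 += dif3 * dif3;
    }
    // Handle the leftover elements when size() is not a multiple of 4.
    for (; i < vecA.size(); ++i){
        double dif = vecA[i] - vecB[i];
        curDist0 += dif * dif;
    }
    // Combine the partial sums.
    double curDist = (curDist0 + curDist1) + (curDist2 + curDist3);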
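If you would rather not unroll by hand or relax floating-point rules for the whole program, another option is to allow reordering for this one loop only. The sketch below uses the OpenMP `simd` directive with a `reduction` clause, which is a different technique from the manual split. It assumes both vectors have the same length, and the function name `squaredDistance` is made up for illustration. With GCC or Clang the pragma takes effect when you compile with `-fopenmp-simd`; without that the loop still runs correctly, just as a plain scalar loop.

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between vecA and vecB.
// Assumes vecA.size() == vecB.size() (not checked here).
double squaredDistance(const std::vector<double>& vecA,
                       const std::vector<double>& vecB) {
    double curDist = 0.0;
    const std::size_t n = vecA.size();
    // The reduction clause lets the compiler keep one partial sum per
    // SIMD lane and add them together after the loop.
    #pragma omp simd reduction(+:curDist)
    for (std::size_t i = 0; i < n; ++i) {
        const double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }
    return curDist;
}
```

Because the additions may be reordered, the result can differ from the strictly sequential sum in the last few bits. Whether that is acceptable depends on your application.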