Most of the problem here is that the algorithm is too accurate, as @PaulR pointed out. It's usually best to keep your coefficient table no more accurate than your data. In this case, since you appear to be processing uchar data, you should use a roughly 8-bit coefficient table.
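For example, here's a minimal sketch (my own, not from your code) of building such a table for a 7-tap Gaussian, normalized so that the integer taps sum to 256 and the filtered value can be rescaled with a plain shift; the sigma parameter and tap count are assumptions:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: quantize a 7-tap Gaussian to 8-bit fixed point.
// Assumes sigma is large enough that no single tap rounds up to 256.
void makeCoeffs8(uint8_t coeffs[7], float sigma)
{
    float f[7], sum = 0.0f;
    for (int i = 0; i < 7; i++) {
        float x = (float)(i - 3);
        f[i] = std::exp(-x * x / (2.0f * sigma * sigma));
        sum += f[i];
    }
    int total = 0;
    for (int i = 0; i < 7; i++) {
        // Scale so the integer taps sum to 256; the result can then be
        // renormalized with a shift instead of a divide.
        coeffs[i] = (uint8_t)std::lround(f[i] * 256.0f / sum);
        total += coeffs[i];
    }
    coeffs[3] = (uint8_t)((int)coeffs[3] + 256 - total);  // push rounding error into the centre tap
}
```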
Keeping these weights small will be particularly important in your NEON implementation, because the narrower your arithmetic is, the more lanes you can process at once.
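To make that concrete, here's an illustrative fragment (not from your code): with 8-bit operands, a single NEON multiply covers eight lanes and widens to 16 bits for free, whereas the same operation on 16-bit operands only covers four lanes.

```cpp
#include <arm_neon.h>

// Eight 8-bit lanes per multiply, widened to 16 bits in the same instruction:
uint16x8_t weigh8(uint8x8_t pixels, uint8x8_t weights)
{
    return vmull_u8(pixels, weights);
}

// The 16-bit equivalent only covers four lanes per multiply:
uint32x4_t weigh16(uint16x4_t pixels, uint16x4_t weights)
{
    return vmull_u16(pixels, weights);
}
```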
Beyond that, the first major slowdown that stands out is that the edge-reflection code for the image borders sits inside the main loop. That makes the bulk of the work less efficient, because most of the time no special handling is needed at all.
It would likely work better to use a special version of the loop near the edges, and then, once you're safely clear of them, switch to a simplified inner loop that doesn't call that reflect101() function, as sketched below.
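Here's one way that split could look (the structure is my own; reflect101() here is just a plausible stand-in for yours, and coeffs[] is the 8-bit table from above):

```cpp
#include <opencv2/core.hpp>
#include <cstdint>

// Plausible reimplementation of reflect101() (BORDER_REFLECT_101 indexing);
// your version may differ.
static inline int reflect101(int p, int len)
{
    if (p < 0)    return -p;
    if (p >= len) return 2 * len - 2 - p;
    return p;
}

// Vertical 7-tap pass: only rows within `radius` of a border pay for
// reflection; everything in between runs with no edge handling at all.
void verticalPass(const cv::Mat &src, cv::Mat &dst, const uint8_t coeffs[7])
{
    const int radius = 3;
    for (int y = 0; y < src.rows; y++) {
        const bool nearEdge = y < radius || y >= src.rows - radius;
        for (int x = 0; x < src.cols; x++) {
            int sum = 0;
            if (nearEdge) {
                for (int i = -radius; i <= radius; i++)
                    sum += coeffs[i + radius] *
                           src.at<uchar>(reflect101(y + i, src.rows), x);
            } else {
                for (int i = -radius; i <= radius; i++)  // no reflect101() here
                    sum += coeffs[i + radius] * src.at<uchar>(y + i, x);
            }
            dst.at<uchar>(y, x) = (uchar)((sum + 128) >> 8);
        }
    }
}
```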
The second point (more relevant to the prototype code) is that you can add the two wings of the window together before applying the weights, since the table contains the same coefficients on both sides:
```cpp
// Centre tap first, then fold each pair of symmetric taps together
// before multiplying, since both wings share the same coefficient.
sum = src.at<uchar>(y, x) * coeffs[3];
for (int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}
```
This saves you six multiplies per pixel (three in each of the two passes), and it's a step towards some other optimizations around overflow management.
Then there are a few more problems associated with the memory system.
The two-pass approach is good in principle because it saves you a lot of recomputation. Unfortunately, the intermediate buffer can evict useful data from the L1 cache, which can make everything quite a lot slower. It also means that when you write the intermediate result out to memory, you quantize the partial sums, which can reduce accuracy.
When you convert this code to NEON, one of the things you'll want to focus on is trying to keep your working set inside the register file, without discarding intermediate results before they've been fully used.
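As a rough illustration of what that looks like (my own sketch, again assuming the 8-bit coefficient table summing to 256): a vertical 7-tap pass that processes eight columns per iteration and keeps the whole window in registers until the final narrowed store.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical NEON sketch: eight columns of a 7-tap vertical filter per
// iteration. All seven taps accumulate in a single 16-bit vector register,
// so nothing spills to memory before the final 8-bit result.
// Assumes width is a multiple of 8 and the caller stays clear of the image
// borders; since coeffs[] sums to 256, the worst-case accumulator value is
// 255 * 256, which still fits in 16 bits.
void vfilterRowNeon(const uint8_t *src, int stride, uint8_t *dst, int width,
                    const uint8_t coeffs[7])
{
    for (int x = 0; x < width; x += 8) {
        uint16x8_t acc = vdupq_n_u16(0);
        for (int i = 0; i < 7; i++) {
            uint8x8_t row = vld1_u8(src + i * stride + x);   // one tap, 8 columns
            acc = vmlal_u8(acc, row, vdup_n_u8(coeffs[i]));  // widening multiply-accumulate
        }
        vst1_u8(dst + x, vqrshrn_n_u16(acc, 8));  // round, rescale by 256, narrow
    }
}
```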
When people do use two passes, it's usual to transpose the intermediate data, i.e., a column of input becomes a row of output.
This is because the CPU really doesn't like fetching small amounts of data across many lines of the input image. It works much more efficiently (because of the way the cache works) if you gather a bunch of horizontal pixels together and filter those. If the temporary buffer is transposed, then the second pass also gathers a bunch of horizontal points together (which would be vertical in the original orientation), and it transposes its output again so that it comes out the right way around.
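A sketch of that arrangement (illustrative only; hfilterTranspose() and the clamped edge handling are my simplifications, not your code): one routine filters rows and writes its output transposed, and calling it twice produces the full separable blur with both passes reading in cache-friendly row order.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <cstdint>

// Horizontal 7-tap filter whose output is written transposed. Edges are
// clamped here purely for brevity; a real version would reflect.
void hfilterTranspose(const cv::Mat &src, cv::Mat &dst, const uint8_t coeffs[7])
{
    dst.create(src.cols, src.rows, CV_8UC1);  // note the swapped dimensions
    for (int y = 0; y < src.rows; y++)
        for (int x = 0; x < src.cols; x++) {
            int sum = 0;
            for (int i = -3; i <= 3; i++) {
                int xi = std::min(std::max(x + i, 0), src.cols - 1);
                sum += coeffs[i + 3] * src.at<uchar>(y, xi);
            }
            dst.at<uchar>(x, y) = (uchar)((sum + 128) >> 8);  // transposed store
        }
}

// Two transposing row passes give the full separable blur; the second pass
// filters the former columns as rows and leaves the output upright.
void blur(const cv::Mat &src, cv::Mat &dst, const uint8_t coeffs[7])
{
    cv::Mat tmp;
    hfilterTranspose(src, tmp, coeffs);
    hfilterTranspose(tmp, dst, coeffs);
}
```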
If you optimize for locality in your working set, you may not need this transposition trick, but it's worth knowing about so that you can set yourself a good performance baseline. Unfortunately, keeping things local like this does force you back to suboptimal memory access patterns, although with wider data types that penalty can be mitigated.