Fast Gaussian Blur Image Filter with ARM NEON

I am trying to create a fast mobile version of the Gaussian blur image filter.

I have read other questions, for example: Fast Gaussian blur on an unsigned char image (ARM NEON intrinsics, iOS dev).

For my purpose, I only need a kernel of fixed size (7x7) and fixed sigma (2).

So, before optimizing for ARM NEON, I implemented a separable one-dimensional Gaussian kernel in C++ and compared its performance with the OpenCV GaussianBlur() method directly in the mobile environment (Android with the NDK). Working from this simpler code should make the later optimization significantly easier.
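
For reference, the OpenCV call being compared against would be something along these lines (the exact benchmark code is not shown in this question):

    #include <opencv2/imgproc.hpp>

    // Presumed baseline: same kernel size (7x7), sigma (2) and reflected border
    // as the hand-written filter below.
    cv::Mat blurWithOpenCV(const cv::Mat& src)
    {
        cv::Mat ref;
        cv::GaussianBlur(src, ref, cv::Size(7, 7), 2.0, 2.0, cv::BORDER_REFLECT_101);
        return ref;
    }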

However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I have read that OpenCV4Tegra has an optimized implementation of GaussianBlur, but I don't think the standard OpenCV4Android package has such optimizations, so why is my code so slow?

Here is my implementation (note: reflect101 is used to reflect pixels when applying a filter near the borders):

    Mat myGaussianBlur(Mat src){
        Mat dst(src.rows, src.cols, CV_8UC1);
        Mat temp(src.rows, src.cols, CV_8UC1);
        float sum, x1, y1;

        // coefficients of the 1D Gaussian kernel with sigma = 2
        double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402,
                           0.1760326634, 0.1209853623, 0.06475879783};
        // normalize the coefficients
        float coeffs_sum = 0.9230247873f;
        for (int i = 0; i < 7; i++){
            coeffs[i] /= coeffs_sum;
        }

        // filter vertically
        for(int y = 0; y < src.rows; y++){
            for(int x = 0; x < src.cols; x++){
                sum = 0.0;
                for(int i = -3; i <= 3; i++){
                    y1 = reflect101(src.rows, y - i);
                    sum += coeffs[i + 3]*src.at<uchar>(y1, x);
                }
                temp.at<uchar>(y,x) = sum;
            }
        }

        // filter horizontally
        for(int y = 0; y < src.rows; y++){
            for(int x = 0; x < src.cols; x++){
                sum = 0.0;
                for(int i = -3; i <= 3; i++){
                    x1 = reflect101(src.cols, x - i);
                    sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
                }
                dst.at<uchar>(y,x) = sum;
            }
        }

        return dst;
    }
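
The reflect101 helper itself is not shown; a minimal version consistent with how it is used above (and with the inlined edge handling in the answer further down) would be:

    // Assumed implementation of the reflect101 helper (BORDER_REFLECT_101 style):
    // indices just outside the image are mirrored without repeating the border pixel.
    static inline int reflect101(int size, int p)
    {
        if (p < 0)     return -p;               // e.g. -1 -> 1
        if (p >= size) return 2 * size - p - 2; // e.g. size -> size - 2
        return p;
    }
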
+4
4 answers

This is the code after implementing all of @Paul R's and @sh1's suggestions, summarized as follows:

1) use only integer arithmetic (with precision to taste)

2) add the pixel values that sit at the same distance from the center of the mask before applying the multiplications, to reduce the number of multiplications

3) apply only horizontal filters, to take advantage of the row-major storage of the matrices

4) separate the loops near the edges from the loop over the inside of the image, so as not to make unnecessary calls to the reflection function. I removed the reflection function entirely by inlining its logic in the edge loops.

5) In addition, as a personal observation, to round correctly without calling round() or cvRound(), I add 0.5 (= 32768 in the 16.16 fixed-point representation) to both the intermediate and the final pixel results, which reduces the error/difference with respect to OpenCV.
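
As a minimal illustration of points 1) and 5) in isolation (the numbers here are only for demonstration; the full code below is the real thing):

    // 16.16 fixed point: scale a coefficient by 2^16, accumulate in integers,
    // then add 0.5 (= 32768) before dividing back down, instead of calling round().
    static unsigned char weightedPixel(unsigned char p)
    {
        const int w = (int)(0.216106 * 65536);          // one kernel coefficient
        int sum = w * p;                                // all-integer arithmetic
        return (unsigned char)((sum + 32768) / 65536);  // rounds to nearest
    }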

Now the performance is much better: from about 15 times slower to about 6 times slower than OpenCV.

However, the resulting matrix is not completely identical to the one produced by OpenCV's Gaussian blur. This is not due to the precision of the arithmetic (which is sufficient) but to rounding error. Note that the difference is minimal, between 0 and 2 (in absolute value) in pixel intensity, between the matrices produced by the two versions. The coefficients are the same ones OpenCV produces with getGaussianKernel for the same size and sigma.
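
For reference, the OpenCV coefficients mentioned here can be obtained directly with the standard API:

    #include <opencv2/imgproc.hpp>

    // 7x1 column vector of doubles that sums to 1, for size 7 and sigma 2 -
    // the same values used (after fixed-point scaling) in the code below.
    static cv::Mat referenceKernel()
    {
        return cv::getGaussianKernel(7, 2.0, CV_64F);
    }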

    Mat myGaussianBlur(Mat src){
        Mat dst(src.rows, src.cols, CV_8UC1);
        Mat temp(src.rows, src.cols, CV_8UC1);
        int sum;
        int x1;
        double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106,
                           0.190713, 0.131075, 0.070159};
        int coeffs_i[7] = { 0 };
        for (int i = 0; i < 7; i++){
            coeffs_i[i] = (int)(coeffs[i] * 65536); // 16.16 fixed point
        }

        // filter horizontally - inside the image
        for(int y = 0; y < src.rows; y++){
            uchar *ptr = src.ptr<uchar>(y);
            for(int x = 3; x < (src.cols - 3); x++){
                sum = ptr[x] * coeffs_i[3];
                for(int i = -3; i < 0; i++){
                    int tmp = ptr[x+i] + ptr[x-i];
                    sum += coeffs_i[i + 3]*tmp;
                }
                temp.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        // filter horizontally - edges - need reflection
        for(int y = 0; y < src.rows; y++){
            uchar *ptr = src.ptr<uchar>(y);
            for(int x = 0; x <= 2; x++){
                sum = 0;
                for(int i = -3; i <= 3; i++){
                    x1 = x + i;
                    if(x1 < 0){
                        x1 = -x1;
                    }
                    sum += coeffs_i[i + 3]*ptr[x1];
                }
                temp.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        for(int y = 0; y < src.rows; y++){
            uchar *ptr = src.ptr<uchar>(y);
            for(int x = (src.cols - 3); x < src.cols; x++){
                sum = 0;
                for(int i = -3; i <= 3; i++){
                    x1 = x + i;
                    if(x1 >= src.cols){
                        x1 = 2*src.cols - x1 - 2;
                    }
                    sum += coeffs_i[i + 3]*ptr[x1];
                }
                temp.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        // transpose so the second pass can also run horizontally - better cache data locality
        transpose(temp, temp);

        // filter horizontally - inside the image
        for(int y = 0; y < src.rows; y++){
            uchar *ptr = temp.ptr<uchar>(y);
            for(int x = 3; x < (src.cols - 3); x++){
                sum = ptr[x] * coeffs_i[3];
                for(int i = -3; i < 0; i++){
                    int tmp = ptr[x+i] + ptr[x-i];
                    sum += coeffs_i[i + 3]*tmp;
                }
                dst.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        // filter horizontally - edges - need reflection
        for(int y = 0; y < src.rows; y++){
            uchar *ptr = temp.ptr<uchar>(y);
            for(int x = 0; x <= 2; x++){
                sum = 0;
                for(int i = -3; i <= 3; i++){
                    x1 = x + i;
                    if(x1 < 0){
                        x1 = -x1;
                    }
                    sum += coeffs_i[i + 3]*ptr[x1];
                }
                dst.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        for(int y = 0; y < src.rows; y++){
            uchar *ptr = temp.ptr<uchar>(y);
            for(int x = (src.cols - 3); x < src.cols; x++){
                sum = 0;
                for(int i = -3; i <= 3; i++){
                    x1 = x + i;
                    if(x1 >= src.cols){
                        x1 = 2*src.cols - x1 - 2;
                    }
                    sum += coeffs_i[i + 3]*ptr[x1];
                }
                dst.at<uchar>(y,x) = (sum + 32768) / 65536;
            }
        }

        transpose(dst, dst);

        return dst;
    }
+2

Most of the problem here is that the algorithm is too precise, as @PaulR pointed out. It's usually best not to keep your coefficient table any more precise than your data. In this case, since you appear to be processing uchar data, you should use a coefficient table with roughly 8-bit precision.
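
As an illustration (these exact values are my own, not taken from the answer), an 8-bit table for sigma = 2 could be derived from the doubles above and nudged so that it sums to exactly 256, which makes the final division by 256 an exact normalization:

    // Hypothetical 8-bit fixed-point weights for sigma = 2, kernel size 7:
    // 0.070159, 0.131075, 0.190713, 0.216106, ... scaled by 256 and rounded,
    // with the centre tap adjusted by 1 so the total is exactly 256.
    const unsigned char coeffs8[7] = { 18, 34, 49, 54, 49, 34, 18 }; // sum = 256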

Keeping these weights small will be particularly important in your NEON implementation, because the narrower the arithmetic you use, the more lanes you can process at once.
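
For instance, here is a minimal NEON sketch (my own illustration of the idea, under the 8-bit-weights assumption above, not code from this answer) of one horizontal pass over 8 interior pixels:

    #include <arm_neon.h>
    #include <stdint.h>

    // Widening multiply-accumulate: 8-bit pixels x 8-bit weights into 16-bit lanes.
    // With weights summing to 256 the accumulator cannot exceed 255*256, so it
    // fits in 16 bits. 'p' must point at an interior pixel (edges handled separately).
    static inline uint8x8_t blur7_row8(const uint8_t *p, const uint8_t w[7])
    {
        uint16x8_t acc = vmull_u8(vld1_u8(p - 3), vdup_n_u8(w[0]));
        acc = vmlal_u8(acc, vld1_u8(p - 2), vdup_n_u8(w[1]));
        acc = vmlal_u8(acc, vld1_u8(p - 1), vdup_n_u8(w[2]));
        acc = vmlal_u8(acc, vld1_u8(p    ), vdup_n_u8(w[3]));
        acc = vmlal_u8(acc, vld1_u8(p + 1), vdup_n_u8(w[4]));
        acc = vmlal_u8(acc, vld1_u8(p + 2), vdup_n_u8(w[5]));
        acc = vmlal_u8(acc, vld1_u8(p + 3), vdup_n_u8(w[6]));
        return vrshrn_n_u16(acc, 8);  // rounding shift back down to 8 bits
    }

Narrowing straight back to 8 bits like this quantizes the intermediate result between the two passes, which (as noted further down) costs a little precision; a 16-bit intermediate avoids that at the price of halving the lane count.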

In addition, the first significant slowdown that stands out is that the reflection code for the image edges sits in the main loop. This makes the bulk of the work less efficient, because in the common case it doesn't need to do anything special at all.

This would work much better if you used a special version of the loop near the edges, and then, once you are clear of them, a simplified inner loop that doesn't call reflect101().

The second point (more relevant to the prototype code) is that you can add the two wings of the window together before applying the weighting, because the table contains the same coefficients on both sides:

    sum = src.at<uchar>(y1, x) * coeffs[3];
    for(int i = -3; i < 0; i++) {
        int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
        sum += coeffs[i + 3] * tmp;
    }

This saves you six multiplications per pixel, and this is a step towards some other optimizations around overflow control.

Then there are a few more problems associated with the memory system.

The two-pass approach is good in principle because it saves you from doing a lot of recomputation. Unfortunately, it can push useful data out of the L1 cache, which can make everything a lot slower. It also means that when you write the intermediate result to memory, you quantize the subtotal, which can reduce accuracy.

When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they have been fully used.

When people do use two passes, it is usually with the intermediate data transposed, i.e. a column of the input becomes a row of the output.

This is because the CPU really does not like fetching small amounts of data across several rows of the input image. It works much more efficiently (because of how the cache works) if you collect a bunch of horizontal pixels together and filter those. If the temporary buffer is transposed, then the second pass also gathers a bunch of horizontal points together (which would be vertical in the original orientation), and it transposes its output again so that it comes out the right way around.
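
Schematically, the structure described here looks like the following (horizontalBlur7 is a hypothetical row-only 7-tap filter, for example the horizontal-pass loops from the code earlier in this thread; cv::transpose is the standard OpenCV function):

    #include <opencv2/core.hpp>

    void horizontalBlur7(const cv::Mat& src, cv::Mat& dst); // hypothetical row-only filter

    static cv::Mat separableBlurViaTranspose(const cv::Mat& src)
    {
        cv::Mat tmp, dst;
        horizontalBlur7(src, tmp);  // pass 1: filter along the rows of src
        cv::transpose(tmp, tmp);    // columns become rows
        horizontalBlur7(tmp, dst);  // pass 2: the original columns, again row-wise
        cv::transpose(dst, dst);    // restore the original orientation
        return dst;
    }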

If you manage to keep your working set local enough, you may not need this transposition trick, but it is worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localization like this does force you back to suboptimal memory fetches, but with wider data types that penalty can be mitigated.

+6

If this is specifically for 8-bit images, then you really don't need floating-point coefficients, and especially not double precision. Also, you don't want to use float for x1 and y1; just use plain integers for the coordinates. You can then use fixed-point (i.e. integer) coefficients to keep all of the filter arithmetic in the integer domain, e.g.:

    Mat myGaussianBlur(Mat src){
        Mat dst(src.rows, src.cols, CV_8UC1);
        Mat temp(src.rows, src.cols, CV_16UC1);  // <<<
        int sum, x1, y1;                         // <<<

        // coefficients of the 1D Gaussian kernel with sigma = 2
        double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402,
                           0.1760326634, 0.1209853623, 0.06475879783};
        int coeffs_i[7] = { 0 };                 // <<<
        // normalize the coefficients and convert to 8.8 fixed point
        float coeffs_sum = 0.9230247873f;
        for (int i = 0; i < 7; i++){
            coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
        }

        // filter vertically
        for(int y = 0; y < src.rows; y++){
            for(int x = 0; x < src.cols; x++){
                sum = 0;                                           // <<<
                for(int i = -3; i <= 3; i++){
                    y1 = reflect101(src.rows, y - i);
                    sum += coeffs_i[i + 3]*src.at<uchar>(y1, x);   // <<<
                }
                temp.at<ushort>(y,x) = sum;                        // <<< 16-bit intermediate
            }
        }

        // filter horizontally
        for(int y = 0; y < src.rows; y++){
            for(int x = 0; x < src.cols; x++){
                sum = 0;                                           // <<<
                for(int i = -3; i <= 3; i++){
                    x1 = reflect101(src.cols, x - i);
                    sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
                }
                dst.at<uchar>(y,x) = sum / (256 * 256);            // <<<
            }
        }

        return dst;
    }
+2

According to Google's Android documentation, on Android devices using float/double is about twice as slow as using int/uchar.

You can find some suggestions for speeding up your C++ code in the Android performance docs: https://developer.android.com/training/articles/perf-tips

0

Source: https://habr.com/ru/post/1489873/

