Fast Gaussian blur on an unsigned char image - ARM Neon Intrinsics - iOS Dev

Can someone suggest a fast function for applying a Gaussian blur to an image using a 5x5 mask? I need this for an iOS app. I am working directly on a memory buffer allocated as follows:

    unsigned char *image_sqr_Baseaaddr = (unsigned char *) malloc(noOfPixels);

    for (row = 2; row < H-2; row++) {
        for (col = 2; col < W-2; col++) {
            newPixel = 0;
            for (rowOffset = -2; rowOffset <= 2; rowOffset++) {
                for (colOffset = -2; colOffset <= 2; colOffset++) {
                    rowTotal = row + rowOffset;
                    colTotal = col + colOffset;
                    iOffset = (unsigned long)(rowTotal*W + colTotal);
                    newPixel += (*(imgData + iOffset)) * gaussianMask[2 + rowOffset][2 + colOffset];
                }
            }
            i = (unsigned long)(row*W + col);
            *(imgData + i) = newPixel / 159;
        }
    }

This is clearly the slowest function in my app. I have heard that ARM NEON intrinsics on iOS can perform several operations in one cycle. Maybe that is the way to go?

The problem is that I am not very familiar with NEON and I don't have time to learn assembly language. So it would be great if someone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.

+1
2 answers

Before optimizing with NEON SIMD, you should first improve your scalar implementation. The biggest problem with your code is that it treats the filter as non-separable, whereas a Gaussian kernel is separable. Switching to a separable implementation reduces the number of operations per pixel from N^2 to 2N, which for your 5x5 kernel means going from 25 multiply-adds to 10: a 2.5x speedup for very little effort.
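The operation count above follows from writing the 2-D kernel as a product of two 1-D kernels. A short sketch of that identity, where g (the 1-D kernel), G (the 2-D kernel) and I (the image) are notation introduced here for illustration:

```latex
% g: 1-D kernel, G: 2-D kernel, I: image, N: kernel width (N = 5 here)
G(i, j) = g(i)\, g(j)
\quad\Longrightarrow\quad
(I * G)(x, y) = \sum_{i=-2}^{2} g(i) \left[ \sum_{j=-2}^{2} g(j)\, I(x - j,\; y - i) \right]
% Direct 2-D sum:   N^2 = 25 multiply-adds per pixel.
% Two 1-D passes:   2N  = 10 multiply-adds per pixel, i.e. N/2 = 2.5 times fewer.
```

The inner sum is a horizontal pass over each row; the outer sum is a vertical pass over the result of the first.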

A sufficiently optimized scalar implementation may well meet your needs without resorting to SIMD. If not, you can at least carry these scalar optimizations over to a vectorized implementation.
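A minimal sketch of what such a separable scalar implementation could look like. The question does not show its gaussianMask, so this assumes the binomial kernel [1 4 6 4 1] (sum 16) as a stand-in; the function name blur5_separable, the temporary buffer, and the edge clamping are likewise choices made here, not taken from the question.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed 1-D kernel: binomial approximation of a Gaussian, sum = 16.
   This is a stand-in; the question's own mask is not shown. */
static const int K5[5] = {1, 4, 6, 4, 1};

/* Clamp an index into [0, n-1] so that edge pixels are replicated. */
static int clampi(int v, int n) { return v < 0 ? 0 : (v >= n ? n - 1 : v); }

/* Blur an 8-bit, row-major, w*h image from src into dst (dst must not alias src).
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int blur5_separable(const uint8_t *src, uint8_t *dst, int w, int h)
{
    assert(src && dst && src != dst && w > 0 && h > 0);

    uint16_t *tmp = malloc((size_t)w * (size_t)h * sizeof *tmp);
    if (!tmp) return -1;

    /* Horizontal pass: 5 multiply-adds per pixel.
       Sums stay unnormalized (at most 255 * 16 = 4080, fits in 16 bits). */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int k = -2; k <= 2; k++)
                acc += K5[k + 2] * src[(size_t)y * w + clampi(x + k, w)];
            tmp[(size_t)y * w + x] = (uint16_t)acc;
        }
    }

    /* Vertical pass: 5 more multiply-adds per pixel.
       Normalize once by 16 * 16 = 256, with rounding. */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int k = -2; k <= 2; k++)
                acc += K5[k + 2] * tmp[(size_t)clampi(y + k, h) * w + x];
            dst[(size_t)y * w + x] = (uint8_t)((acc + 128) >> 8);
        }
    }

    free(tmp);
    return 0;
}
```

Unlike the loop in the question, this writes to a separate output buffer, so already-blurred pixels are never read back as input, and it clamps at the borders instead of skipping the outer two rows and columns. Choosing a kernel whose total weight is a power of two also turns the final division into a shift.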


http://en.wikipedia.org/wiki/Gaussian_blur

http://blogs.mathworks.com/steve/2006/11/28/separable-convolution-part-2/

+5
  • Separate your kernel, as Paul R. describes.
  • Do not reinvent the wheel. Use vImage, which is part of the Accelerate framework and implements vectorized, multithreaded convolution for you. In particular, it looks like you need the vImageConvolve_Planar8 function.
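A hedged sketch of how the vImageConvolve_Planar8 call might be wired up for a tightly packed 8-bit grayscale buffer. The wrapper name blur5_planar8, the 5x5 kernel (the outer product of [1 4 6 4 1], weights summing to 256) and the kvImageEdgeExtend edge mode are assumptions made for illustration; the question's own mask and divisor of 159 could be passed instead. The non-Apple branch is only a plain scalar stand-in so the sketch compiles elsewhere, and it is not guaranteed to match vImage's rounding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Assumed kernel: outer product of [1 4 6 4 1]; the weights sum to 256. */
static const int16_t KERNEL5X5[25] = {
    1,  4,  6,  4, 1,
    4, 16, 24, 16, 4,
    6, 24, 36, 24, 6,
    4, 16, 24, 16, 4,
    1,  4,  6,  4, 1
};

/* Blur a tightly packed w*h 8-bit image from src into dst.
   src and dst must be different buffers. Returns 0 on success, -1 on error. */
int blur5_planar8(const uint8_t *src, uint8_t *dst, int w, int h)
{
    assert(src && dst && src != dst && w > 0 && h > 0);

#if defined(__APPLE__)
    /* vImage_Buffer fields: data, height, width, rowBytes. */
    vImage_Buffer in  = { (void *)src, (vImagePixelCount)h, (vImagePixelCount)w, (size_t)w };
    vImage_Buffer out = { dst,         (vImagePixelCount)h, (vImagePixelCount)w, (size_t)w };

    /* NULL temp buffer lets vImage allocate its own scratch space;
       256 is the divisor matching the kernel's total weight. */
    vImage_Error err = vImageConvolve_Planar8(&in, &out, NULL, 0, 0,
                                              KERNEL5X5, 5, 5, 256,
                                              0, kvImageEdgeExtend);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Stand-in for non-Apple builds: direct 5x5 convolution with clamped edges. */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int j = -2; j <= 2; j++) {
                int yy = y + j < 0 ? 0 : (y + j >= h ? h - 1 : y + j);
                for (int i = -2; i <= 2; i++) {
                    int xx = x + i < 0 ? 0 : (x + i >= w ? w - 1 : x + i);
                    acc += KERNEL5X5[(j + 2) * 5 + (i + 2)] * src[(size_t)yy * w + xx];
                }
            }
            dst[(size_t)y * w + x] = (uint8_t)(acc / 256);
        }
    }
    return 0;
#endif
}
```

If the image rows are padded, rowBytes should be set to the actual stride rather than the width.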
+4

Source: https://habr.com/ru/post/1489875/

