Why does kernel convolution work?

I don’t understand how someone could come up with a simple 3x3 matrix, called a kernel, such that when it is applied to an image it creates some amazing effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing) . Why does it work? How did people come up with these kernels (trial and error?)? Is it possible to prove that it will always work for all images?

+6
5 answers

Intuitively, convolving an image I with a kernel K produces a new image in which each pixel is a weighted sum of all the neighboring pixels of I, with the weights given by K. Even if you do not know what convolution is, this idea still seems pretty reasonable. You can use it to produce a blur effect (by using a Gaussian weighting of neighboring pixels) or to sharpen edges (by subtracting each neighbor from the pixel and not weighting anything else). In fact, if you knew you needed all of these operations, it would make sense to write a single function that, given I and K, computes this weighted sum of neighboring pixels, and then to optimize that function as aggressively as possible (since you would probably use it a lot).
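
For what it's worth, here is a minimal sketch of that "weighted sum of neighbors" function in Python/NumPy (the zero-padding at the borders and all the names are my own choices, not anything prescribed by convolution itself):

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'weighted sum of neighbors' convolution with zero-padded borders."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    flipped = kernel[::-1, ::-1]          # true convolution flips the kernel
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            region = padded[y:y + kh, x:x + kw]
            out[y, x] = np.sum(region * flipped)
    return out

# Example: a simple box blur on a tiny image.
img = np.arange(25, dtype=float).reshape(5, 5)
box = np.ones((3, 3)) / 9.0
print(convolve2d(img, box))
```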

To get from that idea to convolution, you probably need a background in Fourier transforms and Fourier series. Convolution is a completely natural idea in that setting: if you compute the Fourier transforms of two images and multiply them together, you end up computing the transform of their convolution. Mathematicians worked this out a while ago, probably while answering the very natural question "what function has a Fourier transform equal to the product of two other Fourier transforms?", and from there it was only a matter of time before the connection was put to use. Since Fourier transforms were already widely used in computing (for example, in signal processing), I assume someone with a Fourier background realized they needed to apply a kernel K to an image I and then recognized that it is easier and more efficient to do so in frequency space.
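
That relationship is easy to check numerically. A small sketch using NumPy's FFT (note that the discrete Fourier transform gives you circular convolution, so that is what is compared here):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))

# Circular convolution computed directly as a weighted sum of shifted copies...
direct = np.zeros_like(a)
for dy in range(8):
    for dx in range(8):
        direct += b[dy, dx] * np.roll(a, (dy, dx), axis=(0, 1))

# ...and via the convolution theorem: multiply the transforms, then invert.
via_fft = np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

print(np.allclose(direct, via_fft))   # True
```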

I honestly have no idea what the real story is, but this is a pretty plausible explanation.

Hope this helps!

+7

I don’t understand how someone could come up with a simple 3x3 matrix, called a kernel, such that when applied to an image it would have some amazing effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing) .

If you want to dig into the history, you will need to look up a few other terms. In older image processing textbooks, what we call kernels today will most likely be called operators. Another key term is convolution. Both of these terms hint at the mathematical basis of kernels.

http://en.wikipedia.org/wiki/Convolution

You can read about mathematical convolution in the textbook Computer Vision by Ballard and Brown. The book dates back to the early 80s, but it is still very useful, and you can read it for free online:

http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/toc.htm

From the table of contents in Ballard and Brown, you will find a PDF link for section 2.2.4 Spatial Properties.

http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/LIB/bandb2_2.pdf

In that PDF, scroll down to the section on the "Convolution Theorem." It provides the mathematical background for convolution. It is a relatively short step from thinking about convolution expressed as functions and integrals to applying the same principles to the discrete world of grayscale (or color) data in 2D images.
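
In discrete 2D form, that step lands you at the familiar formula (my notation, for an image $I$ and a $(2a+1)\times(2b+1)$ kernel $K$):

$$(I * K)(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} K(s, t)\, I(x - s,\; y - t)$$

which is exactly the "weighted sum of neighbors" described in the other answers.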

You will notice that a number of kernels/operators are associated with names: Sobel, Prewitt, Laplacian, Gaussian, and so on. These names hint that there is a history, a rather long history, of mathematical development and image processing research that led to the large number of kernels in wide use today.

Gauss and Laplace lived long before us, but their mathematical work has trickled down into forms we can use in image processing. They did not work on image processing kernels, but the mathematical techniques they developed are directly applicable and commonly used in image processing. Other kernels were developed specifically for image processing.

The Prewitt operator (kernel), which is very similar to the Sobel operator, was published in 1970 if Wikipedia is correct.

http://en.wikipedia.org/wiki/Prewitt_operator

Why does it work?

Read up on the mathematical theory of convolution to understand how one function can be "swept" or "dragged" over another. That should explain the theoretical basis.

Then there is the question of why particular kernels work. In this case, consider an edge transition from dark to light in an image. If you plot the pixel brightness along a line crossing that edge, you will notice that the values rise rapidly across the edge transition. That edge transition is a slope, and a slope can be found using the first derivative. Ta-dah! A kernel that approximates a first-derivative operator will find edges.
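
As a concrete example of that "first derivative finds edges" idea, here is a sketch using the Prewitt horizontal-gradient kernel (scipy.ndimage and the toy image are my own choices):

```python
import numpy as np
from scipy import ndimage

# Prewitt kernel for horizontal intensity changes: a central difference
# (right neighbor minus left neighbor) averaged over three rows.
prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)

# A synthetic image: dark on the left, bright on the right.
img = np.zeros((5, 6))
img[:, 3:] = 255.0

edges = ndimage.convolve(img, prewitt_x)
print(edges)   # large magnitudes only near the dark-to-bright transition
```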

If you know that Gaussian blur exists in optics, you might wonder how it could be applied to a two-dimensional digital image. Hence the derivation of the Gaussian kernel.

The Laplacian, for example, is an operator that, according to the first sentence of its Wikipedia article, "is a differential operator given by the divergence of the gradient of a function on Euclidean space."

http://en.wikipedia.org/wiki/Laplacian

Boy, that is quite a leap from that definition to a kernel. The following page does a nice job of explaining the relationship between derivatives and kernels, and it is a quick read:

http://www.aishack.in/2011/04/the-sobel-and-laplacian-edge-detectors/
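
In case that page moves, the gist of the derivation is just finite differences (my sketch of the standard argument):

$$\frac{\partial^2 f}{\partial x^2} \approx f(x+1, y) - 2f(x, y) + f(x-1, y), \qquad \frac{\partial^2 f}{\partial y^2} \approx f(x, y+1) - 2f(x, y) + f(x, y-1)$$

Adding the two gives

$$\nabla^2 f \approx f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4f(x, y),$$

and reading off the coefficient attached to each neighbor gives the 3x3 Laplacian kernel with 1s around a central -4.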

You will also see that one form of the Laplacian kernel is simply called the "edge detection" kernel in the Wikipedia entry you linked to.

There is more than one edge kernel, and each has its place. The Laplacian, Sobel, Prewitt, Kirsch, and Roberts kernels give different results and are suited to different purposes.

How did people come up with these kernels (trial and error?)?

Kernels were developed by different people in a number of research fields.

Some kernels (as I recall) were designed specifically to model the process of "early vision." Early vision is not something that happens only to early people, or only to people who get up at 4 a.m.; it refers to the low-level processes of biological vision: sensing basic color, intensity, edges, and the like. At a very low level, edge detection in biological vision can be modeled with kernels.

Other kernels, such as the Laplacian and Gaussian, are approximations of mathematical functions. With a little effort you can derive these kernels yourself.

Image processing libraries and image editing software packages often allow you to define your own custom kernel. For example, if you want to detect a shape in an image that is small enough to be defined by a few connected pixels, you can define a kernel that matches the shape of the feature you want to detect. Using custom kernels to detect objects is too crude to work in most real-world applications, but sometimes there are reasons to create a special kernel for a very specific purpose, and a bit of trial and error may be needed to find a good one.
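
A toy illustration of that custom-kernel idea, essentially a crude matched filter (everything here, from the cross shape to the use of scipy.ndimage, is my own example):

```python
import numpy as np
from scipy import ndimage

# A small cross-shaped feature we want to find.
cross = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=float)

img = np.zeros((10, 10))
img[4:7, 6:9] = cross                  # plant one cross in the image

# Use the zero-mean shape itself as a custom kernel; the response peaks
# where the image matches the shape.
kernel = cross - cross.mean()
response = ndimage.correlate(img, kernel)
print(np.unravel_index(response.argmax(), response.shape))   # (5, 7), the cross center
```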

As the user templatetypedef pointed out, you can think about kernels intuitively, and with a bit of practice work out what each one will do.

Is it possible to prove that it will always work for all images?

Functionally, you can apply a 3x3, 5x5, or NxN kernel to an image of appropriate size and it will "work" in the sense that the operation completes and produces some result. But the mere ability to compute a result, useful or not, is not a great definition of "works."

A more informative definition of whether a kernel "works" is whether convolving an image with that kernel produces a result that you find useful. If you are manipulating images in Photoshop or GIMP and you find that a particular enhancement kernel does not give you what you need, then you could say that the kernel does not work in the context of your particular image and the end result you want. There is a similar issue in image processing for computer vision: we must choose one or more kernels and other (often non-kernel-based) algorithms that act in sequence to do something useful, such as identify faces, measure the speed of cars, or guide robots in assembly tasks.

Homework

If you want to understand how a mathematical concept can be turned into a kernel, it helps to derive a kernel yourself. Even if you already know what the final result should look like, deriving the kernel from a mathematical function independently, on paper and (preferably) from memory, helps the concepts of kernels and convolution sink in.

Try to derive a 3x3 Gaussian kernel from a mathematical function.

http://en.wikipedia.org/wiki/Gaussian_function

Deriving the kernel on your own, or at least finding an online tutorial and reading it carefully, will be very revealing. If you would rather not do the work, then you may not appreciate how a mathematical expression gets "translated" into a bunch of numbers in a 3x3 matrix. But that is okay! If you get a general sense of what a common kernel does, that is useful, and once you notice how two similar kernels produce slightly different results, you will have a good feel for them.
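
If you want to check your work afterwards, the sampling-and-normalizing step might look something like this (a sketch; the choice of sigma is mine, picked so the weights come out close to the familiar 1-2-4 pattern):

```python
import numpy as np

def gaussian_kernel(size=3, sigma=0.85):
    """Sample the 2D Gaussian on an integer grid and normalize to sum 1."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()

k = gaussian_kernel(3, sigma=0.85)
print(np.round(k, 3))
# Rescaling so the corner becomes 1 gives weights very close to the familiar
# [[1, 2, 1], [2, 4, 2], [1, 2, 1]] / 16 kernel.
print(np.round(k / k[0, 0], 2))
```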

+9

There is a lot of mathematical theory behind convolution, but the kernel examples you link to are easy to explain intuitively:

[ 0  0  0]
[ 0  1  0]
[ 0  0  0]

This says: take the original pixel and nothing else, so it simply reproduces the original image.

[-1 -1 -1]
[-1  8 -1]
[-1 -1 -1]

This says: take eight times the original pixel and subtract its eight neighbors. First consider what happens in a smooth part of the image, where there is solid, unchanging color. Eight times the original pixel equals the sum of its eight identical neighbors, so the difference is zero and smooth parts of the image come out black. Parts of the image where something is changing, however, do not come out black. So this kernel highlights changes, which means it highlights the places where one shape ends and another begins: the edges of the objects in the image.

[ 0  1  0]
[ 1 -4  1]
[ 0  1  0]

This is similar to the one above, just weighted differently.

[ 0 -1  0]
[-1  5 -1]
[ 0 -1  0]

Note that this is just the negation of the previous edge detector plus the first filter we saw, the one that reproduces the original image. So this kernel picks out the edges and adds them back to the original image. The result is the original image with more pronounced edges: a sharpening effect.

[ 1  2  1]
[ 2  4  2]
[ 1  2  1]

[ 1  1  1]
[ 1  1  1]
[ 1  1  1]

Both of these mix the original pixel with its neighbors, so they blur the image a bit.
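
If you want to play with these, here is a quick sketch using scipy.ndimage (my choice of library; note that in practice the two blur kernels are normalized, dividing by 16 and 9 respectively, so the image does not get brighter overall):

```python
import numpy as np
from scipy import ndimage

img = np.random.default_rng(0).random((64, 64)) * 255

identity  = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
edges     = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
sharpen   = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
blur      = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0

for name, k in [("identity", identity), ("edges", edges),
                ("laplacian", laplacian), ("sharpen", sharpen), ("blur", blur)]:
    out = ndimage.convolve(img, k, mode="reflect")
    print(name, round(out.min(), 1), round(out.max(), 1))
```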

+7

There are two ways of thinking about (or encoding) an image: the spatial domain and the frequency domain. The spatial representation is pixel-based, so it is more familiar and easier to obtain. Both the image and the kernel are expressed in the spatial domain.

To get into the frequency domain, you need to use a Fourier transform or one of its relatives, which is computationally expensive. Once you are there, though, many interesting manipulations become simpler. To blur an image, you can simply chop off some of the high-frequency content, much as you would crop an image in the spatial domain. Sharpening is the opposite, something like boosting the contrast of the high-frequency information.

Most of the information in an image is in the high frequencies, which represent detail, and most of the interesting detail is at a small, local scale. You can do a lot just by looking at neighboring pixels. Blurring is basically a weighted average of neighboring pixels. Sharpening amounts to looking at the difference between a pixel and its neighbors and enhancing that contrast.

A kernel is usually obtained by going through the frequency-domain transform, keeping only the high-frequency part, and expressing the result back in the spatial domain. This can only be done for certain transform algorithms. You can work out the ideal kernel for blurring, sharpening, selecting certain kinds of lines, and so on, and it will behave intuitively, but otherwise it can look like magic, because we do not really think in "pixel arithmetic".

Once you have a kernel, of course, there is no need to get into the frequency domain at all. That hard work is finished, both conceptually and computationally. Convolution is quite friendly to all involved, and you can rarely simplify it any further. Smaller kernels are friendlier, of course. Sometimes a large kernel can be expressed as a convolution of small sub-kernels, which is a kind of factoring in both the mathematical and programming senses.
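
For example, the 3x3 blur kernel from the earlier answer factors into a 1x3 pass followed by a 3x1 pass. A sketch of that factoring (the specific kernel and library calls are my own choices):

```python
import numpy as np
from scipy import ndimage

img = np.random.default_rng(1).random((32, 32))

row = np.array([[1.0, 2.0, 1.0]]) / 4.0       # 1x3 sub-kernel
col = row.T                                    # 3x1 sub-kernel

# The normalized 3x3 blur is the outer product of the two sub-kernels.
full = ndimage.convolve(img, col @ row)                         # one 2D pass
separable = ndimage.convolve(ndimage.convolve(img, row), col)   # two 1D passes

# Compare away from the borders, where boundary handling cannot interfere.
print(np.allclose(full[1:-1, 1:-1], separable[1:-1, 1:-1]))     # True
```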

The mathematics is quite simple and was studied long before computers appeared. The most common manipulations can be done mechanically on an optical bench with 18th-century equipment.

+5

I think the best way to explain kernels is to start in 1D and discuss the z-transform and its inverse. This switches you from the time domain to the frequency domain: from describing a wave as a time-ordered sequence of samples to describing it as the amplitude of each frequency that contributes to it. The two representations contain the same information; they just express it differently.

Now suppose you have a wave described in the frequency domain and you want to apply a filter to it. You might want to remove the high frequencies; that would be a blur. You might want to remove the low frequencies; that would be sharpening or, in the extreme case, edge detection.

You could do that simply by forcing the frequencies you do not want to 0, for example by multiplying the whole range by a mask in which 1 marks a frequency you want to keep and 0 marks a frequency you want to eliminate.

But what if you want to do this in the time domain? You could transform into the frequency domain, apply the mask, and then transform back, but that is a lot of work. So instead you (approximately) convert the mask itself from the frequency domain to the time domain. Then you can apply it directly in the time domain.

If you follow the math involved in the forward and inverse transforms, applying the filter this way theoretically requires making each output sample a weighted sum of every single input sample. In the real world you compromise: you use a weighted sum of, say, 9 samples. That gives you lower latency and lower processing cost than using, say, 99 samples, but it also gives you a less accurate filter.
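
A rough 1D sketch of that compromise in NumPy (the cutoff frequency, the 9-tap length, and the lack of any windowing are all simplifications of mine):

```python
import numpy as np

N = 256
# Ideal low-pass mask in the frequency domain: keep low frequencies, zero the rest.
freqs = np.fft.fftfreq(N)
mask = (np.abs(freqs) < 0.1).astype(float)

# Convert the mask to the time domain: this is the "ideal" filter kernel...
ideal = np.fft.fftshift(np.real(np.fft.ifft(mask)))

# ...then keep only 9 samples around the center as a cheap approximation.
center = N // 2
taps = ideal[center - 4:center + 5]
taps /= taps.sum()                     # normalize so flat input passes unchanged
print(np.round(taps, 4))

# Applying the filter is now just a 9-sample weighted sum around each sample.
signal = np.sin(2 * np.pi * 0.02 * np.arange(N))
noisy = signal + 0.5 * np.random.default_rng(2).standard_normal(N)
smoothed = np.convolve(noisy, taps, mode="same")
# The smoothed version should track the clean sine more closely than the noisy one.
print(np.abs(noisy - signal).mean(), np.abs(smoothed - signal).mean())
```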

Image kernels are the 2D analogue of this line of thought. They tend to be small because processing cost grows with the square of the kernel's edge length, so large kernels get very expensive. But you can approximate any filter that can be described as a mask in the frequency domain.

+2

Source: https://habr.com/ru/post/954463/

