A faster way to calculate the sum of the squared differences between the image (M, N) and the pattern (3, 3) for pattern matching?

Question

A faster way to calculate the sum of the squared differences between the image (M, N) and the pattern (3, 3) for pattern matching?

I implement a texture synthesis algorithm as described here . To do this, I need to calculate the Sum of Squared Differences, a metric for evaluating the error between the template and the various positions in the image . I am performing a slow working implementation as follows:

 total_weight = valid_mask.sum() for i in xrange(input_image.shape[0]): for j in xrange(input_image.shape[1]): sample = image[i:i + window, j:j + window] dist = (template - sample) ** 2 ssd[i, j] = (dist * valid_mask).sum() / total_weight

Here total_weight only for normalization. Some pixels have an unknown intensity, so I use valid_mask to mask them. This nested loop lies inside 2 loops, so there are 4 nested loops, which are obviously performance killers!

Is there a way to make this faster in NumPy or Python by replacing this nested loop? Is an attachment possible? I need to work with the (3, 3) part of image using (3, 3) template .

Subsequently, I will implement this in Cython, so the faster I can get it to work using only NumPy, the better.

Here you can find the full code here . Line 62 - 67 is shown here.

Thanks,
Chintak

+6

python vectorization numpy scipy image-processing

Chintak Jul 26 '13 at 12:36

source share

4 answers

You can do some awesome things with the as_strided function in combination with numpy cast. Here are two versions of your function:

 import numpy as np from numpy.lib.stride_tricks import as_strided def sumsqdiff(input_image, template, valid_mask=None): if valid_mask is None: valid_mask = np.ones_like(template) total_weight = valid_mask.sum() window_size = template.shape ssd = np.empty((input_image.shape[0] - window_size[0] + 1, input_image.shape[1] - window_size[1] + 1)) for i in xrange(ssd.shape[0]): for j in xrange(ssd.shape[1]): sample = input_image[i:i + window_size[0], j:j + window_size[1]] dist = (template - sample) ** 2 ssd[i, j] = (dist * valid_mask).sum() return ssd def sumsqdiff2(input_image, template, valid_mask=None): if valid_mask is None: valid_mask = np.ones_like(template) total_weight = valid_mask.sum() window_size = template.shape # Create a 4-D array y, such that y[i,j,:,:] is the 2-D window # input_image[i:i+window_size[0], j:j+window_size[1]] y = as_strided(input_image, shape=(input_image.shape[0] - window_size[0] + 1, input_image.shape[1] - window_size[1] + 1,) + window_size, strides=input_image.strides * 2) # Compute the sum of squared differences using broadcasting. ssd = ((y - template) ** 2 * valid_mask).sum(axis=-1).sum(axis=-1) return ssd

This is where the ipython session is executed for comparison.

The template that I will use for demonstration:

 In [72]: template Out[72]: array([[-1, 1, -1], [ 1, 2, 1], [-1, 1, -1]])

A little input so that we can check the result:

 In [73]: x Out[73]: array([[ 0., 1., 2., 3., 4., 5., 6.], [ 7., 8., 9., 10., 11., 12., 13.], [ 14., 15., 16., 17., 18., 19., 20.], [ 21., 22., 23., 24., 25., 26., 27.], [ 28., 29., 30., 31., 32., 33., 34.]])

Apply two functions to x and make sure we get the same result:

 In [74]: sumsqdiff(x, template) Out[74]: array([[ 856., 1005., 1172., 1357., 1560.], [ 2277., 2552., 2845., 3156., 3485.], [ 4580., 4981., 5400., 5837., 6292.]]) In [75]: sumsqdiff2(x, template) Out[75]: array([[ 856., 1005., 1172., 1357., 1560.], [ 2277., 2552., 2845., 3156., 3485.], [ 4580., 4981., 5400., 5837., 6292.]])

Now make a much larger “image” input:

 In [76]: z = np.random.randn(500, 500)

and check the performance:

 In [77]: %timeit sumsqdiff(z, template) 1 loops, best of 3: 3.55 s per loop In [78]: %timeit sumsqdiff2(z, template) 10 loops, best of 3: 33 ms per loop

Not too shabby. :)

Two disadvantages:

The calculation in sumsqdiff2 will generate a temporary array, which for a 3x3 template will be 9 times the size of input_image . (In general, this will be template.size times larger than input_image .)
These walking tricks will not help you when you execute the Cythonize code. When you go into Cython, you often return to loops that you free when you vectorize with numpy.

+6

Warren weckesser Jul 26 '13 at 13:51

source share

I think you performed your algorithm well. Vectorization is an option, but I recommend that you use the Numba optimizing compiler, which compiles Python syntax into machine code. The Numba effect is impressive. Numba vs. Cython: Take 2 is a very brief introduction to Numba and performance comparison.

0

Amir Jul 26 '13 at 12:59

source share

You might want to check how it works if you change the algorithm to perform series calculations. The idea is that you can use the processor cache better if you read memory sequentially.

Pseudocode:

 for template_row in template: for row in image: for col in image: # find distance template_row to sample_row # add sum to ssd[row - template_row, col]

Actual code (after Warren):

 def sumsqdiffr(input_image, template, valid_mask=None): if valid_mask is None: valid_mask = np.ones_like(template) total_weight = valid_mask.sum() window_size = template.shape ssd = np.zeros((input_image.shape[0] - window_size[0] + 1, input_image.shape[1] - window_size[1] + 1)) for tr in xrange(template.shape[0]): for i in xrange(tr, ssd.shape[0] + tr): for j in xrange(ssd.shape[1]): sample = input_image[i, j:j + window_size[1]] dist = (template[tr] - sample) ** 2 ssd[i - tr, j] += (dist * valid_mask[tr]).sum() return ssd

This is more than twice as slow as the original implementation.

(If someone would like to tell me if the whole idea was wrong or what caused it, I would be happy to get some understanding in it)

0

wm Jul 26 '13 at 13:31

source share

Jaime · Accepted Answer · 2013-07-26T16:01:55+0000

This is basically an improvement over Warren Uckenser's answer. Obviously, a path with a multidimensional window view of the original array, but you want this view to not start a copy. If you expand sum((ab)**2) , you can turn it into sum(a**2) + sum(b**2) - 2*sum(a*b) , and this is multiply-then- reduce-with-sum, which you can perform with linear algebra operators, with a significant improvement in both performance and memory usage:

 def sumsqdiff3(input_image, template): window_size = template.shape y = as_strided(input_image, shape=(input_image.shape[0] - window_size[0] + 1, input_image.shape[1] - window_size[1] + 1,) + window_size, strides=input_image.strides * 2) ssd = np.einsum('ijkl,kl->ij', y, template) ssd *= - 2 ssd += np.einsum('ijkl, ijkl->ij', y, y) ssd += np.einsum('ij, ij', template, template) return ssd In [288]: img = np.random.rand(500, 500) In [289]: template = np.random.rand(3, 3) In [290]: %timeit a = sumsqdiff2(img, template) # Warren function 10 loops, best of 3: 59.4 ms per loop In [291]: %timeit b = sumsqdiff3(img, template) 100 loops, best of 3: 18.2 ms per loop In [292]: np.allclose(a, b) Out[292]: True

I specifically selected the valid_mask parameter because I don’t quite understand how you use it. Basically, just zeroing out the corresponding values in template and / or input_image should do the same trick.

A faster way to calculate the sum of the squared differences between the image (M, N) and the pattern (3, 3) for pattern matching?

More articles: