Optimal (broadcast) matrix division in numpy. Avoid temporary arrays or not?

Numpy allows you to add / multiply / divide matrices of different sizes, subject to certain broadcasting rules. In addition, creating temporary arrays is a major speed obstacle in numpy.
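For reference, a minimal illustration of the broadcasting being relied on here: dividing a (3, 2) matrix by a (3, 1) column stretches the column across the second axis, with no explicit tiling.

```python
import numpy as np

# A (3, 2) matrix divided by a (3, 1) column: under numpy's broadcasting
# rules the single column is "stretched" across the second axis.
mat = np.array([[2.0, 4.0],
                [6.0, 8.0],
                [10.0, 12.0]])
col = np.array([[2.0], [2.0], [4.0]])

result = mat / col   # shape (3, 2)
print(result)        # [[1.  2. ] [3.  4. ] [2.5 3. ]]
```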

The following timing results surprise me... what is going on?

    In [41]: def f_no_dot(mat, arr):
       ....:     return mat/arr

    In [42]: def f_dot(mat, arr):
       ....:     denominator = scipy.dot(arr, scipy.ones((1,2)))
       ....:     return mat/denominator

    In [43]: mat = scipy.rand(360000, 2)

    In [44]: arr = scipy.rand(360000, 1)

    In [45]: timeit temp = f_no_dot(mat, arr)
    10 loops, best of 3: 44.7 ms per loop

    In [46]: timeit temp = f_dot(mat, arr)
    100 loops, best of 3: 10.1 ms per loop

I thought f_dot would be slower, since it had to create a temporary array for the denominator, and I assumed that f_no_dot skipped this step. I should note that these times scale linearly (with array length, up to 1 billion) for f_no_dot, and slightly worse than linearly for f_dot.

2 answers

What you are seeing is most likely iteration overhead over the small (2,) trailing dimension. Numpy (versions < 1.6) dealt efficiently only with operations involving contiguous arrays of the same shape. The effect disappears as the size of the last dimension grows.

To see the contiguity effect:

    In [1]: import numpy

    In [2]: numpy.__version__
    Out[2]: '1.5.1'

    In [3]: arr_cont1 = numpy.random.rand(360000, 2)

    In [4]: arr_cont2 = numpy.random.rand(360000, 2)

    In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]

    In [6]: arr_bcast = numpy.random.rand(360000, 1)

    In [7]: %timeit arr_cont1 / arr_cont2
    100 loops, best of 3: 5.75 ms per loop

    In [8]: %timeit arr_noncont / arr_cont2
    10 loops, best of 3: 54.4 ms per loop

    In [9]: %timeit arr_bcast / arr_cont2
    10 loops, best of 3: 55.2 ms per loop
The situation, however, has improved considerably in Numpy >= 1.6.0:
    In [1]: import numpy

    In [2]: numpy.__version__
    Out[2]: '1.6.1'

    In [3]: arr_cont1 = numpy.random.rand(360000, 2)

    In [4]: arr_cont2 = numpy.random.rand(360000, 2)

    In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]

    In [6]: arr_bcast = numpy.random.rand(360000, 1)

    In [7]: %timeit arr_cont1 / arr_cont2
    100 loops, best of 3: 5.37 ms per loop

    In [8]: %timeit arr_noncont / arr_cont2
    100 loops, best of 3: 6.12 ms per loop

    In [9]: %timeit arr_bcast / arr_cont2
    100 loops, best of 3: 7.81 ms per loop
(All the timings above are probably accurate only to within about 1 ms.)

Note also that creating a temporary copy is not that expensive:

    In [82]: %timeit arr_cont1.copy()
    1000 loops, best of 3: 778 us per loop

EDIT: Note also in the above that arr_noncont is sort of contiguous, with a constant stride of 2*itemsize, so the inner loop can be unrolled and Numpy can process it about as fast as a contiguous array. With broadcasting (or with a really non-contiguous array, such as numpy.random.rand(360000*2, 2)[::2,:]), the inner loop cannot be unrolled, and those cases are slightly slower. Improving this would still be possible if Numpy emitted tailored machine code on the fly for each loop, but it doesn't do that (at least for now :)
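The distinction above can be seen directly in the arrays' strides (assuming 8-byte float64 elements):

```python
import numpy as np

itemsize = np.dtype(np.float64).itemsize   # 8 bytes

# "Sort of contiguous": elements within a row are 16 bytes apart, and the
# gap from a row's last element to the next row's first is also 16 bytes,
# so the whole array can be walked with one constant-stride inner loop.
a = np.random.rand(1000, 4)[:, ::2]
print(a.strides)   # (32, 16): 16 == 2 * itemsize

# Truly non-contiguous: elements within a row are 8 bytes apart, but there
# is a 24-byte gap between rows, so no single stride covers the whole walk.
b = np.random.rand(2000, 2)[::2, :]
print(b.strides)   # (32, 8)
```

Neither view is C-contiguous, but only the first admits the unrolled single-stride traversal described above.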


I thought f_dot would be slower since it had to create a temporary array for the denominator, and I assumed that f_no_dot skipped this step.

For what it's worth, no temporary array is created, and that is exactly why f_no_dot is slower (though it uses less memory).

Element-wise operations on arrays of the same size are faster, because numpy does not need to worry about the strides (dimensions, size, etc.) of the arrays.

Operations that use broadcasting will generally be slightly slower than operations that do not need it.

If you have memory to spare, creating a temporary copy can give you a speedup, but it will use more memory.

For example, comparing these three functions:

    import numpy as np
    import timeit

    def f_no_dot(x, y):
        return x / y

    def f_dot(x, y):
        denom = np.dot(y, np.ones((1,2)))
        return x / denom

    def f_in_place(x, y):
        x /= y
        return x

    num = 3600000
    x = np.ones((num, 2))
    y = np.ones((num, 1))

    for func in ['f_dot', 'f_no_dot', 'f_in_place']:
        t = timeit.timeit('%s(x,y)' % func, number=100,
                setup='from __main__ import x, y, f_dot, f_no_dot, f_in_place')
        print func, 'time...'
        print t / 100.0

This gives timings similar to your results:

    f_dot time...
    0.184361531734
    f_no_dot time...
    0.619203259945
    f_in_place time...
    0.585789341927
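For what it's worth, on newer numpy versions (1.10+) the same explicit-temporary trick can be written without the dot product, using np.broadcast_to; a minimal sketch under that assumption:

```python
import numpy as np

def f_broadcast_to(x, y):
    # np.broadcast_to returns a read-only *view* of y with x's shape;
    # np.ascontiguousarray then materializes the contiguous temporary,
    # playing the same role as the np.dot(y, np.ones((1, 2))) trick.
    denom = np.ascontiguousarray(np.broadcast_to(y, x.shape))
    return x / denom

x = np.full((1000, 2), 6.0)
y = np.full((1000, 1), 3.0)
out = f_broadcast_to(x, y)
print(out.shape)   # (1000, 2)
```

This makes the intent (materialize the broadcast, then divide same-shape arrays) more explicit than the matrix-product workaround.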

However, if we compare memory usage, things get a little clearer ...

The combined size of the x and y arrays is about 27.5 + 55 MB, or 82 MB (for 64-bit floats). There is an additional ~11 MB of overhead from importing numpy, etc.

Returning x / y as a new array (i.e. not doing x /= y) requires another 55 MB array.
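Those size figures can be checked directly from the arrays' nbytes attribute:

```python
import numpy as np

num = 3600000
x = np.ones((num, 2))   # float64, 8 bytes per element
y = np.ones((num, 1))

mb = 1024 ** 2
print(round(x.nbytes / mb, 1))         # 54.9 -> the "55 MB" figure
print(round(y.nbytes / mb, 1))         # 27.5
print(round((x / y).nbytes / mb, 1))   # 54.9 -> the result is another ~55 MB
```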

100 runs of f_dot: We are creating a temporary array here, so we expect to see 11 + 82 + 55 + 55 MB, or ~203 MB of memory usage. And that is what we see... [memory usage plot]

100 runs of f_no_dot: If no temporary array is created, we expect peak memory usage of 11 + 82 + 55 MB, or 148 MB...
[memory usage plot] ...which is exactly what we see.

So x / y does not create an extra num x 2 temporary array for the division.

Thus, the division takes a bit longer than it would if it were operating on two arrays of the same shape.

100 runs of f_in_place: If we can modify x in place, we can save even more memory, if that is the main concern. [memory usage plot]

Basically, numpy tries to conserve memory at the expense of speed in some cases.


Source: https://habr.com/ru/post/899305/
