Well, in principle, np.bincount works with 1D arrays. But we must use it on each line iteratively (thinking about it is simple). To make it vectorized, we could compensate each row with this maximum number. The idea is to have different bins for each row so that they are not affected by other elements of the row with the same numbers.
Therefore, the implementation will be -
Run Example -
In [189]: a Out[189]: array([[1, 1, 0, 4], [2, 4, 2, 1], [1, 2, 3, 5], [4, 4, 4, 1]]) In [190]: bincount2D_vectorized(a) Out[190]: array([[1, 2, 0, 0, 1, 0], [0, 1, 2, 0, 1, 0], [0, 1, 1, 1, 0, 1], [0, 1, 0, 0, 3, 0]])
Numba tweaks
We can introduce numba to speed numba up. Now numba allows several settings.
First, it allows JIT compilation.
In addition, they recently introduced an experimental parallel , which automatically parallelizes operations in a function known as parallel semantics.
The final setup will be to use prange as a substitution for range . The docs claim this works parallel to loops, similar to OpenMP parallel for loops and Cythons prange. prange works well with larger datasets, which is probably due to the overhead required to set up parallel operation.
So, with these two new settings, along with njit for non-Python mode, we will have three options -
For completeness and testing later, the loopback version will be -
Runtime test
Case No. 1:
In [312]: a = np.random.randint(0,100,(100,100)) In [313]: %timeit bincount2D_loopy(a) ...: %timeit bincount2D_vectorized(a) ...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True) 10000 loops, best of 3: 115 µs per loop 10000 loops, best of 3: 36.7 µs per loop 10000 loops, best of 3: 22.6 µs per loop 10000 loops, best of 3: 22.7 µs per loop 10000 loops, best of 3: 39.9 µs per loop
Case No. 2:
In [316]: a = np.random.randint(0,100,(1000,1000)) In [317]: %timeit bincount2D_loopy(a) ...: %timeit bincount2D_vectorized(a) ...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True) 100 loops, best of 3: 2.97 ms per loop 100 loops, best of 3: 3.54 ms per loop 1000 loops, best of 3: 1.83 ms per loop 100 loops, best of 3: 1.78 ms per loop 1000 loops, best of 3: 1.4 ms per loop
Case No. 3:
In [318]: a = np.random.randint(0,1000,(1000,1000)) In [319]: %timeit bincount2D_loopy(a) ...: %timeit bincount2D_vectorized(a) ...: %timeit bincount2D_numba(a, use_parallel=False, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=False) ...: %timeit bincount2D_numba(a, use_parallel=True, use_prange=True) 100 loops, best of 3: 4.01 ms per loop 100 loops, best of 3: 4.86 ms per loop 100 loops, best of 3: 3.21 ms per loop 100 loops, best of 3: 3.18 ms per loop 100 loops, best of 3: 2.45 ms per loop
numba options numba work very well. The choice of one of the three options will depend on the shape parameters of the input array and to some extent on the number of unique elements in it.