Thanks for your quantum updates, Daniel.
The following lines may be difficult to digest, but kindly believe me, there are many more things to consider. I have worked on HPC / parallel-computing problems with matrices on the ~ N [TB], N > 10 scale and with their sparse representations, so some fragments of that experience may be useful for your further considerations.
WARNING: Do not expect any free lunch to be served here
The desire to parallelise a piece of code sounds more and more like a contemporary, oft-repeated mantra. The problem is not the code, but the cost of such a move.
Economics is the number one problem. Amdahl's law, as originally formulated by Gene Amdahl, did not take into account the costs of the [PAR]-process setups + [PAR]-process terminations that really have to be paid in every real-world implementation.
The overhead-strict re-formulation of Amdahl's Law shows the scale of these adverse side-effects and helps to evaluate a few new aspects that have to be assessed before one opts to introduce parallelisation (at an acceptable cost of doing so, since it is really, really VERY EASY to pay MORE than one can ever get back, and the naive disappointment over degraded processing performance is the easier part of that story).
Feel free to read more posts on the overhead-strict re-formulation of Amdahl's Law if you want to understand this topic better and to pre-compute the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]-overheads will at least get justified by a split of the sub-problem across N_truly_[PAR]_processes (not "just"-[CONCURRENT], but true-[PARALLEL], which is not the same).
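To make that concrete, here is a minimal numeric sketch of such an overhead-strict speedup estimate. It assumes a simple additive model, where each of the N processes pays a setup plus a termination overhead; the function name and all parameter values below are illustrative only, not measurements of any real system.

    # overhead-strict Amdahl's-Law speedup estimate ( illustrative sketch )
    def overhead_strict_speedup( T_seq,         # [s] duration of the original, purely-[SEQ] run
                                 p,             # [-] fraction of T_seq that can run in [PAR]
                                 N,             # [-] number of truly-[PAR] processes
                                 T_setup,       # [s] per-process [PAR]-setup overhead
                                 T_terminate ): # [s] per-process [PAR]-termination overhead
        T_par = ( ( 1.0 - p ) * T_seq           # the [SEQ]-part stays serial
                +         p   * T_seq / N       # the [PAR]-part, ideally split across N
                + N * ( T_setup + T_terminate ) # the add-on costs nobody waives
                  )
        return T_seq / T_par

    for N in ( 2, 4, 8, 16, 32 ):               # 90% [PAR]-able work, yet overheads eat the gain
        print( N, round( overhead_strict_speedup( T_seq = 10., p = 0.9, N = N,
                                                  T_setup = .25, T_terminate = .25 ), 2 ) )

Note how, under these assumed overheads, the speedup peaks early and then turns into a slowdown, which is exactly what the overhead-strict formulation is meant to warn about.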
Python can get a dose of steroids for increased performance:
Python is a great prototyping ecosystem, whereas numba, numpy and other compiled extensions help to lift performance well above what the native, GIL-stepped python (co-)processing would otherwise deliver.
Here, you are trying to get numba.jit() to arrange the work almost for free, merely by its automated jit()-time lexical analyser (that you throw your code at), which ought to both "understand" your global goal (the What), and also propose some vectorisation tricks (the How, i.e. the best way to assemble a heap of CPU instructions for maximum efficiency of that code execution).
It sounds simple, but it is not.
Travis Oliphant's team has made tremendous progress on the numba tools, but let's be realistic and fair enough not to expect any form of automated wizardry to appear inside the .jit()-lexer + code-analysis, when it tries to transform the code and assemble a more efficient stream of machine instructions towards the high-level goal.
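For the sake of completeness, here is a minimal sketch (illustrative only, not your code) of what that hand-over to the jit-compiler looks like in practice: numba compiles the loops exactly as you wrote them, and the very first call additionally pays the lazy compilation cost, so only the subsequent calls show the compiled-path performance.

    import time
    import numpy as np
    import numba

    @numba.njit                       # lazy compilation: types get inferred on the 1st call
    def row_sums( A ):
        M, N = A.shape
        out  = np.empty( M, dtype = np.float64 )
        for i in range( M ):
            s = 0.
            for j in range( N ):
                s += A[i, j]
            out[i] = s
        return out

    A = np.random.rand( 2048, 2048 )

    t0 = time.perf_counter(); row_sums( A ); t1 = time.perf_counter()   # incl. jit-compile
    t2 = time.perf_counter(); row_sums( A ); t3 = time.perf_counter()   # compiled path only
    print( "1st call ( incl. jit-compile ):  %.3f [s]"  % ( t1 - t0 ) )
    print( "2nd call ( compiled path only ): %.6f [s]"  % ( t3 - t2 ) )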
@guvectorize ? Here? Really?
Given the [PSPACE] scales, you may straight away forget about asking numba to somehow efficiently "feed" the GPU with data whose memory footprint goes far beyond the GPU-GDDR capacities (not to mention the way too "shallow" GPU-kernel sizes for such mathematically "thin" processing, which just multiplies, perhaps in [PAR], but later sums up in [SEQ]).
(Re-)loading a GPU with data takes a lot of time. Having paid that, the in-GPU memory latencies are not very friendly to the "tiny"-maths GPU-kernels either: your GPU-SMX code will have to pay ~ 350-700 [ns] just to fetch a number (most probably not automatically re-aligned for best coalesced SM-cache re-use in the following steps, and you may notice that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will deliver nothing under those 350~700 [ns] per matrix cell), whereas a smart, pure numpy-vectorised code can process a matrix-vector product in less than 1 [ns] per cell, even on the largest [PSPACE]-footprints.
That is the yardstick to compare against.
(Profiling would show the hard facts better here, but the principle is well known in advance, without having to first test how to move several TB of data onto the GPU fabric just to find this out on one's own.)
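If it helps, here is a back-of-the-envelope sketch of that comparison. The numpy per-cell figure is measured on the host, while the transfer estimate relies on an assumed effective host-to-GPU PCIe bandwidth, so treat that second number as an order-of-magnitude illustration only.

    import time
    import numpy as np

    # measured on the host: per-cell cost of a dense numpy matrix-vector product
    M = N = 8192
    A = np.random.rand( M, N )
    x = np.random.rand( N )

    t0 = time.perf_counter()
    Ax = A.dot( x )
    t1 = time.perf_counter()
    print( "numpy matvec: ~ %.2f [ns] per matrix cell" % ( ( t1 - t0 ) * 1E9 / ( M * N ) ) )

    # assumed, not measured: host-to-GPU transfer over PCIe
    PCIe_GB_per_s = 12.0                                   # ASSUMPTION: effective H2D bandwidth
    matrix_TB     = 10.0                                   # the ~ N [TB], N > 10 scale from above
    transfer_s    = matrix_TB * 1E12 / ( PCIe_GB_per_s * 1E9 )
    print( "just shipping %.0f [TB] onto the GPU: ~ %.0f [s] ( ~ %.1f [min] )"
           % ( matrix_TB, transfer_s, transfer_s / 60 ) )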
The worst of the bad news:
Given the memory scale of matrix A, the worst effect to expect is that a sparse-storage arrangement of the matrix representation will most likely devastate most, if not all, of the possible performance gains achievable by numba-vectorised tricks on dense matrix representations, since there will be a near-zero chance of efficient re-use of cached memory, and the indexing indirections break any easy way to achieve compact, aligned mappings of the vectorised operations, so these are unlikely to remain able to map easily onto the CPU's advanced vector-processing (SIMD) resources.
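As a toy-scale illustration of why (sizes here are tiny on purpose, only to show the mechanism, not your [TB]-scale case): a CSR matrix-vector product walks indirect index arrays per row, so the contiguous, SIMD-friendly access patterns that dense-array vectorisation relies on are mostly gone, which is the price paid for the much smaller memory footprint.

    import numpy as np
    import scipy.sparse as sp

    M = N   = 10000
    density = 1E-3                               # ~ 0.1% non-zeros, toy-scale only

    A_sparse = sp.random( M, N, density = density, format = "csr", dtype = np.float64 )
    x        = np.random.rand( N )

    # CSR matvec: per-row gather through A_sparse.indices, no contiguous cell re-use
    Ax = A_sparse.dot( x )

    print( "dense footprint would be ~ %.1f [GB]" % ( M * N * 8 / 1E9 ) )
    print( "CSR   footprint is  only ~ %.1f [MB]" % ( ( A_sparse.data.nbytes
                                                      + A_sparse.indices.nbytes
                                                      + A_sparse.indptr.nbytes ) / 1E6 ) )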
A list of things that do help:
- it is always better to pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid paying repeated [PTIME,PSPACE]-costs for creating (yet again) new memory allocations (the more so if the vector is suspect of being re-used inside an externally orchestrated iterative optimisation process); a usage sketch follows after the decorator example below
- it is always better to specify (so as to narrow down the universality, for the sake of the code performance), at the very least, the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" ) calling-interface directives
- always review all the available numba.jit() options and their respective default values (they may change from version to version) for your specific situation (disabling the GIL and better aligning the goals with the numba + hardware capabilities will always help in the numerically intensive parts of the code)
    @jit( [ numba.float32( numba.float32, numba.int32 ),   #________________ [_v41] @decorator with a list of calling-signatures
            numba.float64( numba.float64, numba.int64 )    #                 for prepared alternative code-paths,
            ],                        #______________________ to avoid a deferred lazy-compilation if left undefined
          nopython = False,           #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
          nogil    = False,           #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
          cache    = False,           #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
          forceobj = False,           #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
          locals   = {}               #__________________ a mapping of local variable names to Numba Types.
          )                           #____________________ [_v41] ZERO: TEST *ALL* CALLED sub-func()-s to get @jit()-ed too >>> [DONE]
    def r...( ... ):
        ...
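And, as referenced in the list above, a usage sketch (illustrative only, not your code) tying those points together: an explicitly narrowed call-signature, nopython + nogil, and a caller-side pre-allocated Ax, so no new allocation gets paid for inside the hot loop.

    import numpy as np
    import numba

    @numba.jit( "f8[:]( f4[:,:], f4[:], f8[:] )",   # narrowed signature, eagerly compiled
                nopython = True,                     # fail loudly instead of a silent object-mode fallback
                nogil    = True,                     # release the GIL inside the compiled kernel
                cache    = True )                    # re-use the compiled artefact across invocations
    def matvec_into( A, x, Ax ):
        M, N = A.shape
        for i in range( M ):
            s = 0.
            for j in range( N ):
                s += A[i, j] * x[j]
            Ax[i] = s
        return Ax

    A  = np.random.rand( 4096, 4096 ).astype( np.float32 )
    x  = np.random.rand( 4096 ).astype( np.float32 )
    Ax = np.zeros_like( A[:, 0], dtype = np.float64 )   # allocated once, widened to match the f8[:] result

    for _ in range( 10 ):                                # e.g. an outer, iterative optimiser loop
        matvec_into( A, x, Ax )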