Thanks for your quantum updates, Daniel.
The following lines may be difficult to digest, but kindly believe me, there are many more things to consider. I have worked on HPC / parallel-computing problems with matrices on the ~ N [TB], N > 10 scale and with their sparse representations, so some fragments of that experience may be useful for your further considerations.
WARNING: Do not expect any free lunch to be served here
The desire to parallelise a piece of code sounds more and more like a contemporary, oft-repeated mantra. The problem is not the code, but the cost of such a move.
Economics is the number one problem. Amdahl's law, as originally formulated by Gene Amdahl, did not take into account the costs of the [PAR]-process setups + [PAR]-process terminations that really have to be paid in every real-world implementation.
The overhead-strict re-formulation of Amdahl's Law shows the scale of these adverse side-effects and helps to evaluate a few new aspects that have to be assessed before one opts to introduce parallelisation (at an acceptable cost of doing so, since it is really, really VERY EASY to pay MORE than one can ever get back, and the naive disappointment over degraded processing performance is the easier part of that story).
Feel free to read more posts on the overhead-strict re-formulation of Amdahl's Law if you want to understand this topic better and to pre-compute the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]-overheads will at least get justified by a split of the sub-problem across N_truly_[PAR]_processes (not "just"-[CONCURRENT], but true-[PARALLEL], which is not the same).
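To make that concrete, here is a minimal numeric sketch of such an overhead-strict speedup estimate. It assumes a simple additive model, where each of the N processes pays a setup plus a termination overhead; the function name and all parameter values below are illustrative only, not measurements of any real system.

    # overhead-strict Amdahl's-Law speedup estimate ( illustrative sketch )
    def overhead_strict_speedup( T_seq,         # [s] duration of the original, purely-[SEQ] run
                                 p,             # [-] fraction of T_seq that can run in [PAR]
                                 N,             # [-] number of truly-[PAR] processes
                                 T_setup,       # [s] per-process [PAR]-setup overhead
                                 T_terminate ): # [s] per-process [PAR]-termination overhead
        T_par = ( ( 1.0 - p ) * T_seq           # the [SEQ]-part stays serial
                +         p   * T_seq / N       # the [PAR]-part, ideally split across N
                + N * ( T_setup + T_terminate ) # the add-on costs nobody waives
                  )
        return T_seq / T_par

    for N in ( 2, 4, 8, 16, 32 ):               # 90% [PAR]-able work, yet overheads eat the gain
        print( N, round( overhead_strict_speedup( T_seq = 10., p = 0.9, N = N,
                                                  T_setup = .25, T_terminate = .25 ), 2 ) )

Note how, under these assumed overheads, the speedup peaks early and then turns into a slowdown, which is exactly what the overhead-strict formulation is meant to warn about.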
Python can get a dose of steroids for increased performance:
Python is a great prototyping ecosystem, whereas numba, numpy and other compiled extensions help to lift performance well above what the native, GIL-stepped python (co-)processing would otherwise deliver.
Here, you are trying to get numba.jit() to arrange the work almost for free, merely by its automated jit()-time lexical analyser (that you throw your code at), which ought to both "understand" your global goal (the What), and also propose some vectorisation tricks (the How, i.e. the best way to assemble a heap of CPU instructions for maximum efficiency of that code execution).
It sounds simple, but it is not.
Travis Oliphant's team has made tremendous progress on the numba tools, but let's be realistic and fair enough not to expect any form of automated wizardry to appear inside the .jit()-lexer + code-analysis, when it tries to transform the code and assemble a more efficient stream of machine instructions towards the high-level goal.
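For the sake of completeness, here is a minimal sketch (illustrative only, not your code) of what that hand-over to the jit-compiler looks like in practice: numba compiles the loops exactly as you wrote them, and the very first call additionally pays the lazy compilation cost, so only the subsequent calls show the compiled-path performance.

    import time
    import numpy as np
    import numba

    @numba.njit                       # lazy compilation: types get inferred on the 1st call
    def row_sums( A ):
        M, N = A.shape
        out  = np.empty( M, dtype = np.float64 )
        for i in range( M ):
            s = 0.
            for j in range( N ):
                s += A[i, j]
            out[i] = s
        return out

    A = np.random.rand( 2048, 2048 )

    t0 = time.perf_counter(); row_sums( A ); t1 = time.perf_counter()   # incl. jit-compile
    t2 = time.perf_counter(); row_sums( A ); t3 = time.perf_counter()   # compiled path only
    print( "1st call ( incl. jit-compile ):  %.3f [s]"  % ( t1 - t0 ) )
    print( "2nd call ( compiled path only ): %.6f [s]"  % ( t3 - t2 ) )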
@guvectorize ? Here? Really?
Given the [PSPACE] scales, you may straight away forget about asking numba to somehow efficiently "feed" the GPU with data whose memory footprint goes far beyond the GPU-GDDR capacities (not to mention the way too "shallow" GPU-kernel sizes for such mathematically "thin" processing, which just multiplies, perhaps in [PAR], but later sums up in [SEQ]).
(Re-)loading a GPU with data takes a lot of time. Having paid that, the in-GPU memory latencies are not very friendly to the "tiny"-maths GPU-kernels either: your GPU-SMX code will have to pay ~ 350-700 [ns] just to fetch a number (most probably not automatically re-aligned for best coalesced SM-cache re-use in the following steps, and you may notice that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will deliver nothing under those 350~700 [ns] per matrix cell), whereas a smart, pure numpy-vectorised code can process a matrix-vector product in less than 1 [ns] per cell, even on the largest [PSPACE]-footprints.
That is the yardstick to compare against.
(Profiling would show the hard facts better here, but the principle is well known in advance, without having to first test how to move several TB of data onto the GPU fabric just to find this out on one's own.)
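If it helps, here is a back-of-the-envelope sketch of that comparison. The numpy per-cell figure is measured on the host, while the transfer estimate relies on an assumed effective host-to-GPU PCIe bandwidth, so treat that second number as an order-of-magnitude illustration only.

    import time
    import numpy as np

    # measured on the host: per-cell cost of a dense numpy matrix-vector product
    M = N = 8192
    A = np.random.rand( M, N )
    x = np.random.rand( N )

    t0 = time.perf_counter()
    Ax = A.dot( x )
    t1 = time.perf_counter()
    print( "numpy matvec: ~ %.2f [ns] per matrix cell" % ( ( t1 - t0 ) * 1E9 / ( M * N ) ) )

    # assumed, not measured: host-to-GPU transfer over PCIe
    PCIe_GB_per_s = 12.0                                   # ASSUMPTION: effective H2D bandwidth
    matrix_TB     = 10.0                                   # the ~ N [TB], N > 10 scale from above
    transfer_s    = matrix_TB * 1E12 / ( PCIe_GB_per_s * 1E9 )
    print( "just shipping %.0f [TB] onto the GPU: ~ %.0f [s] ( ~ %.1f [min] )"
           % ( matrix_TB, transfer_s, transfer_s / 60 ) )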
The worst of the bad news:
Given the memory scale of matrix A, the worst effect to expect is that a sparse-storage arrangement of the matrix representation will most likely devastate most, if not all, of the possible performance gains achievable by numba-vectorised tricks on dense matrix representations, since there will be a near-zero chance of efficient re-use of cached memory, and the indexing indirections break any easy way to achieve compact, aligned mappings of the vectorised operations, so these are unlikely to remain able to map easily onto the CPU's advanced vector-processing (SIMD) resources.
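As a toy-scale illustration of why (sizes here are tiny on purpose, only to show the mechanism, not your [TB]-scale case): a CSR matrix-vector product walks indirect index arrays per row, so the contiguous, SIMD-friendly access patterns that dense-array vectorisation relies on are mostly gone, which is the price paid for the much smaller memory footprint.

    import numpy as np
    import scipy.sparse as sp

    M = N   = 10000
    density = 1E-3                               # ~ 0.1% non-zeros, toy-scale only

    A_sparse = sp.random( M, N, density = density, format = "csr", dtype = np.float64 )
    x        = np.random.rand( N )

    # CSR matvec: per-row gather through A_sparse.indices, no contiguous cell re-use
    Ax = A_sparse.dot( x )

    print( "dense footprint would be ~ %.1f [GB]" % ( M * N * 8 / 1E9 ) )
    print( "CSR   footprint is  only ~ %.1f [MB]" % ( ( A_sparse.data.nbytes
                                                      + A_sparse.indices.nbytes
                                                      + A_sparse.indptr.nbytes ) / 1E6 ) )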
A list of things that do help:
- it is always better to pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid paying repeated [PTIME,PSPACE]-costs for creating (yet again) new memory allocations (the more so if the vector is suspect of being re-used inside an externally orchestrated iterative optimisation process); a usage sketch follows after the decorator example below
- it is always better to specify (so as to narrow down the universality, for the sake of the code performance), at the very least, the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" ) calling-interface directives
- always review all the available numba.jit() options and their respective default values (they may change from version to version) for your specific situation (disabling the GIL and better aligning the goals with the numba + hardware capabilities will always help in the numerically intensive parts of the code)
    @jit( [ numba.float32( numba.float32, numba.int32 ),   #________________ [_v41] @decorator with a list of calling-signatures
            numba.float64( numba.float64, numba.int64 )    #                 for prepared alternative code-paths,
            ],                        #______________________ to avoid a deferred lazy-compilation if left undefined
          nopython = False,           #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
          nogil    = False,           #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
          cache    = False,           #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
          forceobj = False,           #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
          locals   = {}               #__________________ a mapping of local variable names to Numba Types.
          )                           #____________________ [_v41] ZERO: TEST *ALL* CALLED sub-func()-s to get @jit()-ed too >>> [DONE]
    def r...( ... ):
        ...
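And, as referenced in the list above, a usage sketch (illustrative only, not your code) tying those points together: an explicitly narrowed call-signature, nopython + nogil, and a caller-side pre-allocated Ax, so no new allocation gets paid for inside the hot loop.

    import numpy as np
    import numba

    @numba.jit( "f8[:]( f4[:,:], f4[:], f8[:] )",   # narrowed signature, eagerly compiled
                nopython = True,                     # fail loudly instead of a silent object-mode fallback
                nogil    = True,                     # release the GIL inside the compiled kernel
                cache    = True )                    # re-use the compiled artefact across invocations
    def matvec_into( A, x, Ax ):
        M, N = A.shape
        for i in range( M ):
            s = 0.
            for j in range( N ):
                s += A[i, j] * x[j]
            Ax[i] = s
        return Ax

    A  = np.random.rand( 4096, 4096 ).astype( np.float32 )
    x  = np.random.rand( 4096 ).astype( np.float32 )
    Ax = np.zeros_like( A[:, 0], dtype = np.float64 )   # allocated once, widened to match the f8[:] result

    for _ in range( 10 ):                                # e.g. an outer, iterative optimiser loop
        matvec_into( A, x, Ax )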