An interesting problem. Since I think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". In this test I compared the performance of single-threaded and multi-threaded eig (multi-threading is delivered through MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what makes a difference, I varied the view type, the number of engines and of MKL threads, as well as the way the matrices are distributed over the engines.
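For reference, the two view types compared here can be set up like this (a minimal sketch, assuming a running cluster; the variable names are my own):

from IPython.parallel import Client

rc = Client()                          # connect to the running cluster
direct_view = rc[:]                    # static distribution over all engines
loadb_view = rc.load_balanced_view()   # dynamic, load-balanced task scheduling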
Below are the results for my old dual-core AMD system:
m_size=300, n_mat=64, repeat=3

+------------------------------------+----------------------+
|              settings              |    speedup factor    |
+--------+------+------+-------------+-----------+----------+
|  func  | neng | nmkl |  view type  | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map |  2   |  1   | direct_view |   1.67    |   1.62   |
| ip_map |  2   |  1   | loadb_view  |   1.60    |   1.55   |
| ip_map |  2   |  2   | direct_view |   1.59    |   1.54   |
| ip_map |  2   |  2   | loadb_view  |   0.94    |   0.91   |
| ip_map |  4   |  1   | direct_view |   1.69    |   1.64   |
| ip_map |  4   |  1   | loadb_view  |   1.61    |   1.57   |
| ip_map |  4   |  2   | direct_view |   1.15    |   1.12   |
| ip_map |  4   |  2   | loadb_view  |   0.88    |   0.85   |
| parfor |  2   |  1   | direct_view |   0.81    |   0.79   |
| parfor |  2   |  1   | loadb_view  |   1.61    |   1.56   |
| parfor |  2   |  2   | direct_view |   0.71    |   0.69   |
| parfor |  2   |  2   | loadb_view  |   0.94    |   0.92   |
| parfor |  4   |  1   | direct_view |   0.41    |   0.40   |
| parfor |  4   |  1   | loadb_view  |   1.62    |   1.58   |
| parfor |  4   |  2   | direct_view |   0.34    |   0.33   |
| parfor |  4   |  2   | loadb_view  |   0.90    |   0.88   |
+--------+------+------+-------------+-----------+----------+
As you can see, the performance gain varies greatly depending on the settings used, with a maximum speedup of 1.64x over conventional multi-threaded eig. In these results, the parfor function you used does not perform well unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
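For completeness, roughly what that looks like, assuming the mkl service module is importable on every engine (the get_max_threads call is only a sanity check of mine):

import mkl

# force every engine in the view to use a single MKL thread
view.apply_sync(mkl.set_num_threads, 1)
# sanity check: should report 1 for each engine
print(view.apply_sync(mkl.get_max_threads))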
Varying the matrix size also makes a noticeable difference. Below is the speedup of ip_map on a direct_view with 4 engines and MKL threading disabled, relative to ordinary multi-threaded eig:
n_mat=32, repeat=3

+--------+----------+
| m_size | vs multi |
+--------+----------+
|   50   |   0.78   |
|  100   |   1.44   |
|  150   |   1.71   |
|  200   |   1.75   |
|  300   |   1.68   |
|  400   |   1.60   |
|  500   |   1.57   |
+--------+----------+
Apparently, for relatively small matrices there is a performance penalty; for medium-sized ones the speedup is largest; and for larger matrices it decreases again. So I could achieve a performance gain of 1.75x, which in my opinion would make using IPython.parallel worthwhile.
Earlier, I also did some tests on an Intel dual-core laptop, but I got some odd results, apparently because the laptop was overheating. On that system the speedups were generally a bit lower, around 1.5-1.6x at most.
Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things I'm not aware of. And last but not least, whether it's worth it depends on how much of a performance gain you consider worthwhile.
The code I used:
from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)
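The benchmark driver itself is not shown; here is a minimal sketch of how the ip_map case can be timed against plain multi-threaded eig, assuming a running cluster (n_mat and m_size correspond to the first table, the other names are mine):

n_mat, m_size = 64, 300
matrices = [rand(m_size, m_size) for _ in range(n_mat)]

rc = Client()
view = rc[:]                         # direct_view; rc.load_balanced_view() for loadb_view
view.apply_sync(set_num_threads, 1)  # nmkl=1: one MKL thread per engine

start = clock()
view.map_sync(eig, matrices)         # ip_map: distribute the matrices over the engines
t_parallel = clock() - start

start = clock()
for mat in matrices:                 # reference: ordinary multi-threaded eig
    eig(mat)
t_multi = clock() - start

print('speedup vs multi: %.2f' % (t_multi / t_parallel))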
I don't know if it matters, but to start the 4 IPython parallel engines I typed on the command line:
ipcluster start -n 4
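Once the cluster is up, a quick sanity check (not part of the benchmark) is to verify that all engines have registered:

from IPython.parallel import Client
rc = Client()
print(rc.ids)   # should list the 4 engine ids, e.g. [0, 1, 2, 3]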
Hope this helps :)