Should I use IPython parallel with scipy's eig?

I am writing code that has to solve a large number of eigenvalue problems (the typical matrix dimension is a few hundred). I was wondering whether it is possible to speed this up with the IPython.parallel module. As a former MATLAB user, I was looking for something similar to MATLAB's parfor...

Following some online tutorials, I wrote a simple piece of code to check whether it speeds up the calculation at all, and I found that it doesn't, and often actually slows things down (depending on the case). I think I may be missing the point here: perhaps scipy.linalg.eig is implemented in such a way that it already uses all available cores, and by trying to parallelize it myself I only get in the way of its own thread management.
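One quick way I could check this suspicion (just a sketch of the idea, the output still needs interpreting) is to print the build configuration of NumPy and SciPy and look for a threaded BLAS such as MKL:

    import numpy as np
    import scipy

    # Show which BLAS/LAPACK libraries NumPy and SciPy were built against;
    # if MKL (or another threaded BLAS) shows up here, eig may already be
    # using several cores on its own.
    np.__config__.show()
    scipy.__config__.show()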

Here is the parallel code:

    import numpy as np
    from scipy.linalg import eig
    from IPython import parallel

    # create the matrices
    matrix_size = 300
    matrices = {}
    for i in range(100):
        matrices[i] = np.random.rand(matrix_size, matrix_size)

    rc = parallel.Client()
    lview = rc.load_balanced_view()

    # compute the eigenvalues
    results = {}
    for i in range(len(matrices)):
        asyncresult = lview.apply(eig, matrices[i], right=False)
        results[i] = asyncresult
    for i, asyncresult in results.iteritems():
        results[i] = asyncresult.get()

And the non-parallel version:

    # no parallel
    for i in range(len(matrices)):
        results[i] = eig(matrices[i], right=False)

The difference in CPU time between the two is very subtle. And if the parallelized function has to do some additional matrix operations on top of the eigenvalue problem, it starts to take forever, i.e. at least 5 times longer than the non-parallelized version.

Is it true that eigenvalue problems are just not suited to this kind of parallelization, or am I missing the whole point?

Thank you very much!

EDITED July 29, 2013; 12:20 BST

Following moarningsun's suggestion, I tried running eig while fixing the number of threads with mkl.set_num_threads. For a 500-by-500 matrix, the minimum times over 50 repetitions are as follows:

    No. of threads   minimum time (timeit)   CPU usage (Task Manager)
    =================================================================
          1          0.4513775764796151      12-13%
          2          0.36869288559927327     25-27%
          3          0.34014644287680085     38-41%
          4          0.3380558903450037      49-53%
          5          0.33508234276183657     49-53%
          6          0.3379019065051807      49-53%
          7          0.33858615048501406     49-53%
          8          0.34488405094054997     49-53%
          9          0.33380300334101776     49-53%
         10          0.3288481198342197      49-53%
         11          0.3512653110685733      49-53%
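The timing was done roughly like this (a minimal sketch, not my exact script; it assumes the mkl service module that provides set_num_threads):

    import timeit
    import numpy as np
    import mkl
    from scipy.linalg import eig

    a = np.random.rand(500, 500)

    for n_threads in range(1, 12):
        mkl.set_num_threads(n_threads)
        # minimum over 50 repetitions of a single eig call
        t = min(timeit.repeat(lambda: eig(a, right=False), repeat=50, number=1))
        print(n_threads, t)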

Apart from the single-thread case there is no significant difference (maybe 50 samples is a bit too small...). I still think I'm missing something and that a lot could be done to improve performance, but I'm not really sure how. These runs were done on a 4-core machine with hyperthreading enabled.

Thanks for any input!

1 answer

An interesting problem. Since I think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". With this benchmark I compared the performance of single-threaded and multi-threaded eig (multithreading provided through the MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what makes the difference, I varied the view type, the number of engines and the number of MKL threads, as well as the way the matrices are distributed over the engines.

Below are the results on an old AMD dual-core system:

    m_size=300, n_mat=64, repeat=3
    +------------------------------------+----------------------+
    |              settings              |    speedup factor    |
    +--------+------+------+-------------+-----------+----------+
    |  func  | neng | nmkl |  view type  | vs single | vs multi |
    +--------+------+------+-------------+-----------+----------+
    | ip_map |   2  |   1  | direct_view |    1.67   |   1.62   |
    | ip_map |   2  |   1  | loadb_view  |    1.60   |   1.55   |
    | ip_map |   2  |   2  | direct_view |    1.59   |   1.54   |
    | ip_map |   2  |   2  | loadb_view  |    0.94   |   0.91   |
    | ip_map |   4  |   1  | direct_view |    1.69   |   1.64   |
    | ip_map |   4  |   1  | loadb_view  |    1.61   |   1.57   |
    | ip_map |   4  |   2  | direct_view |    1.15   |   1.12   |
    | ip_map |   4  |   2  | loadb_view  |    0.88   |   0.85   |
    | parfor |   2  |   1  | direct_view |    0.81   |   0.79   |
    | parfor |   2  |   1  | loadb_view  |    1.61   |   1.56   |
    | parfor |   2  |   2  | direct_view |    0.71   |   0.69   |
    | parfor |   2  |   2  | loadb_view  |    0.94   |   0.92   |
    | parfor |   4  |   1  | direct_view |    0.41   |   0.40   |
    | parfor |   4  |   1  | loadb_view  |    1.62   |   1.58   |
    | parfor |   4  |   2  | direct_view |    0.34   |   0.33   |
    | parfor |   4  |   2  | loadb_view  |    0.90   |   0.88   |
    +--------+------+------+-------------+-----------+----------+

As you can see, the performance gain varies greatly with the settings used, with a maximum of 1.64 times faster than regular multi-threaded eig. In these results the parfor function you used performs poorly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
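For example, disabling MKL threading on all engines before running parfor could look roughly like this (a sketch; it assumes the mkl service module is importable on the engines, as in the benchmark code further down):

    from IPython.parallel import Client
    from mkl import set_num_threads

    rc = Client()
    view = rc.direct_view()
    # force every engine to use a single MKL thread, so the engines
    # don't compete with MKL's own threads for the physical cores
    view.apply_sync(set_num_threads, 1)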

Changing the matrix size also makes a noticeable difference. Here is the speedup of ip_map on a direct_view with 4 engines and MKL threading disabled, relative to regular multi-threaded eig:

    n_mat=32, repeat=3
    +--------+----------+
    | m_size | vs multi |
    +--------+----------+
    |     50 |   0.78   |
    |    100 |   1.44   |
    |    150 |   1.71   |
    |    200 |   1.75   |
    |    300 |   1.68   |
    |    400 |   1.60   |
    |    500 |   1.57   |
    +--------+----------+

Apparently there is a performance penalty for relatively small matrices, the speedup is largest for medium-sized ones, and for larger matrices it decreases again. In the best case I could achieve a performance gain of 1.75, which in my opinion would make using IPython.parallel worthwhile.

Earlier I also did some tests on an Intel dual-core laptop, but I got some funny results; apparently the laptop was overheating. On that system the speedups were generally a bit lower, about 1.5-1.6 at most.

So I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things I'm not aware of. And last but not least, whether it's worth it also depends on how much performance gain you think is worthwhile.

The code I used:

    from __future__ import print_function
    from numpy.random import rand
    from IPython.parallel import Client
    from mkl import set_num_threads
    from timeit import default_timer as clock
    from scipy.linalg import eig
    from functools import partial
    from itertools import product

    eig = partial(eig, right=False)  # desired keyword arg as standard


    class Bench(object):
        def __init__(self, m_size, n_mat, repeat=3):
            self.n_mat = n_mat
            self.matrix = rand(n_mat, m_size, m_size)
            self.repeat = repeat
            self.rc = Client()

        def map(self):
            # plain in-process map, used as the single/multi-threaded baseline
            results = map(eig, self.matrix)

        def ip_map(self):
            # IPython.parallel map over the current view
            results = self.view.map_sync(eig, self.matrix)

        def parfor(self):
            # apply_async per matrix and collect, like the "parfor" in the question
            results = {}
            for i in range(self.n_mat):
                results[i] = self.view.apply_async(eig, self.matrix[i, :, :])
            for i in range(self.n_mat):
                results[i] = results[i].get()

        def timer(self, func):
            t = clock()
            func()
            return clock() - t

        def run(self, func, n_engines, n_mkl, view_method):
            self.view = view_method(range(n_engines))
            self.view.apply_sync(set_num_threads, n_mkl)  # MKL threads on the engines
            set_num_threads(n_mkl)                        # MKL threads in this process
            return min(self.timer(func) for _ in range(self.repeat))

        def run_all(self):
            funcs = self.ip_map, self.parfor
            n_engines = 2, 4
            n_mkls = 1, 2
            views = self.rc.direct_view, self.rc.load_balanced_view

            times = []
            # baselines: eig without IPython, with 1 and 2 MKL threads
            for n_mkl in n_mkls:
                args = self.map, 0, n_mkl, views[0]
                times.append(self.run(*args))
            # all combinations of function, engine count, MKL threads and view type
            for args in product(funcs, n_engines, n_mkls, views):
                times.append(self.run(*args))
            return times
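A hypothetical way to drive this class (assuming an ipcluster with at least 4 engines is already running) would be:

    if __name__ == '__main__':
        # run all configurations once and print the raw minimum times
        bench = Bench(m_size=300, n_mat=64, repeat=3)
        print(bench.run_all())

run_all returns the raw minimum times; the speedup factors in the tables above were computed from times like these.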

I don't know whether it matters, but to start the 4 parallel IPython engines I typed on the command line:

 ipcluster start -n 4 

Hope this helps :)
