Increase employment without acceleration in FFT

Problem

I need to calculate many Fourier transforms. I would like to do this in parallel with my cores. Note that I do not need a parallel FFT algorithm, I just want to run many embarrassing parallel FFTs.

I found that while my CPU usage is increasing, my time to completion is not decreasing.

Example

We create some random data.

In [1]: import numpy as np In [2]: x = np.random.random(10000000) # some random data 

And the time required to calculate the FFT of both cold, and after calculating it once.

 In [3]: %time _ = np.fft.rfft(x) # cost of one run CPU times: user 589 ms, sys: 23.9 ms, total: 612 ms Wall time: 613 ms In [4]: %time _ = np.fft.rfft(x) # there is some speedup from mulitple runs CPU times: user 365 ms, sys: 12.4 ms, total: 378 ms Wall time: 381 ms 

We execute this sequence of sequences

 In [5]: %time _ = map(np.fft.rfft, [x] * 12) # many runs sequentially CPU times: user 4.4 s, sys: 135 ms, total: 4.54 s Wall time: 4.54 s In [6]: 4.54 / 12 # Same cost per FFT Out[6]: 0.37833333333333335 

We do the same, but now use a thread pool of four threads.

 In [7]: from multiprocessing.pool import ThreadPool In [8]: pool = ThreadPool(4) # I have four physical cores In [9]: %time _ = pool.map(np.fft.rfft, [x] * 12) CPU times: user 15.5 s, sys: 1.3 s, total: 16.8 s Wall time: 4.79 s 

We find that there is no acceleration. However, we find that CPU utilization, as measured by top , is around 400%. This is not a problem with the GIL. There is something about FFT that does not parallelize. Maybe we break through higher-level caches?

Hardware: Intel (R) Core (TM) i5-3320M CPU @ 2.60 GHz

Question

Typically, what happens here and is there a way to use multiple cores to speed up parallel parallel parallel computing?

+5
source share
1 answer

On my workstation, ThreadPool does acceleration (though not perfect):

 In [42]: x = np.random.random(2**23) In [43]: %time _ = list(map(np.fft.rfft, [x]*12)) CPU times: user 3.32 s, sys: 380 ms, total: 3.7 s Wall time: 3.7 s In [44]: tpool = ThreadPool(4) In [45]: %time _ = list(tpool.map(np.fft.rfft, [x]*12)) CPU times: user 5.4 s, sys: 596 ms, total: 6 s Wall time: 1.62 s In [46]: 3.7/4 Out[46]: 0.925 

I am using Python3, so maybe there is something there? Otherwise, it is most likely hardware. FFTs are memory-related, so it’s possible that one thread will saturate your memory system. You may be able to improve the location of the memory system by dropping it to an environment that allows you to control proximity.

Equipment

Intel (R) Core (TM) i7-4930K CPU @ 3.40 GHz.

+2
source

Source: https://habr.com/ru/post/1232410/


All Articles