Overview:
My answer has two parts:
- Part 1 shows how to obtain more speed-up from @niemmi's `ProcessPoolExecutor.map()` solution.
- Part 2 shows when the `ProcessPoolExecutor` methods `.submit()` and `.map()` yield non-equivalent computation times.
========================================================================
Part 1: More speed-up for `ProcessPoolExecutor.map()`
Background: This section builds on @niemmi's `.map()` solution, which on its own is excellent. While investigating his discretisation scheme to better understand how it interacts with `.map()`, I found this interesting insight.
I regard @niemmi's definition `chunk = nmax // workers` as a definition of chunksize, i.e. the size of the actual sub-range of values (of the given task) that each worker in the pool has to tackle. This definition rests on the assumption that, if a computer has x number of workers, dividing the task equally among them leads to optimal use of each worker, and hence that the overall task completes fastest. It follows that the number of chunks a given task is broken into should always equal the number of pool workers. But is this assumption correct?
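To make the two chunking policies concrete, here is a small sketch (the helper name `chunk_bounds` is mine, not from @niemmi's code):

```python
def chunk_bounds(nmax, num_of_chunks):
    """Split the range [0, nmax) into num_of_chunks contiguous
    (start, stop) sub-ranges; the last chunk absorbs any remainder."""
    chunksize = nmax // num_of_chunks
    return [(i * chunksize,
             nmax if i == num_of_chunks - 1 else (i + 1) * chunksize)
            for i in range(num_of_chunks)]

# The assumed-optimal policy: num_of_chunks == workers.
print(chunk_bounds(100, 4))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
# Discretising more finely than the worker count:
print(chunk_bounds(100, 8))  # first chunk (0, 12), last chunk (84, 100)
```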
Proposition: Here I propose that the above assumption does not always lead to the fastest computation time with `ProcessPoolExecutor.map()`. Rather, discretising a task into a number of chunks greater than the number of pool workers can yield a speed-up, i.e. faster completion of the given task.
Experiment: I modified @niemmi's code to let the number of discretised tasks exceed the number of pool workers. This code is given below and is used to count the number of times the digit 5 appears in the number range 0 to 1E8. I ran this code with 1, 2, 4 and 6 pool workers, and for various ratios of the number of discretised tasks to the number of pool workers. For each scenario, 3 runs were performed and the computation times tabulated. "Speed-up" is defined here as the mean computation time using an equal number of chunks and pool workers divided by the mean computation time when the number of discretised tasks is greater than the number of pool workers.
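A minimal sketch of the experiment's timing loop is shown below; the helper names are mine, and for brevity it uses a much smaller `nmax` than the 1E8 of the actual runs:

```python
import concurrent.futures as cf
import time
from itertools import repeat, chain

def _findmatch(nmin, nmax, number):
    '''Collect every n in [nmin, nmax) whose digits contain number.'''
    return [n for n in range(nmin, nmax) if number in str(n)]

def run_scenario(nmax, number, workers, num_of_chunks):
    '''Time one scenario and return (elapsed_seconds, total_matches).'''
    chunksize = nmax // num_of_chunks
    starts = [i * chunksize for i in range(num_of_chunks)]
    stops = [nmax if i == num_of_chunks - 1 else (i + 1) * chunksize
             for i in range(num_of_chunks)]
    t0 = time.time()
    with cf.ProcessPoolExecutor(max_workers=workers) as executor:
        found = list(chain.from_iterable(
            executor.map(_findmatch, starts, stops, repeat(number))))
    return time.time() - t0, len(found)

if __name__ == '__main__':
    # Small nmax for illustration; the experiment used nmax = int(1E8)
    # and averaged 3 runs per scenario.
    for ratio in (1, 2, 14):
        elapsed, count = run_scenario(10**5, '5', workers=2,
                                      num_of_chunks=2 * ratio)
        print('chunks/workers = {0:2d}: {1:.4f}s, {2} matches'.format(
            ratio, elapsed, count))
```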
Findings:

The figure on the left shows the computation times for all the scenarios mentioned in the Experiment section. It shows that the computation time when number of chunks / number of workers = 1 is always longer than the computation time when number of chunks > number of workers; that is, the former case is always less efficient than the latter.
The figure on the right shows that a speed-up of 1.2x or more was obtained when the ratio of number of chunks to number of workers reached a threshold of 14 or more. Interestingly, the speed-up trend also occurred when `ProcessPoolExecutor.map()` was executed with 1 worker.
Conclusion: When setting the number of discrete tasks that `ProcessPoolExecutor.map()` should use to solve a given task, it is prudent to ensure that this number is greater than the number of pool workers, as this practice shortens the computation time.
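In code, this recommendation amounts to deriving `num_of_chunks` from the worker count instead of setting the two equal. A minimal sketch (the helper name is mine, and the factor of 14 is simply the threshold observed in the experiment above, not a universal constant):

```python
def choose_num_of_chunks(workers, ratio=14):
    """Discretise the task into more chunks than pool workers.

    ratio=14 is the chunks/workers threshold at which the experiment
    above saw a >= 1.2x speed-up; tune it for your own workload.
    """
    return workers * ratio

workers = 6
num_of_chunks = choose_num_of_chunks(workers)
print(num_of_chunks)  # 84
```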
`concurrent.futures.ProcessPoolExecutor.map()` code (revised parts only):

```python
def _concurrent_map(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.map to
       find the occurrences of a given number in a number range in a
       parallelised manner.'''
```
========================================================================
Part 2: The overall computation times of the `ProcessPoolExecutor` methods `.submit()` and `.map()` can be non-equivalent when returning a sorted/ordered result list.
Background: I revised @niemmi's `.submit()` and `.map()` codes to allow an "apple-to-apple" comparison of their computation times, and to visualise the computation time of the main code, the computation time of the `_concurrent` method called by the main code to perform the parallel operations, and the computation time of each discretised task/worker called by the `_concurrent` method. Furthermore, the `_concurrent` method in these codes was structured to return an unordered and an ordered list of results directly from the future objects of `.submit()` and the iterator of `.map()`. The source code is provided below (I hope it helps you).
Experiments: These two revised codes were used to perform the same experiment described in Part 1, except that only 6 pool workers were considered, and the Python built-ins `list` and `sorted` were used to return the unordered and the ordered list of results to the main section of the code, respectively.
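The two return paths can be sketched as follows; this is a simplified stand-in for the actual codes, and `square` is a toy task of mine rather than the real per-chunk function:

```python
import concurrent.futures as cf

def square(n):
    '''Toy stand-in for the real per-chunk task.'''
    return n * n

def run_demo():
    '''Return (ordered, mapped) result lists from the two code paths.'''
    tasks = [3, 1, 2]
    with cf.ProcessPoolExecutor(max_workers=2) as executor:
        # .submit() path: futures complete in arbitrary order, so the
        # raw results are unordered; an ordered list needs sorted().
        futures = [executor.submit(square, n) for n in tasks]
        unordered = [f.result() for f in cf.as_completed(futures)]
        # .map() path: the iterator yields results in input order,
        # so list() already preserves the task order.
        mapped = list(executor.map(square, tasks))
    return sorted(unordered), mapped

if __name__ == '__main__':
    ordered, mapped = run_demo()
    print(ordered)  # [1, 4, 9]
    print(mapped)   # [9, 1, 4]
```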
Findings:

- From the timings of the `_concurrent` method, we can see that the times the `_concurrent` method takes to create all the Future objects of `ProcessPoolExecutor.submit()`, and to create the iterator of `ProcessPoolExecutor.map()`, as a function of the number of discretised tasks over the number of pool workers, are equivalent. This result simply means that the `ProcessPoolExecutor` methods `.submit()` and `.map()` are equally efficient/fast.
- Comparing the computation times of main and of its `_concurrent` method, we can see that main ran longer than its `_concurrent` method. This is to be expected, as their time difference reflects the time spent in the `list` and `sorted` method calls (and in the other methods nested within them). Clearly, the `list` method took less time to return a result list than the `sorted` method. The mean computation time of the `list` method for both the `.submit()` and `.map()` codes was similar, at ~0.47 sec. The mean computation time of the `sorted` method for the `.submit()` and `.map()` codes was 1.23 sec and 1.01 sec, respectively. In other words, the `list` method performed 2.62 and 2.15 times faster than the `sorted` method for the `.submit()` and `.map()` codes, respectively.
- It is unclear why the `sorted` method produced an ordered list faster from `.map()` than from `.submit()` whenever the number of discretised tasks exceeded the number of pool workers, except when the number of discretised tasks equalled the number of pool workers. These findings do show, however, that the decision to use the equally fast `.submit()` or `.map()` methods can be encumbered by the `sorted` method. For example, if the intent is to generate an ordered list in the shortest time possible, the use of `ProcessPoolExecutor.map()` should be preferred over `ProcessPoolExecutor.submit()`, as `.map()` allows the shortest overall computation time.
- The discretisation scheme mentioned in Part 1 of my answer is shown here to speed up the performance of both the `.submit()` and `.map()` methods. The speed-up can be as much as 20% relative to the case where the number of discretised tasks equals the number of pool workers.
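To make the `list`-vs-`sorted` cost difference concrete, here is a small timing sketch; the helper and the synthetic data are mine, whereas the real experiment post-processed the matches found in 0..1E8:

```python
import random
from time import time

def time_postprocessing(results, runs=3):
    '''Average the wall-clock cost of the two post-processing options:
       list() (unordered results) vs sorted() (ordered results).'''
    timings = {}
    for label, fn in (('list', list), ('sorted', sorted)):
        start = time()
        for _ in range(runs):
            fn(results)
        timings[label] = (time() - start) / runs
    return timings

# Synthetic, shuffled results stand in for the real match lists.
data = list(range(10**5))
random.shuffle(data)
t = time_postprocessing(data)
print('list: {0:.6f}s  sorted: {1:.6f}s'.format(t['list'], t['sorted']))
```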
Revised `.map()` code:

```python
#!/usr/bin/python3.5
# -*- coding: utf-8 -*-

import concurrent.futures as cf
from time import time
from itertools import repeat, chain


def _findmatch(nmin, nmax, number):
    '''Function to find the occurrence of number in range nmin to nmax and
       return the found occurrences in a list.'''
    start = time()
    match = []
    for n in range(nmin, nmax):
        if number in str(n):
            match.append(n)
    end = time() - start
    #print("\n def _findmatch {0:<10} {1:<10} {2:<3} found {3:8} in {4:.4f}sec".
    #      format(nmin, nmax, number, len(match), end))
    return match


def _concurrent(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.map to
       find the occurrences of a given number in a number range in a
       concurrent manner.'''
    # 1. Local variables
    start = time()
    chunksize = nmax // num_of_chunks
    # (The remainder of this function was truncated in the original post;
    #  the lines below are a reconstruction from the chunking scheme
    #  described in Part 1.)
    # 2. Parallelisation: feed the chunk boundaries to .map()
    with cf.ProcessPoolExecutor(max_workers=workers) as executor:
        cstart = (chunksize * i for i in range(num_of_chunks))
        cstop = (chunksize * (i + 1) if i != num_of_chunks - 1 else nmax
                 for i in range(num_of_chunks))
        found = list(chain.from_iterable(
            executor.map(_findmatch, cstart, cstop, repeat(number))))
    end = time() - start
    print('\n within statement of def _concurrent'
          '(nmax, number, workers, num_of_chunks):')
    print('found in {0:.4f}sec'.format(end))
    return found
```
Revised `.submit()` code:
This code is the same as the `.map()` code above, except that the `_concurrent` method is replaced with the following:
```python
def _concurrent(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.submit to
       find the occurrences of a given number in a number range in a
       concurrent manner.'''
```
========================================================================