Overview:
My answer has two parts:
- Part 1 shows how to obtain more speed-up from @niemmi's `ProcessPoolExecutor.map()` solution.
- Part 2 shows when the `ProcessPoolExecutor` methods `.submit()` and `.map()` yield non-equivalent computation times.
========================================================================
Part 1: More speed-up for `ProcessPoolExecutor.map()`
Background: This section builds on @niemmi's `.map()` solution, which on its own is excellent. While investigating his discretisation scheme to better understand how it interacts with `.map()`, I found this interesting insight.
I regard @niemmi's definition `chunk = nmax // workers` as a definition of chunksize, i.e. the size of the actual sub-range of values (of the given task) that each worker in the pool has to tackle. This definition rests on the assumption that, if a computer has x number of workers, dividing the task equally among them leads to optimal use of each worker, and hence that the overall task completes fastest. It follows that the number of chunks a given task is broken into should always equal the number of pool workers. But is this assumption correct?
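To make the two chunking policies concrete, here is a small sketch (the helper name `chunk_bounds` is mine, not from @niemmi's code):

```python
def chunk_bounds(nmax, num_of_chunks):
    """Split the range [0, nmax) into num_of_chunks contiguous
    (start, stop) sub-ranges; the last chunk absorbs any remainder."""
    chunksize = nmax // num_of_chunks
    return [(i * chunksize,
             nmax if i == num_of_chunks - 1 else (i + 1) * chunksize)
            for i in range(num_of_chunks)]

# The assumed-optimal policy: num_of_chunks == workers.
print(chunk_bounds(100, 4))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
# Discretising more finely than the worker count:
print(chunk_bounds(100, 8))  # first chunk (0, 12), last chunk (84, 100)
```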
Proposition: Here I propose that the above assumption does not always lead to the fastest computation time with `ProcessPoolExecutor.map()`. Rather, discretising a task into a number of chunks greater than the number of pool workers can yield a speed-up, i.e. faster completion of the given task.
Experiment: I modified @niemmi's code to let the number of discretised tasks exceed the number of pool workers. This code is given below and is used to count the number of times the digit 5 appears in the number range 0 to 1E8. I ran this code with 1, 2, 4 and 6 pool workers, and for various ratios of the number of discretised tasks to the number of pool workers. For each scenario, 3 runs were performed and the computation times tabulated. "Speed-up" is defined here as the mean computation time using an equal number of chunks and pool workers divided by the mean computation time when the number of discretised tasks is greater than the number of pool workers.
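A minimal sketch of the experiment's timing loop is shown below; the helper names are mine, and for brevity it uses a much smaller `nmax` than the 1E8 of the actual runs:

```python
import concurrent.futures as cf
import time
from itertools import repeat, chain

def _findmatch(nmin, nmax, number):
    '''Collect every n in [nmin, nmax) whose digits contain number.'''
    return [n for n in range(nmin, nmax) if number in str(n)]

def run_scenario(nmax, number, workers, num_of_chunks):
    '''Time one scenario and return (elapsed_seconds, total_matches).'''
    chunksize = nmax // num_of_chunks
    starts = [i * chunksize for i in range(num_of_chunks)]
    stops = [nmax if i == num_of_chunks - 1 else (i + 1) * chunksize
             for i in range(num_of_chunks)]
    t0 = time.time()
    with cf.ProcessPoolExecutor(max_workers=workers) as executor:
        found = list(chain.from_iterable(
            executor.map(_findmatch, starts, stops, repeat(number))))
    return time.time() - t0, len(found)

if __name__ == '__main__':
    # Small nmax for illustration; the experiment used nmax = int(1E8)
    # and averaged 3 runs per scenario.
    for ratio in (1, 2, 14):
        elapsed, count = run_scenario(10**5, '5', workers=2,
                                      num_of_chunks=2 * ratio)
        print('chunks/workers = {0:2d}: {1:.4f}s, {2} matches'.format(
            ratio, elapsed, count))
```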
Findings:

The figure on the left shows the computation times for all the scenarios mentioned in the Experiment section. It shows that the computation time when number of chunks / number of workers = 1 is always longer than the computation time when number of chunks > number of workers; that is, the former case is always less efficient than the latter.
The figure on the right shows that a speed-up of 1.2x or more was obtained when the ratio of number of chunks to number of workers reached a threshold of 14 or more. Interestingly, the speed-up trend also occurred when `ProcessPoolExecutor.map()` was executed with 1 worker.
Conclusion: When setting the number of discrete tasks that `ProcessPoolExecutor.map()` should use to solve a given task, it is prudent to ensure that this number is greater than the number of pool workers, as this practice shortens the computation time.
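In code, this recommendation amounts to deriving `num_of_chunks` from the worker count instead of setting the two equal. A minimal sketch (the helper name is mine, and the factor of 14 is simply the threshold observed in the experiment above, not a universal constant):

```python
def choose_num_of_chunks(workers, ratio=14):
    """Discretise the task into more chunks than pool workers.

    ratio=14 is the chunks/workers threshold at which the experiment
    above saw a >= 1.2x speed-up; tune it for your own workload.
    """
    return workers * ratio

workers = 6
num_of_chunks = choose_num_of_chunks(workers)
print(num_of_chunks)  # 84
```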
`concurrent.futures.ProcessPoolExecutor.map()` code (revised parts only):

```python
def _concurrent_map(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.map to
       find the occurrences of a given number in a number range in a
       parallelised manner.'''
```
========================================================================
Part 2: The overall computation times of the `ProcessPoolExecutor` methods `.submit()` and `.map()` can be non-equivalent when returning a sorted/ordered result list.
Background: I revised @niemmi's `.submit()` and `.map()` codes to allow an "apple-to-apple" comparison of their computation times, and to visualise the computation time of the main code, the computation time of the `_concurrent` method called by the main code to perform the parallel operations, and the computation time of each discretised task/worker called by the `_concurrent` method. Furthermore, the `_concurrent` method in these codes was structured to return an unordered and an ordered list of results directly from the future objects of `.submit()` and the iterator of `.map()`. The source code is provided below (I hope it helps you).
Experiments: These two revised codes were used to perform the same experiment described in Part 1, except that only 6 pool workers were considered, and the Python built-ins `list` and `sorted` were used to return the unordered and the ordered list of results to the main section of the code, respectively.
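The two return paths can be sketched as follows; this is a simplified stand-in for the actual codes, and `square` is a toy task of mine rather than the real per-chunk function:

```python
import concurrent.futures as cf

def square(n):
    '''Toy stand-in for the real per-chunk task.'''
    return n * n

def run_demo():
    '''Return (ordered, mapped) result lists from the two code paths.'''
    tasks = [3, 1, 2]
    with cf.ProcessPoolExecutor(max_workers=2) as executor:
        # .submit() path: futures complete in arbitrary order, so the
        # raw results are unordered; an ordered list needs sorted().
        futures = [executor.submit(square, n) for n in tasks]
        unordered = [f.result() for f in cf.as_completed(futures)]
        # .map() path: the iterator yields results in input order,
        # so list() already preserves the task order.
        mapped = list(executor.map(square, tasks))
    return sorted(unordered), mapped

if __name__ == '__main__':
    ordered, mapped = run_demo()
    print(ordered)  # [1, 4, 9]
    print(mapped)   # [9, 1, 4]
```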
Findings:

- From the timings of the `_concurrent` method, we can see that the times the `_concurrent` method takes to create all the Future objects of `ProcessPoolExecutor.submit()`, and to create the iterator of `ProcessPoolExecutor.map()`, as a function of the number of discretised tasks over the number of pool workers, are equivalent. This result simply means that the `ProcessPoolExecutor` methods `.submit()` and `.map()` are equally efficient/fast.
- Comparing the computation times of main and of its `_concurrent` method, we can see that main ran longer than its `_concurrent` method. This is to be expected, as their time difference reflects the time spent in the `list` and `sorted` method calls (and in the other methods nested within them). Clearly, the `list` method took less time to return a result list than the `sorted` method. The mean computation time of the `list` method for both the `.submit()` and `.map()` codes was similar, at ~0.47 sec. The mean computation time of the `sorted` method for the `.submit()` and `.map()` codes was 1.23 sec and 1.01 sec, respectively. In other words, the `list` method performed 2.62 and 2.15 times faster than the `sorted` method for the `.submit()` and `.map()` codes, respectively.
- It is unclear why the `sorted` method produced an ordered list faster from `.map()` than from `.submit()` whenever the number of discretised tasks exceeded the number of pool workers, except when the number of discretised tasks equalled the number of pool workers. These findings do show, however, that the decision to use the equally fast `.submit()` or `.map()` methods can be encumbered by the `sorted` method. For example, if the intent is to generate an ordered list in the shortest time possible, the use of `ProcessPoolExecutor.map()` should be preferred over `ProcessPoolExecutor.submit()`, as `.map()` allows the shortest overall computation time.
- The discretisation scheme mentioned in Part 1 of my answer is shown here to speed up the performance of both the `.submit()` and `.map()` methods. The speed-up can be as much as 20% relative to the case where the number of discretised tasks equals the number of pool workers.
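To make the `list`-vs-`sorted` cost difference concrete, here is a small timing sketch; the helper and the synthetic data are mine, whereas the real experiment post-processed the matches found in 0..1E8:

```python
import random
from time import time

def time_postprocessing(results, runs=3):
    '''Average the wall-clock cost of the two post-processing options:
       list() (unordered results) vs sorted() (ordered results).'''
    timings = {}
    for label, fn in (('list', list), ('sorted', sorted)):
        start = time()
        for _ in range(runs):
            fn(results)
        timings[label] = (time() - start) / runs
    return timings

# Synthetic, shuffled results stand in for the real match lists.
data = list(range(10**5))
random.shuffle(data)
t = time_postprocessing(data)
print('list: {0:.6f}s  sorted: {1:.6f}s'.format(t['list'], t['sorted']))
```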
Revised `.map()` code:

```python
#!/usr/bin/python3.5
# -*- coding: utf-8 -*-

import concurrent.futures as cf
from time import time
from itertools import repeat, chain


def _findmatch(nmin, nmax, number):
    '''Function to find the occurrence of number in range nmin to nmax and
       return the found occurrences in a list.'''
    start = time()
    match = []
    for n in range(nmin, nmax):
        if number in str(n):
            match.append(n)
    end = time() - start
    #print("\n def _findmatch {0:<10} {1:<10} {2:<3} found {3:8} in {4:.4f}sec".
    #      format(nmin, nmax, number, len(match), end))
    return match


def _concurrent(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.map to
       find the occurrences of a given number in a number range in a
       concurrent manner.'''
    # 1. Local variables
    start = time()
    chunksize = nmax // num_of_chunks
    # (The remainder of this function was truncated in the original post;
    #  the lines below are a reconstruction from the chunking scheme
    #  described in Part 1.)
    # 2. Parallelisation: feed the chunk boundaries to .map()
    with cf.ProcessPoolExecutor(max_workers=workers) as executor:
        cstart = (chunksize * i for i in range(num_of_chunks))
        cstop = (chunksize * (i + 1) if i != num_of_chunks - 1 else nmax
                 for i in range(num_of_chunks))
        found = list(chain.from_iterable(
            executor.map(_findmatch, cstart, cstop, repeat(number))))
    end = time() - start
    print('\n within statement of def _concurrent'
          '(nmax, number, workers, num_of_chunks):')
    print('found in {0:.4f}sec'.format(end))
    return found
```
Revised `.submit()` code:
This code is the same as the `.map()` code above, except that the `_concurrent` method is replaced with the following:
```python
def _concurrent(nmax, number, workers, num_of_chunks):
    '''Function that utilises concurrent.futures.ProcessPoolExecutor.submit to
       find the occurrences of a given number in a number range in a
       concurrent manner.'''
```
========================================================================