Expected speedup from an embarrassingly parallel task using Python multiprocessing

I am studying the use of the Python multiprocessing package for embarrassingly parallel problems, so I wrote serial and parallel versions of a program that determines the number of primes less than or equal to a natural number n. Based on what I read from a blog post and a Stack Overflow question, I came up with the following code:

Serial

    import math
    import time

    def is_prime(start, end):
        """Determine how many primes lie within the given range."""
        numPrime = 0
        for n in range(start, end+1):
            isPrime = True
            for i in range(2, math.floor(math.sqrt(n))+1):
                if n % i == 0:
                    isPrime = False
                    break
            if isPrime:
                numPrime += 1
        if start == 1:
            numPrime -= 1  # since 1 is not prime
        return numPrime

    if __name__ == "__main__":
        natNum = 0
        while natNum < 2:
            natNum = int(input('Enter a natural number greater than 1: '))
        startTime = time.time()
        finalResult = is_prime(1, natNum)
        print('Elapsed time:', time.time()-startTime, 'seconds')
        print('The number of primes <=', natNum, 'is', finalResult)

Parallel

    import math
    import multiprocessing as mp
    import numpy
    import time

    def is_prime(vec, output):
        """Determine how many primes are in the given vector."""
        numPrime = 0
        for n in vec:
            isPrime = True
            for i in range(2, math.floor(math.sqrt(n))+1):
                if n % i == 0:
                    isPrime = False
                    break
            if isPrime:
                numPrime += 1
        if vec[0] == 1:
            numPrime -= 1  # since 1 is not prime
        output.put(numPrime)

    def chunks(vec, n):
        """Evenly divide a list into chunks of size n."""
        for i in range(0, len(vec), n):
            yield vec[i:i+n]

    if __name__ == "__main__":
        natNum = 0
        while natNum < 2:
            natNum = int(input('Enter a natural number greater than 1: '))
        numProc = 0
        while numProc < 1:
            numProc = int(input('Enter the number of desired parallel processes: '))
        startTime = time.time()
        numSplits = math.ceil(natNum/numProc)
        splitList = list(chunks(tuple(range(1, natNum+1)), numSplits))
        output = mp.Queue()
        processes = [mp.Process(target=is_prime, args=(splitList[jobID], output))
                     for jobID in range(numProc)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print('Elapsed time:', time.time()-startTime, 'seconds')
        procResults = [output.get() for p in processes]
        finalResult = numpy.sum(numpy.array(procResults))
        print('Results from each process:\n', procResults)
        print('The number of primes <=', natNum, 'is', finalResult)

Here is what I get for n = 10000000 (with 8 processes requested in the parallel case):

    $ python serial_prime_test.py
    Enter a natural number greater than 1: 10000000
    Elapsed time: 162.1960825920105 seconds
    The number of primes <= 10000000 is 664579

    $ python parallel_prime_test.py
    Enter a natural number greater than 1: 10000000
    Enter the number of desired parallel processes: 8
    Elapsed time: 49.41204643249512 seconds
    Results from each process:
     [96469, 86603, 83645, 80303, 81796, 79445, 78589, 77729]
    The number of primes <= 10000000 is 664579

So it looks like I can get a little over a 3x speedup. Here are my questions:

  • Clearly this is not linear speedup, so how much better can I do (or what speedup should I realistically expect)?
  • It seems that Amdahl's Law addresses this, but I don't know how to determine what fraction of my program is strictly serial.

Any help is appreciated.

Edit: The machine has 4 physical cores, capable of hyperthreading.

2 answers

I think you want to split the work differently.

Although your program divides the range of candidate integers evenly across the cores, the work within each range is unlikely to be even. That means some cores finish early and have nothing to do while others are still running. This hurts parallel efficiency.

To see why, imagine you have 1000 cores. The first core sees only very small candidate numbers, which take little time to factor, and then it sits idle. The last (thousandth) core sees only very large candidates, which take much longer to factor, so it is still running while the first core sits idle. Wasted cycles. The same happens, less dramatically, with 4 cores.

What you want to do, when the amount of work handed to a core is unpredictable, is to hand all the cores many modest-sized chunks, many more chunks than there are cores. Then the cores can finish unevenly, and each core that finishes goes back to find a bit more work. This is essentially a work-list algorithm. You still end up with unevenness at the very end, but only on small chunks, so not much is wasted.
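
In Python terms (my own sketch, not code from the question or from PARLANSE), the work-list idea falls out of multiprocessing.Pool almost for free: hand the pool many small ranges and let idle workers pull the next one. The count_primes_in_range helper and the chunk size of 10000 are illustrative choices, not anything from the original post:

    import math
    import multiprocessing as mp

    def count_primes_in_range(bounds):
        """Count primes in the inclusive range [start, end] by trial division."""
        start, end = bounds
        count = 0
        for n in range(max(start, 2), end + 1):
            if all(n % i for i in range(2, math.isqrt(n) + 1)):
                count += 1
        return count

    if __name__ == "__main__":
        natNum = 10000000
        chunk = 10000  # many more chunks than cores, so fast workers just grab more
        ranges = [(lo, min(lo + chunk - 1, natNum))
                  for lo in range(1, natNum + 1, chunk)]
        with mp.Pool() as pool:
            # imap_unordered hands ranges to workers as they free up: a work list
            total = sum(pool.imap_unordered(count_primes_in_range, ranges))
        print('The number of primes <=', natNum, 'is', total)

Because the expensive (large-candidate) ranges are spread across many small chunks, no single worker gets stuck with all of them.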

I am not a Python programmer, so I coded the solution in Parlanse instead.

    (includeunique `Console.par')
    (includeunique `Timer.par')

    (define upper_limit 10000000)
    (define candidates_per_segment 10)
    (define candidates_per_segment2 (constant (* candidates_per_segment 2)))

    (= [prime_count natural] 0)
    [prime_finding_team team]

    (define primes_in_segment
       (action (procedure [lower natural] [upper natural])
          (;;
             (do [candidate natural] lower upper 2
                (block test_primality
                   (local (= [divisor natural] 3)
                      (;;
                         (while (< (* divisor divisor) candidate)
                            (ifthenelse (== (modulo candidate divisor) 0)
                               (exitblock test_primality)
                               (+= divisor 2)
                            )ifthenelse
                         )while
                         (ifthen (~= (* divisor divisor) candidate)
                            (consume (atomic+= prime_count))
                         )ifthen
                      );;
                   )local
                )block
             )do
          );;
       )action
    )define

    (define main
       (action (procedure void)
          (local [timer Timer:Timer]
             (;;
                (Console:Put (. `Number of primes found: '))
                (Timer:Reset (. timer))
                (do [i natural] 1 upper_limit candidates_per_segment2
                   (consume (draft prime_finding_team primes_in_segment
                               `lower':i
                               `upper':(minimum upper_limit (- (+ i candidates_per_segment2) 2))))
                )do
                (consume (wait (@ (event prime_finding_team))))
                (Timer:Stop (. timer))
                (Console:PutNatural prime_count)
                (Console:PutNewline)
                (Timer:PrintElapsedTime (. timer) (. `Parallel computed in '))
                (Console:PutNewline)
             );;
          )local
       )action
    )define

Parlanse looks like LISP, but it behaves and compiles more like C.

The worker is primes_in_segment; it accepts a range of candidate values defined by its lower and upper parameters. It tries each candidate in that range and (atomically) increments the shared prime_count if that candidate is prime.

The full range is broken into small packets of ranges (sequences of odd numbers) in the do loop in main. The parallelism happens in the draft construct, which creates a parallel-executing grain of computation (not a Windows thread) and adds it to prime_finding_team, an aggregate set of work representing all the prime checking. (The purpose of a team is to allow all this work to be managed as a unit, e.g., destroyed if necessary, which isn't needed in this program.) The arguments to draft are the function to be run by the forked grain and its parameters. The work is carried out by a Parlanse-managed set of (Windows) threads using a work-stealing algorithm. If there is already too much work, Parlanse throttles the work-generating grains and spends its energy running grains that are pure computation.

One could hand only a single candidate value to each grain, but then there is more fork overhead per candidate and the total runtime gets correspondingly worse. We chose 10 empirically to ensure that the fork overhead per range of candidates is small; raising candidates_per_segment to 1000 doesn't buy much additional speedup.

The do loop simply manufactures work as fast as it can. Parlanse throttles the draft steps when there is enough parallelism to be useful. The wait on the team event makes the main program wait for all team members to complete.

We ran this on an AMD Phenom II X6 1090T six-core processor clocked at 3.2 GHz. The runs are shown below; first with 1 processor:

    >run -p1 -v ..\teamprimes
    PARLANSE RTS: Version 19.1.53
    # Processors = 1
    Number of primes found: 664579
    Parallel computed in 13.443294 seconds
    ---- PARLANSE RTS: Performance Statistics
      Duration = 13.527557 seconds.
      CPU Time Statistics:
        Kernel CPU Time: 0.031s
        User   CPU Time: 13.432s
      Memory Statistics:
        Peak Memory Usage : 4.02 MBytes
      Steals: 0  Attempts: 0  Success rate: 0.0%  Work Rediscovered: 0
    Exiting with final status 0.

Then for 6 processors (scales well):

    >run -p6 -v ..\teamprimes
    PARLANSE RTS: Version 19.1.53
    # Processors = 6
    Number of primes found: 664579
    Parallel computed in 2.443123 seconds
    ---- PARLANSE RTS: Performance Statistics
      Duration = 2.538972 seconds.
      CPU Time Statistics:
        Kernel CPU Time: 0.000s
        User   CPU Time: 14.102s
        Total  CPU Time: 14.102s
      Memory Statistics:
        Peak Memory Usage : 4.28 MBytes
      Steals: 459817  Attempts: 487334  Success rate: 94.4%  Work Rediscovered: 153

Note that the total CPU time for the parallel version is roughly the same as for the serial version; that is because they are doing the same amount of work.

Given Python's fork- and join-style primitives, I am sure there is a Python equivalent you can easily code up. It might run out of memory or threads because of the possibility of too many forks existing at the same time (with candidates_per_segment at 10, there would be up to 1 million live grains under Parlanse). This is where automatic throttling of work generation is a good thing to have. As a substitute, you can set candidates_per_segment to a much larger number, e.g., 10000, which means you only get 1000 forks worst case. (I think you will still pay a heavy price due to Python's interpretive nature.) As you push candidates per segment closer and closer to 1e7/4, you approach the exact behavior you have with your current Python code.
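
A rough Python analogue of the draft/team pattern (again my sketch, with an illustrative primes_in_segment helper, not the author's code) can be built on concurrent.futures: each submit plays the role of a draft, and gathering the results plays the role of waiting on the team event. Note the much larger segment size, per the advice above about fork overhead:

    import math
    from concurrent.futures import ProcessPoolExecutor

    def primes_in_segment(lower, upper):
        """The 'grain' body: count primes in [lower, upper] by trial division."""
        count = 0
        for candidate in range(max(lower, 2), upper + 1):
            if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
                count += 1
        return count

    if __name__ == "__main__":
        upper_limit = 10000000
        candidates_per_segment = 10000  # far larger than 10, to keep fork overhead down
        with ProcessPoolExecutor() as pool:
            # each submit() is the analogue of a PARLANSE draft; summing the
            # results is the analogue of waiting on the team event
            futures = [pool.submit(primes_in_segment, lo,
                                   min(lo + candidates_per_segment - 1, upper_limit))
                       for lo in range(1, upper_limit + 1, candidates_per_segment)]
            print('Number of primes found:', sum(f.result() for f in futures))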


You will not get more parallelism than the number of cores/threads in your CPU. If you are getting a 3x speedup on a 4-core machine, that is very good: you have only a little overhead. I would suggest that on a 4-core machine you set the "number of parallel processes" to 4 to reduce that overhead. Now, if you were running this on a 16-core machine, a speedup of only 3x would seem low; I would look closely at the Python multiprocessing library, in particular how it runs its processes.
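
For example (a minimal sketch, assuming you want to match the pool to the hardware rather than ask the user):

    import multiprocessing as mp

    # cpu_count() reports hardware threads, so a 4-core hyperthreaded machine
    # returns 8; for CPU-bound work the 4 physical cores are the better target.
    numProc = mp.cpu_count()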

What were your results with numProc == 4?

Amdahl's Law applies here, but only a very small portion of your parallel program is sequential (basically the setup in main), since the work is parallelized fairly evenly across the integer ranges.
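
As a back-of-the-envelope illustration (my arithmetic, using the timings from the question), you can solve Amdahl's Law, S = 1 / ((1 - p) + p/N), for the parallel fraction p:

    # Amdahl's Law: S = 1 / ((1 - p) + p / N); solve for the parallel fraction p.
    N = 8                          # worker processes requested
    S = 162.196 / 49.412           # observed speedup, about 3.28x
    p = (1 - 1 / S) / (1 - 1 / N)  # comes out to about 0.79
    print(f"parallel fraction ~= {p:.2f}, apparent serial fraction ~= {1 - p:.2f}")

Keep in mind this naive fit lumps the load imbalance described in the other answer and the hyperthreading limit (only 4 physical cores behind the 8 processes) into the "serial" fraction; with N = 4 the same fit gives an apparent serial fraction of only about 7%.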

