I think you want to split the work differently.
Although your program divides the range of candidate integers evenly across the cores, the work within each range is unlikely to be even. That means some cores finish early and have nothing to do while others are still working, which kills parallel efficiency.
To see why, imagine you have 1000 cores. The first core sees only very small candidate numbers, which don't take long to factor, and then it sits idle. The last (thousandth) core sees only very big candidate numbers, which take much longer to factor, so it keeps running while the first core sits idle. Wasted cycles. The same thing happens, less dramatically, with 4 cores.
What you want to do, when the amount of work per chunk is unknown, is hand all the cores modest-sized chunks, many more chunks than there are cores. Then the cores can finish unevenly, and each core that finishes simply goes back to pick up a little more work. This is essentially a worklist algorithm. You still end up with unevenness at the very end, but only on small chunks, so not much is wasted.
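Here is a minimal sketch of that idea in Python (my sketch, untested against your code; multiprocessing.Pool's internal task queue already behaves as exactly such a worklist, and do_chunk with its squaring body is just a stand-in for the real per-chunk work):

    from multiprocessing import Pool

    def do_chunk(bounds):
        # stand-in worker: replace the body with the real per-chunk computation
        lower, upper = bounds
        return sum(i * i for i in range(lower, upper))

    if __name__ == '__main__':
        # many more modest chunks than cores; a core that finishes early
        # simply pulls the next chunk from the pool's internal queue
        chunks = [(i, min(i + 1000, 10**6)) for i in range(0, 10**6, 1000)]
        with Pool() as pool:  # defaults to one worker process per core
            total = sum(pool.imap_unordered(do_chunk, chunks))
        print(total)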
I am not a Python programmer, so I coded the solution in PARLANSE instead.
    (includeunique `Console.par')
    (includeunique `Timer.par')

    (define upper_limit 10000000)
    (define candidates_per_segment 10)
    (define candidates_per_segment2 (constant (* candidates_per_segment 2)))

    (= [prime_count natural] 0)
    [prime_finding_team team]

    (define primes_in_segment
       (action (procedure [lower natural] [upper natural])
          (;;
             (do [candidate natural] lower upper 2
                (block test_primality
                   (local (= [divisor natural] 3)
                      (;;
                         (while (< (* divisor divisor) candidate)
                            (ifthenelse (== (modulo candidate divisor) 0)
                               (exitblock test_primality)
                               (+= divisor 2)
                            )ifthenelse
                         )while
                         (ifthen (~= (* divisor divisor) candidate)
                            (consume (atomic+= prime_count))
                         )ifthen
                      );;
                   )local
                )block
             )do
          );;
       )action
    )define

    (define main
       (action (procedure void)
          (local [timer Timer:Timer]
             (;;
                (Console:Put (. `Number of primes found: '))
                (Timer:Reset (. timer))
                (do [i natural] 1 upper_limit candidates_per_segment2
                   (consume (draft prime_finding_team primes_in_segment
                               `lower':i
                               `upper':(minimum upper_limit (- (+ i candidates_per_segment2) 2))))
                )do
                (consume (wait (@ (event prime_finding_team))))
                (Timer:Stop (. timer))
                (Console:PutNatural prime_count)
                (Console:PutNewline)
                (Timer:PrintElapsedTime (. timer) (. `Parallel computed in '))
                (Console:PutNewline)
             );;
          )local
       )action
    )define
PARLANSE looks like LISP, but works and compiles more like C.
The worker is primes_in_segment; it takes a range of candidate values defined by its parameters lower and upper. It tries each candidate in that range and atomically increments the shared prime_count if the candidate turns out to be prime.
The full range is broken into small packets of ranges (sequences of odd numbers) by the do loop in main. The parallelism happens at the draft construct, which creates a parallel-executing grain of computation (not a Windows thread) and adds it to prime_finding_team, an aggregate set of work items representing all the prime factoring. (The purpose of a team is to let all this work be managed as a unit, e.g., destroyed if necessary; that isn't needed in this program.) The arguments to draft are the function to be run by the forked grain and its parameters. The work is carried out by a PARLANSE-managed set of (Windows) threads using a work-stealing algorithm. If there is already too much work, PARLANSE throttles the work-generating grains and spends its energy on grains that are pure computation.
One could pass just a single candidate value to each grain, but then there is more fork overhead per candidate and the total runtime gets correspondingly worse. We chose 10 empirically to ensure that the fork overhead per range of candidates is small; setting candidates_per_segment as high as 1000 doesn't buy much additional speedup.
The do loop in main just manufactures work as fast as it can. PARLANSE throttles the draft step when there is enough parallelism to be useful. The wait on the team event forces the main program to wait for all team members to complete.
We ran this on a six-core AMD Phenom II X6 1090T clocked at 3.2 GHz. The execution runs are shown below, first for 1 CPU:
    >run -p1 -v ..\teamprimes
    PARLANSE RTS: Version 19.1.53
Then for 6 CPUs (it scales nicely):
    >run -p6 -v ..\teamprimes
    PARLANSE RTS: Version 19.1.53
Note that the total CPU time for the parallel version is roughly the same as for the serial run; that's because both are doing the same amount of work.
Given Python's "fork" and "join" primitives, I'm sure there is a Python equivalent you can code up easily. It might run out of space or threads, though, because of the possibility of too many simultaneous forks. (With candidates_per_segment at 10, there are up to 1 million live grains running under PARLANSE.) This is where automatic throttling of work generation is a good thing to have. As a substitute, you can set candidates_per_segment to a much larger number, e.g., 10,000, which means you get at most 1,000 forks in the worst case. (I think you will still pay a heavy price for Python's interpreted execution.) And as you push candidates_per_segment closer and closer to 1e7/4, you approach exactly the behavior you have with your current Python code.
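For what it's worth, here is my sketch of that equivalent, extending the chunking sketch above (again from a non-Python programmer, so treat it as a starting point: primes_in_segment mirrors the PARLANSE worker, CANDIDATES_PER_SEGMENT is my name for the tuning knob, and Pool's task queue plays the role of the team's worklist):

    from multiprocessing import Pool

    UPPER_LIMIT = 10_000_000
    CANDIDATES_PER_SEGMENT = 10_000  # far above PARLANSE's 10: Python forks cost more

    def primes_in_segment(bounds):
        # count primes among the odd candidates in [lower, upper) by trial division
        lower, upper = bounds
        count = 0
        for candidate in range(lower | 1, upper, 2):   # odd candidates only
            divisor = 3
            while divisor * divisor < candidate:
                if candidate % divisor == 0:
                    break                              # composite: divisor found
                divisor += 2
            else:
                if divisor * divisor != candidate:     # reject squares like 49
                    count += 1
        return count  # local subtotal; the parent sums these

    if __name__ == '__main__':
        step = 2 * CANDIDATES_PER_SEGMENT              # integers spanned per segment
        segments = [(i, min(i + step, UPPER_LIMIT))
                    for i in range(1, UPPER_LIMIT, step)]
        with Pool() as pool:                           # one worker process per core
            total = sum(pool.imap_unordered(primes_in_segment, segments))
        print('Number of primes found:', total)

Rather than atomically bumping a shared counter the way the PARLANSE atomic+= does, each worker returns its local subtotal and the parent sums them; a cross-process shared counter in Python would need a multiprocessing.Value plus a lock on the hot path. Note also that, like the PARLANSE code, this scans only odd candidates starting at 1, so 2 is never tested while 1 sneaks through the test; the two errors happen to cancel in the final count.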