Parallel Haskell to find the divisors of a huge number

I wrote the following program in Parallel Haskell to find the divisors of 1,000,000,000.

    import Control.Parallel

    parfindDivisors :: Integer -> [Integer]
    parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
        where f1  = filter g [1..(quot n 4)]
              f2  = filter g [(quot n 4)+1..(quot n 2)]
              g z = n `rem` z == 0

    main = print (parfindDivisors 1000000000)

I compiled the program using ghc -rtsopts -threaded findDivisors.hs and I ran it with: findDivisors.exe +RTS -s -N2 -RTS

I measured about a 50% speedup compared to the simple version, which is as follows:

    findDivisors :: Integer -> [Integer]
    findDivisors n = filter g [1..(quot n 2)]
        where g z = n `rem` z == 0

My processor is a dual-core Intel Core 2 Duo. I was wondering whether the code above can be improved, because the statistics the program prints say:

    Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)
    SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

What do converted, overflowed, dud, GC'd and fizzled mean, and how can this information help me improve the running time?

+4
3 answers

IMO, the Par monad makes it easier to reason about parallelism. It sits at a slightly higher level than par and pseq.

Here is parfindDivisors rewritten using the Par monad. Note that it is essentially the same as your algorithm:

    import Control.Monad.Par

    findDivisors :: Integer -> [Integer]
    findDivisors n = runPar $ do
        [f0, f1]   <- sequence [new, new]
        fork $ put f0 (filter g [1..(quot n 4)])
        fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
        [f0', f1'] <- sequence [get f0, get f1]
        return $ f0' ++ f1'
      where g z = n `rem` z == 0

Compiling with -O2 -threaded -rtsopts -eventlog and running with +RTS -N2 -s gives the following relevant runtime statistics:

     36,000,130,784 bytes allocated in the heap
          3,165,440 bytes copied during GC
             48,464 bytes maximum residency (1 sample(s))

                                        Tot time (elapsed)  Avg pause  Max pause
    Gen  0     35162 colls, 35161 par    0.39s    0.32s     0.0000s    0.0006s
    Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

    Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)

    MUT     time   42.68s  ( 21.48s elapsed)
    GC      time    0.39s  (  0.32s elapsed)
    Total   time   43.07s  ( 21.80s elapsed)

    Alloc rate    843,407,880 bytes per MUT second

    Productivity  99.1% of total user, 195.8% of total elapsed

Productivity is already very high. To improve the parallel GC work balance a little, we can increase the size of the GC allocation area; run with +RTS -N2 -s -A128M, for example:

     36,000,131,336 bytes allocated in the heap
             47,088 bytes copied during GC
             49,808 bytes maximum residency (1 sample(s))

                                        Tot time (elapsed)  Avg pause  Max pause
    Gen  0       135 colls,   134 par    0.19s    0.10s     0.0007s    0.0009s
    Gen  1         1 colls,     1 par    0.00s    0.00s     0.0010s    0.0010s

    Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)

    MUT     time   42.65s  ( 21.49s elapsed)
    GC      time    0.20s  (  0.10s elapsed)
    Total   time   42.85s  ( 21.59s elapsed)

    Alloc rate    843,925,806 bytes per MUT second

    Productivity  99.5% of total user, 197.5% of total elapsed

But it's really just nitpicking. The real story comes from ThreadScope:

[ThreadScope profile: both cores show near-full utilization]

Utilization is essentially maxed out on both cores, so there is probably not much more parallelization to be gained here (on two cores).

There are some good notes about the Par monad here.

UPDATE

Your alternative algorithm rewritten with Par looks something like this:

    findDivisors :: Integer -> [Integer]
    findDivisors n =
        let sqrtn = floor (sqrt (fromInteger n))
        in runPar $ do
            [a, b] <- sequence [new, new]
            fork $ put a [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
            firstDivs <- get a
            fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
            secondDivs <- get b
            return $ firstDivs ++ secondDivs

But you are right that this won't get any benefit from parallelism, because of the dependency on firstDivs.

What you can do here is get parallelism from Strategies, by evaluating the elements of the lists in parallel. Something like:

    import Control.Monad.Par
    import Control.Parallel.Strategies

    findDivisors :: Integer -> [Integer]
    findDivisors n =
        let sqrtn = floor (sqrt (fromInteger n))
        in runPar $ do
            [a, b] <- sequence [new, new]
            fork $ put a ([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
                            `using` parListChunk 2 rdeepseq)
            firstDivs <- get a
            fork $ put b ([n `quot` x | x <- firstDivs, x /= sqrtn]
                            `using` parListChunk 2 rdeepseq)
            secondDivs <- get b
            return $ firstDivs ++ secondDivs

and running this gives statistics like:

          3,388,800 bytes allocated in the heap
             43,656 bytes copied during GC
             68,032 bytes maximum residency (1 sample(s))

                                        Tot time (elapsed)  Avg pause  Max pause
    Gen  0         5 colls,     4 par    0.00s    0.00s     0.0000s    0.0001s
    Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

    Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)

                        MUT time (elapsed)       GC time  (elapsed)
    Task  0 (worker) :    0.01s  (  0.01s)       0.00s    (  0.00s)
    Task  1 (worker) :    0.01s  (  0.01s)       0.00s    (  0.00s)
    Task  2 (bound)  :    0.01s  (  0.01s)       0.00s    (  0.00s)
    Task  3 (worker) :    0.01s  (  0.01s)       0.00s    (  0.00s)

    SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

    MUT     time    0.01s  (  0.00s elapsed)
    GC      time    0.00s  (  0.00s elapsed)
    Total   time    0.01s  (  0.01s elapsed)

    Alloc rate    501,672,834 bytes per MUT second

    Productivity  85.0% of total user, 95.2% of total elapsed

Almost all 50 sparks were converted here, i.e. significant parallel work actually happened, but the computations are not large enough to see any wall-clock benefit from the parallelism. Any gains were probably offset by the overhead of scheduling the computations in the threaded runtime.
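
As an aside (my own observation, not part of the original answer): 10^9 = 2^9 * 5^9 has 100 divisors, so firstDivs and secondDivs each hold 50 elements, and parListChunk 2 therefore creates roughly 25 sparks per list, which lines up with the SPARKS: 50 figure above. If you want to experiment with the work-per-spark trade-off, a minimal sketch that simply exposes the chunk size as a parameter (the name divisorsChunked is hypothetical) might look like:

    import Control.Parallel.Strategies

    -- Sketch: same trial division up to sqrt n, but with the chunk size as a
    -- parameter, so each spark carries more elements relative to its overhead.
    divisorsChunked :: Int -> Integer -> [Integer]
    divisorsChunked chunkSize n = firstDivs ++ secondDivs
      where
        sqrtn      = floor (sqrt (fromInteger n))
        firstDivs  = [d | (d, r) <- [quotRem n x | x <- [1..sqrtn]], r == 0]
                       `using` parListChunk chunkSize rdeepseq
        secondDivs = [n `quot` x | x <- firstDivs, x /= sqrtn]
                       `using` parListChunk chunkSize rdeepseq

Whether a larger chunk actually helps depends on how much of the work is forced while the list spine is built, so treat it purely as something to measure.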

+2

I think this page explains this better than I could:

http://www.haskell.org/haskellwiki/ThreadScope_Tour/SparkOverview

I also found these slides interesting:

http://haskellwiki.gitit.net/Upload/HIW2011-Talk-Coutts.pdf
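
To make the "converted vs. fizzled" distinction from those pages concrete, here is a tiny self-contained sketch (the names expensive, fizzles and converts are purely illustrative): a spark fizzles when the main thread gets to the sparked expression before a worker does, and the usual fix is to pseq the other operand first.

    import Control.Parallel (par, pseq)

    expensive :: Integer -> Integer
    expensive k = sum [1..k]

    -- The spark for x often fizzles: (+) may demand x on the main thread
    -- before a worker picks the spark up.
    fizzles :: Integer
    fizzles = let x = expensive 20000000
                  y = expensive 20000001
              in x `par` (x + y)

    -- Forcing y first gives the spark for x time to be converted on the
    -- other core.
    converts :: Integer
    converts = let x = expensive 20000000
                   y = expensive 20000001
               in x `par` (y `pseq` (x + y))

    main :: IO ()
    main = print (fizzles, converts)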

+1

By modifying the source code as follows, I got a better speedup, but I think this code cannot be parallelized:

    findDivisors2 :: Integer -> [Integer]
    findDivisors2 n =
        let firstDivs  = [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
            secondDivs = [n `quot` x | x <- firstDivs, x /= sqrtn]
            sqrtn      = floor (sqrt (fromInteger n))
        in firstDivs ++ secondDivs

I tried to parallelize the code as follows:

    import Control.Parallel (par)

    parfindDivisors2 :: Integer -> [Integer]
    parfindDivisors2 n =
        let firstDivs  = [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
            secondDivs = [n `quot` x | x <- firstDivs, x /= sqrtn]
            sqrtn      = floor (sqrt (fromInteger n))
        in secondDivs `par` firstDivs ++ secondDivs

Instead of reducing the time, this doubled it. I think this is because findDivisors2 has a strong data dependency, whereas the first findDivisors algorithm does not.
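
One thing that might be worth trying (purely a sketch on my part, not something benchmarked in this thread): only secondDivs depends on firstDivs, so the trial-division range [1..sqrtn] itself can be split into two independent halves and evaluated in parallel with Strategies, combining them before deriving secondDivs. The name parfindDivisors3 and the half/half split are just illustrative:

    import Control.Parallel.Strategies

    parfindDivisors3 :: Integer -> [Integer]
    parfindDivisors3 n = firstDivs ++ secondDivs
      where
        sqrtn        = floor (sqrt (fromInteger n))
        half         = sqrtn `quot` 2
        -- divisors of n found by trial division over [lo..hi]
        divsIn lo hi = [d | (d, r) <- [quotRem n x | x <- [lo..hi]], r == 0]
        -- the two halves of the range are independent, so evaluate them
        -- in parallel and fully (rdeepseq), one per spark
        (lowDivs, highDivs) =
            (divsIn 1 half, divsIn (half + 1) sqrtn)
              `using` parTuple2 rdeepseq rdeepseq
        firstDivs  = lowDivs ++ highDivs
        secondDivs = [n `quot` x | x <- firstDivs, x /= sqrtn]

As with the original parfindDivisors, one of the two sparks may still fizzle if the main thread reaches it first, but the other half can be converted on the second core.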

Any comments are welcome.

0

Source: https://habr.com/ru/post/1434802/

