The point M. A. D. made about fmod is a good one, but there is no need to call out to C; we can do better staying in Haskell land (and the ticket the related thread was about has meanwhile been fixed). The problem in
fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b
                         then a
                         else let q = a / b
                              in b * (q - fromIntegral (floor q))
is that type defaulting causes floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) to be used. Integer is a comparatively complicated type with slow operations, and the conversion from Integer to Double is also relatively expensive. The original code (with parameters k = l = 200 and m = 5000) produced the statistics
./nstdmap +RTS -s -N2 > /dev/null

  60,601,075,392 bytes allocated in the heap
  36,832,004,184 bytes copied during GC
       2,435,272 bytes maximum residency (13741 sample(s))
         887,768 bytes maximum slop
               9 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46734 colls, 46734 par   41.66s   20.87s     0.0004s    0.0058s
  Gen  1     13741 colls, 13740 par   23.18s   11.62s     0.0008s    0.0041s

  Parallel GC work balance: 60.58% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   34.99s  ( 17.60s elapsed)
  GC      time   64.85s  ( 32.49s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   99.84s  ( 50.08s elapsed)

  Alloc rate    1,732,048,869 bytes per MUT second

  Productivity  35.0% of total user, 69.9% of total elapsed
on my machine (-N2 because I only have two physical cores). The simple change of adding a type signature, floor q :: Int, brings that down to
./nstdmap +RTS -s -N2 > /dev/null

  52,105,495,488 bytes allocated in the heap
  29,957,007,208 bytes copied during GC
       2,440,568 bytes maximum residency (10481 sample(s))
         893,224 bytes maximum slop
               8 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     36979 colls, 36979 par   32.96s   16.51s     0.0004s    0.0066s
  Gen  1     10481 colls, 10480 par   16.65s    8.34s     0.0008s    0.0018s

  Parallel GC work balance: 68.64% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   29.78s  ( 14.94s elapsed)
  GC      time   49.61s  ( 24.85s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   79.40s  ( 39.80s elapsed)

  Alloc rate    1,749,864,775 bytes per MUT second

  Productivity  37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in total time and 13% in MUT. Not bad. If we look at the code for floor that is used with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n   -> I# (n -# 1#)
        | otherwise             -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0# ->
          case negateInt# e of
            s | s ># 52#  -> if m < 0 then (-1) else 0
              | otherwise ->
                case TO64 m of
                  n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e
floor :: Double -> Int uses just the machine conversion, while floor :: Double -> Integer needs the expensive decodeDoubleInteger and more branching. But where floor is called here, we know that all the Doubles involved are non-negative, so floor is the same as truncate, which maps directly to the machine conversion double2Int#; so let's try truncate instead of floor.
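For concreteness, here is a sketch of the adjusted fmod (only the final expression differs from the version above; the truncate annotation and the comments are mine):

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b
                         then a
                         -- q is non-negative here, so truncate agrees with floor
                         -- and compiles down to the machine conversion double2Int#
                         else let q = a / b
                              in b * (q - fromIntegral (truncate q :: Int))

With that change, the timings come out as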
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   29.29s  ( 14.70s elapsed)
  GC      time   49.17s  ( 24.62s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   78.45s  ( 39.32s elapsed)
a really small reduction (as expected, fmod is not really the bottleneck). For comparison, calling out to C:
  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   31.46s  ( 15.78s elapsed)
  GC      time   54.05s  ( 27.06s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   85.52s  ( 42.85s elapsed)
a bit slower (not surprisingly: you can execute quite a few primops in the time a call out to C takes).
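For reference, the call out to C can be done with a plain FFI binding along these lines (a sketch; not necessarily the exact binding used for the measurement above):

{-# LANGUAGE ForeignFunctionInterface #-}

-- Bind directly to fmod from the C math library; 'unsafe' avoids the
-- safe-call bookkeeping but still pays for leaving Haskell code.
foreign import ccall unsafe "math.h fmod"
    c_fmod :: Double -> Double -> Double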
But that is not where the big fish swim. The bad news is that picking only every m-th element of the trajectories leads to huge thunks that cause a lot of allocation and take a long time to evaluate when their turn finally comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial@(!a, !b) = initial : (trajectory map $ map initial)
That significantly reduces allocation and GC time, and as a consequence also the MUT time:
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   21.83s  ( 10.95s elapsed)
  GC      time    0.72s  (  0.36s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   22.55s  ( 11.31s elapsed)
with the original fmod, and
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   18.26s  (  9.18s elapsed)
  GC      time    0.58s  (  0.29s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.84s  (  9.47s elapsed)
with floor q :: Int, and within measuring accuracy the same time for truncate q :: Int (the allocation figures are slightly lower for truncate).
"The problem is the number of overflowed SPARKS, which is 0 for K = 8000 and 7802 for K = 16000. This is probably reflected in the poor concurrency."
Yes (although, as far as I know, "parallelism" would be the more appropriate term here than "concurrency"): there is a spark pool, and when it is full, any further sparks are not scheduled for evaluation in whatever thread next has spare time when their turn comes; the computation is then done without parallelism, in the parent thread. In this case that means that after an initial parallel phase, the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
If you watch the CPU load via top, you will see that after a while it drops from (close to 100%) * (number of cores) to a much lower value (for me it was ~100% with -N2 and ~130% with -N4).
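To see that behaviour in isolation, here is a tiny self-contained sketch (not the original program; the element count and the work function are made up) that sparks one element per list item and therefore exceeds the ~8K pool:

import Control.Parallel.Strategies (parList, rseq, withStrategy)

-- Compile with -threaded and run with +RTS -N2 -s: the SPARKS line will
-- typically show a large "overflowed" count, because 16000 sparks do not
-- all fit into the pool; the excess is evaluated sequentially by the parent.
main :: IO ()
main = print . sum $ withStrategy (parList rseq) (map work [1 .. 16000 :: Int])
  where
    work n = sum [1 .. n]   -- stand-in for evaluating one trajectory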
The cure is to avoid sparking too much and to let each spark do more work. With the quick and dirty modification
ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
I get back to 200% with -N2 for practically the entire run, and decent performance,
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   57.42s  ( 29.02s elapsed)
  GC      time    5.34s  (  2.69s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   62.76s  ( 31.71s elapsed)

  Alloc rate    1,982,155,167 bytes per MUT second

  Productivity  91.5% of total user, 181.1% of total elapsed
and with -N4 it is also fine (even a little faster on the wall clock, though not by much, since all threads do basically the same thing and I only have 2 physical cores),
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   99.17s  ( 26.31s elapsed)
  GC      time   16.18s  (  4.80s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  115.36s  ( 31.12s elapsed)

  Alloc rate    1,147,619,609 bytes per MUT second

  Productivity  86.0% of total user, 318.7% of total elapsed
since now the spark pool no longer overflows.
The proper fix is to make the chunk size a parameter computed from the number of trajectories and the available cores, so that the number of sparks does not exceed the pool size.
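A minimal sketch of that fix, under the assumption that the trajectories are produced as a plain list (getNumCapabilities, parListChunk and rdeepseq are real library functions; the helper names and the target spark count are mine):

import Control.Concurrent (getNumCapabilities)
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Choose a chunk size so that the number of sparks (trajectories / chunk size)
-- stays well below the ~8K spark pool while still giving every capability
-- several chunks for load balancing.
chunkSize :: Int  -- number of trajectories
          -> Int  -- number of capabilities
          -> Int
chunkSize nTraj caps = max 1 (nTraj `quot` targetSparks)
  where
    targetSparks = min 4096 (caps * 8)

parEnsemble :: NFData b => (a -> b) -> [a] -> IO [b]
parEnsemble f xs = do
    caps <- getNumCapabilities
    let cs = chunkSize (length xs) caps
    return $ withStrategy (parListChunk cs rdeepseq) (map f xs)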