Haskell: unexpected time complexity when calculating with large lists

Question

Haskell: unexpected time complexity when calculating with large lists

I am dealing with a calculation having as an intermediate result the list A = [B], which is a list K of lists of length L. The time complexity for calculating the element B is controlled by the parameter M and theoretically linearly in M. Theoretically, I would expect time complexity computing A will be equal to O (K * L * M). However, this is not so, and I do not understand why?

Here is a simple complete sketch program that shows the problem I explained.

import System.Random (randoms, mkStdGen) import Control.Parallel.Strategies (parMap, rdeepseq) import Control.DeepSeq (NFData) import Data.List (transpose) type Point = (Double, Double) fmod :: Double -> Double -> Double fmod ab | a < 0 = b - fmod (abs a) b | otherwise = if a < b then a else let q = a / b in b * (q - fromIntegral (floor q)) standardMap :: Double -> Point -> Point standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi)) trajectory :: (Point -> Point) -> Point -> [Point] trajectory map initial = initial : (trajectory map $ map initial) justEvery :: Int -> [a] -> [a] justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs) justEvery _ [] = [] subTrace :: Int -> Int -> [a] -> [a] subTrace nm = take (n + 1) . justEvery m ensemble :: Int -> [Point] ensemble n = let qs = randoms (mkStdGen 42) ps = randoms (mkStdGen 21) in take n $ zip qs ps ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) -> Int -> Int -> [Point] -> [[a]] ensembleTrace orbitGen observable nm = parMap rdeepseq ((map observable . subTrace nm) . orbitGen) main = let k = 100 l = 100 m = 100 orbitGen = trajectory (standardMap 7) observable (p, q) = p^2 - q^2 initials = ensemble k mean xs = (sum xs) / (fromIntegral $ length xs) result = (map mean) $ transpose $ ensembleTrace orbitGen observable lm $ initials in mapM_ print result

I am compiling with

 $ ghc -O2 stdmap.hs -threaded

and launch using

 $ ./stdmap +RTS -N4 > /dev/null

on Intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1 and get the following results for different sets of parameters K, L, M (k, l, m in the program code)

 (K=200,L=200,N=200) -> real 0m0.774s user 0m2.856s sys 0m0.147s (K=2000,L=200,M=200) -> real 0m7.409s user 0m28.102s sys 0m1.080s (K=200,L=2000,M=200) -> real 0m7.326s user 0m27.932s sys 0m1.020s (K=200,L=200,M=2000) -> real 0m10.581s user 0m38.564s sys 0m3.376s (K=20000,L=200,M=200) -> real 4m22.156s user 7m30.007s sys 0m40.321s (K=200,L=20000,M=200) -> real 1m16.222s user 4m45.891s sys 0m15.812s (K=200,L=200,M=20000) -> real 8m15.060s user 23m10.909s sys 9m24.450s

I don’t quite understand where the problem of such pure scaling is. If I understand correctly, lists are lazy and should not be built as they are consumed in the head-tail direction? As can be seen from the measurements, there is a correlation between excessive consumption in real time and excessive consumption of system time, since the excess will be on the system account. But if there is some memory management time, it should still scale linearly in K, L, M.

Help!

EDIT

I made changes to the code in accordance with the suggestions of Daniel Fisher, who really decided poor scaling with respect to M. As indicated, forcing a strict assessment of the trajectory, we avoid the construction of large tricks. I understand the performance improvement behind this, but I still don’t understand the poor scaling of the source code, because (if I understand correctly) the spatio-temporal complexity of the thunk construct should be linear in M?

Also, I'm still having problems understanding poor scaling with respect to K (ensemble size). I performed two additional measurements with an improved code for K = 8000 and K = 16000, keeping L = 200, M = 200. Scaling to K = 8000 is as expected, but for K = 16000 it is already abnormal. It seems that the problem is related to the number of overflowed SPARKS , which is 0 for K = 8000 and 7802 for K = 16000. This is probably reflected in poor concurrency, which I define as the coefficient Q = (MUT cpu time) / (MUT real time) which would ideally be equal to the number of CPU-s. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000. Please help me understand the origin of this problem and possible solutions.

 K = 8000: $ ghc -O2 stmap.hs -threaded -XBangPatterns $ ./stmap +RTS -s -N4 > /dev/null 56,905,405,184 bytes allocated in the heap 503,501,680 bytes copied during GC 53,781,168 bytes maximum residency (15 sample(s)) 6,289,112 bytes maximum slop 151 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s Parallel GC work balance: 23.49% (serial 0%, perfect 100%) TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4) SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.00s ( 0.00s elapsed) MUT time 95.90s ( 24.28s elapsed) GC time 9.04s ( 2.29s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 104.95s ( 26.58s elapsed) Alloc rate 593,366,811 bytes per MUT second Productivity 91.4% of total user, 360.9% of total elapsed gc_alloc_block_sync: 315819

and

 K = 16000: $ ghc -O2 stmap.hs -threaded -XBangPatterns $ ./stmap +RTS -s -N4 > /dev/null 113,809,786,848 bytes allocated in the heap 1,156,991,152 bytes copied during GC 114,778,896 bytes maximum residency (18 sample(s)) 11,124,592 bytes maximum slop 300 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s Parallel GC work balance: 18.05% (serial 0%, perfect 100%) TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4) SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.00s ( 0.00s elapsed) MUT time 221.77s (139.78s elapsed) GC time 25.56s ( 7.32s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 247.34s (147.10s elapsed) Alloc rate 513,176,874 bytes per MUT second Productivity 89.7% of total user, 150.8% of total elapsed gc_alloc_block_sync: 814824

+4

performance list time-complexity haskell ghc

Benjamin batistic Nov 08 '12 at 16:44

source share

2 answers

After some quick profiling, I discovered that these are serial intruders:

 ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp ./MainNonOpt +RTS -N4 -p > /dev/null >>> COST CENTRE MODULE %time %alloc fmod Main 46.3 33.3 standardMap Main 28.5 0.0 trajectory Main 23.8 66.6

What is surprising is that fmod is a large number of distributions, which it considers, basically, it is a numerical function. So the next step is to fmod annotation to find out where the problem is:

 fmod :: Double -> Double -> Double fmod ab | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b | otherwise = {-# SCC "posbranch" #-} if a < b then a else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))

This gives us:

 ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp ./MainNonOpt +RTS -N4 -p > /dev/null COST CENTRE MODULE %time %alloc MAIN MAIN 61.5 70.0 posbranch Main 16.6 0.0 floor Main 14.9 30.0 expression Main 4.5 0.0 negbranch Main 1.9 0.0

So the floor bit is the one that causes the problems. Looking back, it turns out that Prelude does not implement some Double RealFrac features as best as possible (see here ), probably causing some boxing / unpacking.

Thus, following the recommendations of the link, I used a modified version of the floor, which also made the call fromIntegral unnecessary:

 floor' :: Double -> Double floor' x = c_floor x {-# INLINE floor' #-} foreign import ccall unsafe "math.h floor" c_floor :: Double -> Double fmod :: Double -> Double -> Double fmod ab | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b | otherwise = {-# SCC "posbranch" #-} if a < b then a else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))

EDIT : As Daniel Fisher notes, there is no need to embed C code to improve performance. A similar Haskell function already exists. I will leave an answer anyway, for further reference.

It matters. On my machine, for k = l = 200, M = 5000 is the number for the non-optimized and optimized version:

Not optimized:

 real 0m20.635s user 1m17.321s sys 0m4.980s

Optimized:

 real 0m14.858s user 0m55.271s sys 0m3.815s

The trajectory function may have similar problems, and you can use profiling as used above to indicate the problem.

An excellent starting point for profiling in Haskell can be found in this chapter of Real World Haskell.

+2

Marius danila Nov 08 '12 at 23:41

source share

Daniel Fischer · Accepted Answer · 2012-11-09T01:37:15+0000

The M. AD point near fmod is good, but there is no need to call in C, and we can better stop on the Haskell land (the ticket related flow was roughly fixed). The problem is in

 fmod :: Double -> Double -> Double fmod ab | a < 0 = b - fmod (abs a) b | otherwise = if a < b then a else let q = a / b in b * (q - fromIntegral (floor q))

is that the default type causes floor :: Double -> Integer (and therefore fromIntegral :: Integer -> Double ) to be called. Integer is now a relatively complex type with slow operations, and converting from Integer to Double also relatively complicated. The source code (with parameters k = l = 200 and m = 5000 ) created statistics

 ./nstdmap +RTS -s -N2 > /dev/null 60,601,075,392 bytes allocated in the heap 36,832,004,184 bytes copied during GC 2,435,272 bytes maximum residency (13741 sample(s)) 887,768 bytes maximum slop 9 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s Parallel GC work balance: 60.58% (serial 0%, perfect 100%) TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2) SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.00s ( 0.00s elapsed) MUT time 34.99s ( 17.60s elapsed) GC time 64.85s ( 32.49s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 99.84s ( 50.08s elapsed) Alloc rate 1,732,048,869 bytes per MUT second Productivity 35.0% of total user, 69.9% of total elapsed

on my machine ( -N2 because I only have two physical cores). A simple code change to use a signature like floor q :: Int results in up to

 ./nstdmap +RTS -s -N2 > /dev/null 52,105,495,488 bytes allocated in the heap 29,957,007,208 bytes copied during GC 2,440,568 bytes maximum residency (10481 sample(s)) 893,224 bytes maximum slop 8 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s Parallel GC work balance: 68.64% (serial 0%, perfect 100%) TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2) SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.01s ( 0.01s elapsed) MUT time 29.78s ( 14.94s elapsed) GC time 49.61s ( 24.85s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 79.40s ( 39.80s elapsed) Alloc rate 1,749,864,775 bytes per MUT second Productivity 37.5% of total user, 74.8% of total elapsed

a decrease of about 20% over the past time, 13% in MUT. Not bad. If we look at the code for floor that you get with optimization, we can understand why:

 floorDoubleInt :: Double -> Int floorDoubleInt (D# x) = case double2Int# x of n | x <## int2Double# n -> I# (n -# 1#) | otherwise -> I# n floorDoubleInteger :: Double -> Integer floorDoubleInteger (D# x) = case decodeDoubleInteger x of (# m, e #) | e <# 0# -> case negateInt# e of s | s ># 52# -> if m < 0 then (-1) else 0 | otherwise -> case TO64 m of n -> FROM64 (n `uncheckedIShiftRA64#` s) | otherwise -> shiftLInteger me

floor :: Double -> Int just uses machine conversion, and floor :: Double -> Integer requires expensive decodeDoubleInteger and more branches. But where floor is called here, we know that all involved Double non-negative, so floor same as truncate , which directly translates to machine conversion double2Int# , so try instead of floor :

  INIT time 0.00s ( 0.00s elapsed) MUT time 29.29s ( 14.70s elapsed) GC time 49.17s ( 24.62s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 78.45s ( 39.32s elapsed)

a really small reduction (you can expect fmod is not really a bottleneck). For comparison, calling C:

  INIT time 0.01s ( 0.01s elapsed) MUT time 31.46s ( 15.78s elapsed) GC time 54.05s ( 27.06s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 85.52s ( 42.85s elapsed)

a bit slower (no wonder you can execute multiple primitives during a C call).

But this is not where the big fish swims. The bad news is that selecting only every m element of the paths leads to big tricks that cause a lot of highlighting and take time to evaluate when the time comes. Therefore, eliminate this leak and make the trajectories strict:

 {-# LANGUAGE BangPatterns #-} trajectory :: (Point -> Point) -> Point -> [Point] trajectory map !initial@ (!a,!b) = initial : (trajectory map $ map initial)

This significantly reduces the distribution and time of the GC, and, as a result, the time of the MUT:

  INIT time 0.00s ( 0.00s elapsed) MUT time 21.83s ( 10.95s elapsed) GC time 0.72s ( 0.36s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 22.55s ( 11.31s elapsed)

with the original fmod ,

  INIT time 0.00s ( 0.00s elapsed) MUT time 18.26s ( 9.18s elapsed) GC time 0.58s ( 0.29s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 18.84s ( 9.47s elapsed)

with floor q :: Int and within the measurement accuracy at the same time for truncate q :: Int (distribution figures are slightly lower for truncate ).

The problem is the number of overloaded SPARKS, which is 0 for K = 8000 and 7802 for K = 16000. This is probably reflected in poor concurrency

Yes (although, as far as I know, parallelism instead of concurrency is the more correct term), there is a spark pool, and when it is complete, any further sparks are not planned for evaluation in any thread next has a time when its turn comes, then the calculation is performed without parallelism, from the parent thread. In this case, this means that after the initial parallel phase, the calculation returns to sequential.

The size of the spark pool, apparently, is about 8 K (2 ^ 13).

If you watch the processor load from above, you will see that after a while it will fall from (close to 100%)*(number of cores) to a much lower value (for me it was ~ 100% with -N2 and ~ 130% with -N4 ).

We will cure ourselves to avoid too much sparking, and to allow each spark to do even more work. With a quick and dirty modification

 ensembleTrace orbitGen observable nm = withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace nm) . orbitGen)

I will return to 200% with -N2 for almost the entire run and good performance,

  INIT time 0.00s ( 0.00s elapsed) MUT time 57.42s ( 29.02s elapsed) GC time 5.34s ( 2.69s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 62.76s ( 31.71s elapsed) Alloc rate 1,982,155,167 bytes per MUT second Productivity 91.5% of total user, 181.1% of total elapsed

and with -N4 it is also thin (even a little faster on the wall clock - not so much, because all the threads are basically the same, and I only have 2 physical cores)

  INIT time 0.00s ( 0.00s elapsed) MUT time 99.17s ( 26.31s elapsed) GC time 16.18s ( 4.80s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 115.36s ( 31.12s elapsed) Alloc rate 1,147,619,609 bytes per MUT second Productivity 86.0% of total user, 318.7% of total elapsed

since now the spark beam does not overflow.

The correct correction is to make the size of the pieces a parameter, which is calculated from the number of trajectories and available cores, so that the number of sparks does not exceed the pool size.

Haskell: unexpected time complexity when calculating with large lists

More articles: