I'm not sure if it's worth noting, given @lehins' excellent answer, but ...
**Why your `pQuickSort` isn't working**
There are two big problems with your `pQuickSort`. First, you use `System.Random`, which is very slow and interacts strangely with parallel sorting (see below). Second, your `par u l` sparks a computation to evaluate:
```haskell
u = [x] ++ pQuicksort (n `div` 2) upper
```
only to WHNF, namely `u = x : UNEVALUATED_THUNK`, so your sparks don't do any real work.
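One common way to make a spark like that do real work (a hedged sketch of my own, not from the question; `parMerge` is a made-up helper name) is to force full evaluation of the sparked value with `Control.DeepSeq.force`, so that reducing it to WHNF actually sorts the whole sublist:

```haskell
import Control.DeepSeq (NFData, force)  -- deepseq ships with GHC
import GHC.Conc (par, pseq)             -- par/pseq are available from base

-- Sketch: spark the *fully evaluated* sorted upper half instead of a
-- thunk whose WHNF is just one cons cell. Evaluating `force xs` to WHNF
-- evaluates all of xs, so the spark performs the entire sort.
parMerge :: NFData a => ([a] -> [a]) -> [a] -> [a] -> [a]
parMerge sortFn lower upper =
  let l = force (sortFn lower)
      u = force (sortFn upper)
  in u `par` (l `pseq` (l ++ u))
```

Note that `force` costs an extra traversal; `Control.Parallel.Strategies` (e.g. `rdeepseq` with `using`) is the more idiomatic way to express the same idea.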
**Observing a speedup with a simple pseudo-quicksort**
In fact, it's not hard to observe a performance improvement when parallelizing a naive, not-in-place pseudo-quicksort. As mentioned, an important consideration is to avoid `System.Random`. With a fast LCG (linear congruential generator), we can benchmark the actual sorting time, rather than some weird mixture of sorting and random number generation. The following pseudo-quicksort:
```haskell
import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
               in qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints
```
when compiled with GHC 8.6.4 and `-O2`, runs in about 9.7 seconds. The following "parallelized" version:
```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
                   a' = qsort a
                   b' = qsort b
               in (b' `par` a') ++ x:b'
qsort [] = []
```
compiled with `ghc -O2 -threaded`, runs in about 11.0 seconds on one capability. Run it with `+RTS -N4`, and it finishes in 7.1 seconds.
Ta-da! An improvement.
(In contrast, the version with `System.Random` runs in about 13 seconds non-parallel and about 12 seconds parallel on one capability (probably thanks to some minor strictness improvement), but slows down substantially when more capabilities are added; the timings are also variable, though I'm not quite sure why.)
**Splitting up the `partition`**
One obvious problem with this version is that, even though `a' = qsort a` and `b' = qsort b` are computed in parallel, they are both tied to the same sequential `partition` call. Splitting this up into two filters:
```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let a = qsort $ filter (<=x) xs
                   b = qsort $ filter (>x) xs
               in b `par` a ++ x:b
qsort [] = []
```
speeds us up to about 5.5 seconds with `-N4`. To be fair, even the non-parallel version is actually a little faster with two `filter`s in place of the `partition` call, at least when sorting `Int`s. There are probably some additional optimizations possible with the filters compared to the partition that make the extra comparisons worth it.
**Reducing the number of sparks**
Now, what you tried to do in your `pQuickSort` above was to limit the parallel computation to the top-most set of recursions. Let's experiment with the following `psort`:
```haskell
psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs) = let a = psort (n-1) $ filter (<=x) xs
                     b = psort (n-1) $ filter (>x) xs
                 in if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []
```
This will parallelize the top `n` layers of the recursion. My particular LCG example with a seed of 1 (i.e., `iterate lcg 1`) recurses up to 54 layers deep, so `psort 55` should give the same performance as the fully parallel version, except for the overhead of tracking the layer count. When I run it, I get a time of about 5.8 seconds with `-N4`, so the overhead is quite small.
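The recursion-depth figure can be checked directly (this is my own sketch, not part of the answer above): compute the depth of the `qsort` call tree for the LCG list. For the full 5,000,000 elements this takes a while, so it's shown here on a smaller prefix.

```haskell
import Data.List (partition)

-- Same LCG as in the benchmark above.
randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (1664525 * x + 1013904223) `rem` (2 ^ 32)

-- Depth of the qsort recursion tree: 1 + the deeper of the two branches.
qsortDepth :: Ord a => [a] -> Int
qsortDepth []     = 0
qsortDepth (x:xs) = let (a, b) = partition (<= x) xs
                    in 1 + max (qsortDepth a) (qsortDepth b)

main :: IO ()
main = print (qsortDepth (randomList 100000))
```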
Now let's see what happens when we reduce the number of layers:
| Layers | 55  | 40  | 30  | 20  | 10  | 5   | 3   | 1    |
|--------|-----|-----|-----|-----|-----|-----|-----|------|
| time   | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |
Note that there's little to be gained at the high end of layer counts. This is mainly because the average depth of the tree is probably around 25 layers or so, so there are only a handful of computations at 50 layers, many of them involving weird, one-sided partitions, and they are certainly too small to be worth parallelizing. On the other hand, there seems to be no penalty for those extra `par` calls either.
Meanwhile, the gains keep increasing down to at least 20 layers, so trying to artificially limit the total number of sparks to about 16 (i.e., the top 4 or 5 layers) is a big loss.
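As a quick sanity check on that arithmetic (the helper name here is mine, not from the answer): sparking only in the top `n` layers of a full binary call tree creates at most `2^n - 1` sparks, so a cap of about 16 sparks corresponds to roughly the top 4 layers.

```haskell
-- Layer 0 sparks at most 1 computation, layer 1 at most 2, ..., layer
-- n-1 at most 2^(n-1), so the top n layers spark at most 2^n - 1 total.
maxSparks :: Int -> Int
maxSparks n = 2 ^ n - 1

main :: IO ()
main = print (map maxSparks [3, 4, 5])  -- prints [7,15,31]
```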