Is it possible to speed up parallel quicksort in Haskell?

I have this seemingly trivial parallel implementation of quicksort; the code is as follows:

```haskell
import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallelQuicksort
-- As long as n > 0 evaluates the lower and upper part of the list in parallel,
-- when we have recursed deep enough, n == 0, this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in  pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in  (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = (take 5000000) $ randoms gen :: [Int]
  putStrLn . show . sum $ (quicksort randints)
```

I compile with

 ghc --make -threaded -O2 quicksort.hs 

and run with

 ./quicksort +RTS -N16 -RTS 

No matter what I do, I can't get it to work faster than a simple sequential implementation running on a single processor.

  1. Can anyone explain why this runs much slower on multiple processors than on one?
  2. Is it possible to make this scale, at least sublinearly, with the number of processors by doing some sort of trick?

EDIT: @tempestadept hinted that quicksort itself is the problem. To check this, I implemented a simple merge sort in the same spirit as the example above. It has the same behaviour: it runs slower the more capabilities (cores) you add.

```haskell
import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where
    helper _ left right [] = (left, right)
    helper True  left right (x:xs) = helper False (x:left) right xs
    helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x < y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . (take n) $ randoms gen

main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r
```
5 answers

I'm not sure how well it can work for idiomatic quicksort, but it can work (to a somewhat lesser extent) for the true imperative quicksort, as shown by Roman in Sparking Imperatives.

He never achieved a good speedup, though. Doing this properly really needs true work stealing, with a deque that does not overflow the way the spark queue does, to get good results.

You won't get any noticeable improvement, since your pseudo-quicksort involves list concatenation, which cannot be parallelized and takes quadratic time (total time over all the concatenations). I would advise you to try merge sort, which is O(n log n) on linked lists.
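To illustrate why merging scales better: a merge only ever walks the two runs it combines, so the total work across all passes is O(n log n). Here is a minimal sketch (my own illustration, not from the answer) of a bottom-up merge sort on lists:

```haskell
-- Bottom-up merge sort: no quadratic (++) chains, only merges of
-- already-sorted runs. (Illustrative sketch, not from the answer.)
mergeSortBU :: Ord a => [a] -> [a]
mergeSortBU = go . map (: [])
  where
    go []   = []
    go [xs] = xs
    go xss  = go (pairwise xss)

    -- Merge adjacent runs; each pass halves the number of runs,
    -- giving O(log n) passes of O(n) total merging work.
    pairwise (a:b:rest) = merge2 a b : pairwise rest
    pairwise xss        = xss

    merge2 xs [] = xs
    merge2 [] ys = ys
    merge2 (x:xs) (y:ys)
      | x <= y    = x : merge2 xs (y:ys)
      | otherwise = y : merge2 (x:xs) ys
```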

Also, to run the program with a larger number of threads, you must compile it with -rtsopts .


par only evaluates its first argument to weak head normal form. That is to say: if the first argument's type is Maybe Int , then par will evaluate just enough to check whether the result is Nothing or Just something , and stop there. It will not evaluate something . Similarly for lists, it evaluates only enough to check whether the list is [] or something : something_else . To evaluate the entire list in parallel, you do not pass the list to par directly; you create an expression that depends on the whole list, so that when par forces it, the entire list is needed. For instance:

```haskell
evalList :: [a] -> ()
evalList [] = ()
evalList (a:r) = a `pseq` evalList r

pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (evalList r `par` l) `pseq` (merge l r)
```

Another note: the overhead of creating new sparks in Haskell is really low, so the pMergeSort 0 ... special case probably isn't useful.
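For reference, the same full forcing can be written more generically with the deepseq package: force r evaluated to WHNF evaluates r completely, so it can replace a hand-rolled evalList. This is a sketch of my own (parForce, parMergeTwo, and the sortHalf argument are made-up names, not from the answer):

```haskell
import Control.DeepSeq (NFData, force)
import GHC.Conc (par, pseq)  -- par/pseq are also exported from base's GHC.Conc

-- Spark *full* evaluation of r (not just WHNF), then continue with a.
-- A generic stand-in for: evalList r `par` a
parForce :: NFData b => b -> a -> a
parForce r a = force r `par` a

-- Used in the same spirit as pMergeSort's parallel case: sort both
-- halves (here with a caller-supplied function) and merge the results.
parMergeTwo :: (Ord a, NFData a) => ([a] -> [a]) -> [a] -> [a] -> [a]
parMergeTwo sortHalf left right =
  let l = sortHalf left
      r = sortHalf right
  in parForce r l `pseq` merge2 l r
  where
    merge2 xs [] = xs
    merge2 [] ys = ys
    merge2 (x:xs) (y:ys)
      | x <= y    = x : merge2 xs (y:ys)
      | otherwise = y : merge2 (x:xs) ys
```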


There are several issues that have already been mentioned:

  • Using lists won't get you the performance you are looking for. Even the example implementation using vector is about 50x faster than the list version, because it swaps elements in place. For that reason, my answer uses an implementation based on the array library massiv, rather than lists.
  • I usually find Haskell's scheduler far from perfect for CPU-bound tasks, so, as @Edward Kmett noted in his answer, we need a work-stealing scheduler, which I conveniently implemented for the aforementioned library: scheduler
```haskell
-- A helper function that partitions a region of a mutable array.
unstablePartitionRegionM ::
     forall r e m. (Mutable r Ix1 e, PrimMonad m)
  => MArray (PrimState m) r Ix1 e
  -> (e -> Bool)
  -> Ix1 -- ^ Start index of the region
  -> Ix1 -- ^ End index of the region
  -> m Ix1
unstablePartitionRegionM marr f start end = fromLeft start (end + 1)
  where
    fromLeft i j
      | i == j = pure i
      | otherwise = do
          x <- A.unsafeRead marr i
          if f x
            then fromLeft (i + 1) j
            else fromRight i (j - 1)
    fromRight i j
      | i == j = pure i
      | otherwise = do
          x <- A.unsafeRead marr j
          if f x
            then do
              A.unsafeWrite marr j =<< A.unsafeRead marr i
              A.unsafeWrite marr i x
              fromLeft (i + 1) j
            else fromRight i (j - 1)
{-# INLINE unstablePartitionRegionM #-}
```

Here is the actual in-place quicksort:

```haskell
quicksortMArray ::
     (Ord e, Mutable r Ix1 e, PrimMonad m)
  => Int
  -> (m () -> m ())
  -> A.MArray (PrimState m) r Ix1 e
  -> m ()
quicksortMArray numWorkers schedule marr =
  schedule $ qsort numWorkers 0 (unSz (msize marr) - 1)
  where
    qsort n !lo !hi =
      when (lo < hi) $ do
        p <- A.unsafeRead marr hi
        l <- unstablePartitionRegionM marr (< p) lo hi
        A.unsafeWrite marr hi =<< A.unsafeRead marr l
        A.unsafeWrite marr l p
        if n > 0
          then do
            let !n' = n - 1
            schedule $ qsort n' lo (l - 1)
            schedule $ qsort n' (l + 1) hi
          else do
            qsort n lo (l - 1)
            qsort n (l + 1) hi
{-# INLINE quicksortMArray #-}
```

Now, if we look at the numWorkers and schedule arguments, they are quite opaque. Say we supply 1 for the first argument and id for the second: then we simply have a sequential quicksort. But if we have a function available that can schedule each task to be evaluated concurrently, we get a parallel implementation of quicksort. Luckily for us, massiv provides this out of the box with withMArray :

```haskell
withMArray ::
     (Mutable r ix e, MonadUnliftIO m)
  => Array r ix e
  -> (Int -> (m () -> m ()) -> MArray RealWorld r ix e -> m a)
  -> m (Array r ix e)
```
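To make the Int and (m () -> m ()) arguments less opaque, here is a toy stand-in for such a schedule action (purely illustrative; the names and design are mine, and massiv's real scheduler is far more careful about worker counts and exceptions): every submitted job runs in its own thread, and a companion action blocks until all jobs, including jobs scheduled by other jobs, have finished.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- Returns (schedule, waitAll): 'schedule' forks each job off,
-- 'waitAll' blocks until every job (and every job a job scheduled)
-- is done. A crude sketch of the (m () -> m ()) hook above.
newToyScheduler :: IO (IO () -> IO (), IO ())
newToyScheduler = do
  pending <- newMVar []
  let schedule job = do
        done <- newEmptyMVar
        modifyMVar_ pending (pure . (done :))
        _ <- forkIO (job >> putMVar done ())
        pure ()
      waitAll = do
        batch <- modifyMVar pending (\jobs -> pure ([], jobs))
        mapM_ takeMVar batch
        -- jobs may have scheduled further jobs while we waited
        more <- readMVar pending
        if null more then pure () else waitAll
  pure (schedule, waitAll)
```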

Here is a pure version that makes a copy of the array and then sorts it in place, using the computation strategy specified in the array itself:

```haskell
quicksortArray :: (Mutable r Ix1 e, Ord e) => Array r Ix1 e -> Array r Ix1 e
quicksortArray arr = unsafePerformIO $ withMArray arr quicksortMArray
{-# INLINE quicksortArray #-}
```

Now for the best part: the benchmarks. Order of results:

  • Introsort from vector-algorithms
  • In-place quicksort using vector, from this answer
  • The C implementation I took from this question
  • Sequential quicksort using massiv
  • Same as above, but in parallel, on a computer with a modest 4-core i7 (3rd generation) processor with hyperthreading
```
benchmarking QuickSort/Vector Algorithms
time                 101.3 ms   (93.75 ms .. 107.8 ms)
                     0.991 R²   (0.974 R² .. 1.000 R²)
mean                 97.13 ms   (95.17 ms .. 100.2 ms)
std dev              4.127 ms   (2.465 ms .. 5.663 ms)

benchmarking QuickSort/Vector
time                 89.51 ms   (87.69 ms .. 91.92 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 92.67 ms   (91.54 ms .. 94.50 ms)
std dev              2.438 ms   (1.468 ms .. 3.493 ms)

benchmarking QuickSort/C
time                 88.14 ms   (86.71 ms .. 89.41 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 90.11 ms   (89.17 ms .. 93.35 ms)
std dev              2.744 ms   (387.1 μs .. 4.686 ms)

benchmarking QuickSort/Array
time                 76.07 ms   (75.77 ms .. 76.41 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 76.08 ms   (75.75 ms .. 76.28 ms)
std dev              453.7 μs   (247.8 μs .. 699.6 μs)

benchmarking QuickSort/Array Par
time                 25.25 ms   (24.84 ms .. 25.61 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 25.13 ms   (24.80 ms .. 25.75 ms)
std dev              991.6 μs   (468.5 μs .. 1.782 ms)
```

The benchmarks sort 1,000,000 random Int64 s. If you want to see the full code, you can find it here: https://github.com/lehins/haskell-quicksort

To summarize, we got a 3x speedup on a quad-core processor with 8 capabilities, which sounds pretty good to me. Thanks for this question; now I can add a sorting function to massiv ;)


I'm not sure if it's worth noting, given @lehins' excellent answer, but ...

Why your pQuickSort doesn't work

There are two big problems with your pQuickSort . First, you use System.Random , which is extremely slow and interacts strangely with parallel sorting (see below). Second, your par u l sparks a computation to evaluate:

```haskell
u = [x] ++ pQuicksort (n `div` 2) upper
```

to WHNF, namely u = x : UNEVALUATED_THUNK , so your sparks don't do any real work.

Getting a speedup with a simple pseudo-quicksort

In fact, it is not difficult to observe a performance improvement when parallelizing a naive, not-in-place pseudo-quicksort. As mentioned, an important consideration is to avoid System.Random . With a fast LCG, we can measure the actual sorting time rather than some odd mixture of sorting and random number generation. The following pseudo-quicksort:

```haskell
import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
               in qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where
    lcg x = (a * x + c) `rem` m
    a = 1664525
    c = 1013904223
    m = 2 ^ 32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints
```

when compiled with GHC 8.6.4 and -O2 , takes about 9.7 seconds. The following "parallelized" version:

```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
                   a' = qsort a
                   b' = qsort b
               in (b' `par` a') ++ x:b'
qsort [] = []
```

compiled with ghc -O2 -threaded , runs in about 11.0 seconds on one capability. Add +RTS -N4 , and it runs in about 7.1 seconds.

Ta-da! An improvement.

(For comparison, the version with System.Random runs in about 13 seconds non-parallel, about 12 seconds parallel on one capability (perhaps just due to a small strictness improvement), and slows down substantially as more capabilities are added; the timings are also erratic, though I'm not quite sure why.)

Parallelizing the partition

One obvious problem with this version is that, even though a' = qsort a and b' = qsort b are computed in parallel, they are tied to the same sequential partition call. Splitting this up into two filters:

```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let a = qsort $ filter (<=x) xs
                   b = qsort $ filter (>x) xs
               in b `par` a ++ x:b
qsort [] = []
```

gets us down to about 5.5 seconds with -N4 . To be fair, even the non-parallel version is actually a little faster with two filters instead of one partition call, at least when sorting Ints . There are probably some additional optimizations possible with filter , compared to partition , that make the extra comparisons worthwhile.

Reducing the number of sparks

Now, what you tried to do in your pQuickSort above was to limit the parallel computation to the topmost set of recursions. Let's use the following psort to experiment with that:

```haskell
psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs) = let a = psort (n-1) $ filter (<=x) xs
                     b = psort (n-1) $ filter (>x) xs
                 in if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []
```

This will parallelize the top n layers of recursion. My particular LCG example with a seed of 1 (i.e., iterate lcg 1 ) recurses up to 54 layers deep, so psort 55 should give the same performance as the fully parallel version, except for the overhead of keeping track of the layers. When I run it, I get a time of about 5.8 seconds with -N4 , so the overhead is quite small.
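The recursion depth is easy to check directly. Here is a hedged sketch of my own ( qsortDepth is a made-up helper, and I use a much smaller input so it runs quickly) that measures the depth of the qsort call tree for the same LCG sequence:

```haskell
-- Depth of the call tree that qsort/psort would produce.
qsortDepth :: Ord a => [a] -> Int
qsortDepth [] = 0
qsortDepth (x:xs) =
  1 + max (qsortDepth (filter (<= x) xs)) (qsortDepth (filter (> x) xs))

-- The same LCG as in the answer above.
randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where
    lcg x = (a * x + c) `rem` m
    a = 1664525
    c = 1013904223
    m = 2 ^ 32
```

On random input the depth sits well above the perfect log2 n but far below n, which is consistent with the gains flattening out long before layer 55.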

Now let's see what happens when we reduce the number of layers:

```
| Layers | 55  | 40  | 30  | 20  | 10  | 5   | 3   | 1    |
|--------+-----+-----+-----+-----+-----+-----+-----+------|
| time   | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |
```

Note that there is little to be gained from parallel computation at the deepest layers. This is mostly because the average depth of the tree is probably around 25 layers or so, so there are only a handful of computations at 50 layers, many of them with strange, lop-sided partitions, and they are certainly too small to be worth parallelizing. On the other hand, there seems to be no penalty for those extra par calls.

Meanwhile, the gains keep increasing down to at least 20 layers, so trying to artificially limit the total number of sparks to 16 (i.e., the top 4 or 5 layers of recursion) is a big loss.


Source: https://habr.com/ru/post/957281/

