Haskell Vector performance compared to Scala

Question

Haskell Vector performance compared to Scala

I have very simple code in Haskell and Scala. This code is designed to work in a very narrow cycle, so performance matters. The problem is that Haskell is about 10 times slower than Scala. Here is the Haskell code.

{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as VU newtype AffineTransform = AffineTransform {get :: (VU.Vector Double)} deriving (Show) {-# INLINE runAffineTransform #-} runAffineTransform :: AffineTransform -> (Double, Double) -> (Double, Double) runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2, get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5) testAffineTransformSpeed :: AffineTransform -> Int -> (Double, Double) testAffineTransformSpeed affTr count = go count (0.5, 0.5) where go :: Int -> (Double, Double) -> (Double, Double) go 0 res = res go !n !res = go (n-1) (runAffineTransform affTr res)

What else can be done to improve this code?

+4

performance scala haskell

Aurimas Aug 22 '13 at 3:59

source share

2 answers

I defined the following type of strict / unpacked pairs:

 import System.Random.MWC -- for later import Control.DeepSeq data SP = SP { one :: {-# UNPACK #-} !Double , two :: {-# UNPACK #-} !Double } deriving Show instance NFData SP where rnf p = rnf (one p) `seq` rnf (two p) `seq` ()

and replaced it in the runAffineTransform function:

 runAffineTransform2 :: AffineTransform -> SP -> SP runAffineTransform2 affTr !(SP xy) = SP ( get affTr `U.unsafeIndex` 0 * x + get affTr `U.unsafeIndex` 1 * y + get affTr `U.unsafeIndex` 2 ) ( get affTr `U.unsafeIndex` 3 * x + get affTr `U.unsafeIndex` 4 * y + get affTr `U.unsafeIndex` 5 ) {-# INLINE runAffineTransform2 #-}

then ran this test suite:

 main :: IO () main = do g <- create zs <- fmap (AffineTransform . U.fromList) (replicateM 100000 (uniformR (0 :: Double, 1) g)) let myConfig = defaultConfig { cfgPerformGC = ljust True } defaultMainWith myConfig (return ()) [ bench "yours" $ nf (testAffineTransformSpeed zs) 10 , bench "mine" $ nf (testAffineTransformSpeed2 zs) 10 ]

Compiled with -O2 and ran, and observed some (~ 4x) acceleration:

 benchmarking yours mean: 257.4559 ns, lb 256.2492 ns, ub 258.9761 ns, ci 0.950 std dev: 6.889905 ns, lb 5.688330 ns, ub 8.839753 ns, ci 0.950 found 5 outliers among 100 samples (5.0%) 3 (3.0%) high mild 2 (2.0%) high severe variance introduced by outliers: 20.944% variance is moderately inflated by outliers benchmarking mine mean: 69.56408 ns, lb 69.29910 ns, ub 69.86838 ns, ci 0.950 std dev: 1.448874 ns, lb 1.261444 ns, ub 1.718074 ns, ci 0.950 found 4 outliers among 100 samples (4.0%) 4 (4.0%) high mild variance introduced by outliers: 14.190% variance is moderately inflated by outliers

The full code is in the gist here .

EDIT

I also posted a results report here .

+9

jtobin Aug 22 '13 at 4:36

source share

Daniel Fischer · Accepted Answer · 2013-08-22T12:14:57+0000

The main problem is that

 runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2, get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5)

creates a couple of tricks. Components are not evaluated when runAffineTransform called; they remain discontinuities until any consumer requires evaluation.

 testAffineTransformSpeed affTr count = go count (0.5, 0.5) where go :: Int -> (Double, Double) -> (Double, Double) go 0 res = res go !n !res = go (n-1) (runAffineTransform affTr res)

is not that consumer, bang on res evaluates only its external constructor (,) , and you get the result

 runAffineTransform affTr (runAffineTrasform affTr (runAffineTransform affTr (...)))

which is evaluated only at the end, when finally a normal form is required.

If you immediately forcibly evaluate the components of the result,

 runAffineTransform affTr (!x, !y) = case ( get affTr `U.unsafeIndex` 0 * x + get affTr `U.unsafeIndex` 1 * y + get affTr `U.unsafeIndex` 2 , get affTr `U.unsafeIndex` 3 * x + get affTr `U.unsafeIndex` 4 * y + get affTr `U.unsafeIndex` 5 ) of (!a,!b) -> (a,b)

and let it be built in, the main difference from jtobin using a custom strict unboxed Double# pair is that for the loop in testAffineTransformSpeed you get one initial iteration using the Double argument in the box, and at the end the result components are put into the box, which adds some constant overhead (something about 5 nanoseconds per cycle on my box). The main part of the loop takes the arguments Int# and two Double# in both cases, and the body of the loop is identical, with the exception of boxing, when n = 0 reached.

Of course, forced immediate evaluation of components using the unpacked strict pair type is better.

Haskell Vector performance compared to Scala

More articles: