Haskell math performance on a multiply-add operation

I am writing a game in Haskell, and my current pass at the user interface involves a lot of procedural geometry generation. I am currently focused on measuring the performance of one specific operation (C-ish pseudocode):

    Vec4f multiplier, addend;
    Vec4f vecList[];
    for (int i = 0; i < count; i++)
        vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four-float vectors, and a prime candidate for SIMD optimization.

The result goes into an OpenGL vertex buffer, so in the end it has to be dumped into a flat C array. For the same reason, the calculations should probably be done on C 'float' types.
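For reference, a storable vector already keeps its payload in a foreign buffer, so the final hand-off to C is cheap. A minimal sketch of that step (the copyToC name and the use of copyArray are my own illustration, not from the post):

```haskell
import qualified Data.Vector.Storable as V
import Foreign.C.Types (CFloat)
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (copyArray)

-- Copy a storable vector into a caller-supplied C buffer.  With
-- V.unsafeWith we get a Ptr to the vector's own storage for the
-- duration of the callback, so a C API (e.g. an OpenGL buffer upload)
-- could even read from it directly, with no copy at all.
copyToC :: Ptr CFloat -> V.Vector CFloat -> IO ()
copyToC dst src = V.unsafeWith src $ \p -> copyArray dst p (V.length src)
```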

I have looked for either a library or an idiomatic solution of my own to do this fast in Haskell, but every solution I have come up with hovers around 2% of the performance (i.e., 50 times slower) of C compiled by GCC with the right flags. Granted, I only started Haskell a couple of weeks ago, so my experience is limited; that is why I am turning to you. Can any of you suggest a faster Haskell implementation, or point me to documentation on how to write high-performance Haskell code?

First, the most recent Haskell solution (it clocks in at about 12 seconds). I tried the bang-patterns advice from this SO post, but it made no difference AFAICT. Replacing "multAdd" with "(\i v -> v * 4)" brought execution time down to 1.9 seconds, so the bitwise operations (and the resulting obstacles to automatic optimization) do not seem to be too much at fault.

    {-# LANGUAGE BangPatterns #-}
    {-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

    import Data.Vector.Storable
    import qualified Data.Vector.Storable as V
    import Foreign.C.Types
    import Data.Bits

    repCount = 10000
    arraySize = 20000

    a = fromList [0.2::CFloat, 0.1, 0.6, 1.0]
    m = fromList [0.99::CFloat, 0.7, 0.8, 0.6]

    multAdd :: Int -> CFloat -> CFloat
    multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

    multList :: Int -> Vector CFloat -> Vector CFloat
    multList !count !src
        | count <= 0 = src
        | otherwise  = multList (count-1) $ V.imap multAdd src

    main =
        print $ Data.Vector.Storable.sum
              $ multList repCount
              $ Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)

Here's what I have in C. There are a few #ifdefs in the code which prevent it from being compiled directly; scroll down for the test driver.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef float v4fs __attribute__ ((vector_size (16)));
    typedef struct { float x, y, z, w; } Vector4;

    void setv4(v4fs *v, float x, float y, float z, float w) {
        float *a = (float*) v;
        a[0] = x; a[1] = y; a[2] = z; a[3] = w;
    }

    float sumv4(v4fs *v) {
        float *a = (float*) v;
        return a[0] + a[1] + a[2] + a[3];
    }

    void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
        for (int j = 0; j < N; j++) {
            d[j] = s[j] * m + a;
        }
    }

    void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
                 Vector4 a, Vector4 m) {
        for (int j = 0; j < (N*4); j += 4) {
            d[j+0] = s[j+0] * m.x + a.x;
            d[j+1] = s[j+1] * m.y + a.y;
            d[j+2] = s[j+2] * m.z + a.z;
            d[j+3] = s[j+3] * m.w + a.w;
        }
    }

    int main() {
        v4fs a, m;
        v4fs *s, *d;

        setv4(&a, 0.2, 0.1, 0.6, 1.0);
        setv4(&m, 0.99, 0.7, 0.8, 0.6);

        s = calloc(N, sizeof(v4fs));
        d = s;

        double start = clock();
        for (int i = 0; i < M; i++) {
    #ifdef COPY
            d = malloc(N * sizeof(v4fs));
    #endif
    #ifdef VECTOR
            vecmult(s, d, a, m);
    #else
            Vector4 aa = *(Vector4*)(&a);
            Vector4 mm = *(Vector4*)(&m);
            scamult((float*)s, (float*)d, aa, mm);
    #endif
    #ifdef COPY
            free(s);
            s = d;
    #endif
        }
        double end = clock();

        float sum = 0;
        for (int j = 0; j < N; j++) {
            sum += sumv4(s+j);
        }
        printf("%-50s %2.5f %f\n\n", NAME,
               (end - start) / (double) CLOCKS_PER_SEC, sum);
    }

This script compiles and runs the tests under a number of gcc flag combinations. The best performer on my system was cmath-64-native-O3-restrict-vector-nocopy, taking 0.22 seconds.

    import System.Process
    import GHC.IOBase

    cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")

    cOptions = [
        [("32", "-m32"), ("64", "-m64")],
        [("generic", ""), ("native", "-march=native -msse4")],
        [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
        [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
         ("norestrict", "-DMAYBE_RESTRICT=")],
        [("vector", "-DVECTOR"), ("scalar", "")],
        [("copy", "-DCOPY"), ("nocopy", "")]
      ]

    -- Fold over the Cartesian product of the double list. There's probably a
    -- Prelude function or two that does this, but hey. The 'perm' referred to
    -- permutations until I realized this wasn't actually doing permutations.
    permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
    permfold f z []     = [z]
    permfold f z (x:xs) = concat $ map (\a -> permfold f (f z a) xs) x

    prepCmd :: (String, String) -> (String, String) -> (String, String)
    prepCmd (name, cmd) (namea, cmda) =
        (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

    runCCmd name compileCmd = do
        res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
        if res == ExitSuccess
            then do system ("./" ++ name)
                    return ()
            else putStrLn $ name ++ " did not compile"

    main = mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
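As the comment in the script suspects, that Cartesian-product fold is already expressible with standard machinery: sequence on the list monad enumerates the product, and a left fold applies the combining function. A drop-in equivalent (my sketch, not from the post):

```haskell
-- Equivalent to the script's hand-rolled permfold: enumerate the
-- Cartesian product of the option lists with sequence, then fold
-- each combination with the combining function.
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z xss = map (foldl f z) (sequence xss)
```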
performance math simd haskell
Jun 25 '10 at 4:28
2 answers

Roman Leshchinskiy replies:

Actually, the core looks basically fine to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between it and the C program comes down to GHC's general weakness when it comes to floating point. HEAD produces the best results with the NCG and -msse2.

First define a new Vec4 data type:

    {-# LANGUAGE BangPatterns #-}

    import Data.Vector.Storable
    import qualified Data.Vector.Storable as V
    import Foreign
    import Foreign.C.Types

    -- Define a 4 element vector type
    data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat

Make sure we can store it in an array

    instance Storable Vec4 where
      sizeOf _ = sizeOf (undefined :: CFloat) * 4
      alignment _ = alignment (undefined :: CFloat)

      {-# INLINE peek #-}
      peek p = do
                 a <- peekElemOff q 0
                 b <- peekElemOff q 1
                 c <- peekElemOff q 2
                 d <- peekElemOff q 3
                 return (Vec4 a b c d)
        where
          q = castPtr p

      {-# INLINE poke #-}
      poke p (Vec4 a b c d) = do
                 pokeElemOff q 0 a
                 pokeElemOff q 1 b
                 pokeElemOff q 2 c
                 pokeElemOff q 3 d
        where
          q = castPtr p
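A quick way to gain confidence in a hand-written Storable instance like this is a poke/peek round-trip check. The sketch below re-declares the type (with an Eq instance for the comparison) so it runs standalone; it is my own illustration, not part of the answer:

```haskell
import Foreign
import Foreign.C.Types

-- Re-declaration of Vec4 with a Storable instance of the same shape
-- as above, plus Eq/Show so the round-trip can be compared.
data Vec4 = Vec4 !CFloat !CFloat !CFloat !CFloat
  deriving (Eq, Show)

instance Storable Vec4 where
  sizeOf _    = 4 * sizeOf (undefined :: CFloat)
  alignment _ = alignment (undefined :: CFloat)
  peek p = do
    let q = castPtr p
    a <- peekElemOff q 0
    b <- peekElemOff q 1
    c <- peekElemOff q 2
    d <- peekElemOff q 3
    return (Vec4 a b c d)
  poke p (Vec4 a b c d) = do
    let q = castPtr p
    pokeElemOff q 0 a
    pokeElemOff q 1 b
    pokeElemOff q 2 c
    pokeElemOff q 3 d

-- poke then peek should give back exactly what we stored.
main :: IO ()
main = alloca $ \p -> do
  poke p (Vec4 0.2 0.1 0.6 1.0)
  v <- peek p
  print (v == Vec4 0.2 0.1 0.6 1.0)
```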

Values and operations on this type:

    a = Vec4 0.2 0.1 0.6 1.0
    m = Vec4 0.99 0.7 0.8 0.6

    add :: Vec4 -> Vec4 -> Vec4
    {-# INLINE add #-}
    add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

    mult :: Vec4 -> Vec4 -> Vec4
    {-# INLINE mult #-}
    mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

    vsum :: Vec4 -> CFloat
    {-# INLINE vsum #-}
    vsum (Vec4 a b c d) = a+b+c+d

    multList :: Int -> Vector Vec4 -> Vector Vec4
    multList !count !src
        | count <= 0 = src
        | otherwise  = multList (count-1) $ V.map (\v -> add (mult v m) a) src

    main = print
         $ Data.Vector.Storable.sum
         $ Data.Vector.Storable.map vsum
         $ multList repCount
         $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

    repCount, arraySize :: Int
    repCount = 10000
    arraySize = 20000

With ghc 6.12.1, -O2 -fasm:

  • 1.752s

With ghc HEAD (June 26), -O2 -fasm -msse2:

  • 1.708s

This seems like the most idiomatic way to write an array of Vec4, and it gets the best performance (11 times faster than your original). (And it could be a benchmark case for GHC's LLVM backend.)

Jun 28 '10 at 17:06

Good, that’s better. 3.5s instead of 14s.

    {-# LANGUAGE BangPatterns #-}
    {-
    -- multiply-add of four floats,
    Vec4f multiplier, addend;
    Vec4f vecList[];
    for (int i = 0; i < count; i++)
        vecList[i] = vecList[i] * multiplier + addend;
    -}

    import qualified Data.Vector.Storable as V
    import Data.Vector.Storable (Vector)
    import Data.Bits

    repCount, arraySize :: Int
    repCount  = 10000
    arraySize = 20000

    a, m :: Vector Float
    a = V.fromList [0.2, 0.1, 0.6, 1.0]
    m = V.fromList [0.99, 0.7, 0.8, 0.6]

    multAdd :: Int -> Float -> Float
    multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))

    go :: Int -> Vector Float -> Vector Float
    go n s | n <= 0    = s
           | otherwise = go (n-1) (f s)
      where
        f = V.imap multAdd

    main = print . V.sum $ go repCount v
      where
        v :: Vector Float
        v = V.replicate (arraySize * 4) 0
        -- ^ a flattened Vec4f[]

Which is better than it was:

    $ ghc -O2 --make A.hs
    [1 of 1] Compiling Main             ( A.hs, A.o )
    Linking A ...

    $ time ./A
    516748.13
    ./A  3.58s user 0.01s system 99% cpu 3.593 total

multAdd itself compiles to quite good Core:

    case readFloatOffAddr# rb_aVn
           (word2Int# (and# (int2Word# sc1_s1Yx) __word 3)) realWorld#
    of _ { (# s25_X1Tb, x4_X1Te #) ->
    case readFloatOffAddr# rb11_X118
           (word2Int# (and# (int2Word# sc1_s1Yx) __word 3)) realWorld#
    of _ { (# s26_X1WO, x5_X20B #) ->
    case writeFloatOffAddr# @ RealWorld a17_s1Oe sc3_s1Yz
           (plusFloat# (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)

However, your C code multiplies 4 elements at a time, so to match it we would need to do that directly, not fake it with a scalar loop and an index mask. GCC is probably unrolling the loop as well.

So to get identical performance we would need to vectorize the multiply (a bit trickier, perhaps via the LLVM backend) and unroll the loop (possibly with fusion). I'll defer to Roman here on whether there are other obvious things to try.
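For the unrolling half of that, here is one hedged sketch (my own, not from the answer): walk the flat buffer four floats at a stride, with explicit unsafe reads and writes, so the per-element index masking disappears from the inner computation. Whether this actually beats the V.imap version would need measuring.

```haskell
import qualified Data.Vector.Storable as V
import qualified Data.Vector.Storable.Mutable as M
import Control.Monad (forM_)

-- One multiply-add pass over a flattened Vec4f[] buffer, handling
-- four floats per step of the outer loop.  Assumes the buffer length
-- is a multiple of 4 and that m and a each hold the 4 components of
-- the multiplier and addend.
multAdd4 :: V.Vector Float -> V.Vector Float -> V.Vector Float -> V.Vector Float
multAdd4 m a = V.modify $ \mv ->
  forM_ [0, 4 .. M.length mv - 4] $ \j ->
    forM_ [0 .. 3] $ \k -> do
      x <- M.unsafeRead mv (j + k)
      M.unsafeWrite mv (j + k) (x * V.unsafeIndex m k + V.unsafeIndex a k)
```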

One idea might be to use a vector of Vec4 rather than the flattened float array.

Jun 25 '10 at 17:09


