Haskell math performance on a multiply-add operation

I am writing a game in Haskell, and my current pass at the user interface involves a lot of procedural geometry generation. I am currently focused on measuring the performance of one specific operation (C-ish pseudocode):

    Vec4f multiplier, addend;
    Vec4f vecList[];
    for (int i = 0; i < count; i++)
        vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four-float vectors, and a prime candidate for SIMD optimization.

The result goes into an OpenGL vertex buffer, so in the end it has to be dumped into a flat C array. For the same reason, the calculations should probably be done on C 'float' types.
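For reference, a storable vector already keeps its payload in a foreign buffer, so the final hand-off to C is cheap. A minimal sketch of that step (the copyToC name and the use of copyArray are my own illustration, not from the post):

```haskell
import qualified Data.Vector.Storable as V
import Foreign.C.Types (CFloat)
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (copyArray)

-- Copy a storable vector into a caller-supplied C buffer.  With
-- V.unsafeWith we get a Ptr to the vector's own storage for the
-- duration of the callback, so a C API (e.g. an OpenGL buffer upload)
-- could even read from it directly, with no copy at all.
copyToC :: Ptr CFloat -> V.Vector CFloat -> IO ()
copyToC dst src = V.unsafeWith src $ \p -> copyArray dst p (V.length src)
```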

I have looked for either a library or an idiomatic solution of my own to do this fast in Haskell, but every solution I have come up with hovers around 2% of the performance (i.e., 50 times slower) of C compiled by GCC with the right flags. Granted, I only started Haskell a couple of weeks ago, so my experience is limited; that is why I am turning to you. Can any of you suggest a faster Haskell implementation, or point me to documentation on how to write high-performance Haskell code?

First, the most recent Haskell solution (it clocks in at about 12 seconds). I tried the bang-patterns advice from this SO post, but it made no difference AFAICT. Replacing "multAdd" with "(\i v -> v * 4)" brought execution time down to 1.9 seconds, so the bitwise operations (and the resulting obstacles to automatic optimization) do not seem to be too much at fault.

    {-# LANGUAGE BangPatterns #-}
    {-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

    import Data.Vector.Storable
    import qualified Data.Vector.Storable as V
    import Foreign.C.Types
    import Data.Bits

    repCount = 10000
    arraySize = 20000

    a = fromList [0.2::CFloat, 0.1, 0.6, 1.0]
    m = fromList [0.99::CFloat, 0.7, 0.8, 0.6]

    multAdd :: Int -> CFloat -> CFloat
    multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

    multList :: Int -> Vector CFloat -> Vector CFloat
    multList !count !src
        | count <= 0 = src
        | otherwise  = multList (count-1) $ V.imap multAdd src

    main =
        print $ Data.Vector.Storable.sum
              $ multList repCount
              $ Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)

Here's what I have in C. There are a few #ifdefs in the code which prevent it from being compiled directly; scroll down for the test driver.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef float v4fs __attribute__ ((vector_size (16)));
    typedef struct { float x, y, z, w; } Vector4;

    void setv4(v4fs *v, float x, float y, float z, float w) {
        float *a = (float*) v;
        a[0] = x; a[1] = y; a[2] = z; a[3] = w;
    }

    float sumv4(v4fs *v) {
        float *a = (float*) v;
        return a[0] + a[1] + a[2] + a[3];
    }

    void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
        for (int j = 0; j < N; j++) {
            d[j] = s[j] * m + a;
        }
    }

    void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
                 Vector4 a, Vector4 m) {
        for (int j = 0; j < (N*4); j += 4) {
            d[j+0] = s[j+0] * m.x + a.x;
            d[j+1] = s[j+1] * m.y + a.y;
            d[j+2] = s[j+2] * m.z + a.z;
            d[j+3] = s[j+3] * m.w + a.w;
        }
    }

    int main() {
        v4fs a, m;
        v4fs *s, *d;

        setv4(&a, 0.2, 0.1, 0.6, 1.0);
        setv4(&m, 0.99, 0.7, 0.8, 0.6);

        s = calloc(N, sizeof(v4fs));
        d = s;

        double start = clock();
        for (int i = 0; i < M; i++) {
    #ifdef COPY
            d = malloc(N * sizeof(v4fs));
    #endif
    #ifdef VECTOR
            vecmult(s, d, a, m);
    #else
            Vector4 aa = *(Vector4*)(&a);
            Vector4 mm = *(Vector4*)(&m);
            scamult((float*)s, (float*)d, aa, mm);
    #endif
    #ifdef COPY
            free(s);
            s = d;
    #endif
        }
        double end = clock();

        float sum = 0;
        for (int j = 0; j < N; j++) {
            sum += sumv4(s+j);
        }
        printf("%-50s %2.5f %f\n\n", NAME,
               (end - start) / (double) CLOCKS_PER_SEC, sum);
    }

This script compiles and runs the tests under a number of gcc flag combinations. The best performer on my system was cmath-64-native-O3-restrict-vector-nocopy, taking 0.22 seconds.

    import System.Process
    import GHC.IOBase

    cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")

    cOptions = [
        [("32", "-m32"), ("64", "-m64")],
        [("generic", ""), ("native", "-march=native -msse4")],
        [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
        [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
         ("norestrict", "-DMAYBE_RESTRICT=")],
        [("vector", "-DVECTOR"), ("scalar", "")],
        [("copy", "-DCOPY"), ("nocopy", "")]
      ]

    -- Fold over the Cartesian product of the double list. There's probably a
    -- Prelude function or two that does this, but hey. The 'perm' referred to
    -- permutations until I realized this wasn't actually doing permutations.
    permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
    permfold f z []     = [z]
    permfold f z (x:xs) = concat $ map (\a -> permfold f (f z a) xs) x

    prepCmd :: (String, String) -> (String, String) -> (String, String)
    prepCmd (name, cmd) (namea, cmda) =
        (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

    runCCmd name compileCmd = do
        res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
        if res == ExitSuccess
            then do system ("./" ++ name)
                    return ()
            else putStrLn $ name ++ " did not compile"

    main = mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
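As the comment in the script suspects, that Cartesian-product fold is already expressible with standard machinery: sequence on the list monad enumerates the product, and a left fold applies the combining function. A drop-in equivalent (my sketch, not from the post):

```haskell
-- Equivalent to the script's hand-rolled permfold: enumerate the
-- Cartesian product of the option lists with sequence, then fold
-- each combination with the combining function.
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z xss = map (foldl f z) (sequence xss)
```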
performance math simd haskell
Jun 25 '10 at 4:28
2 answers

Roman Leshchinskiy replies:

Actually, the core looks basically fine to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between it and the C program comes down to GHC's general weakness when it comes to floating point. HEAD produces the best results with the NCG and -msse2.

First define a new Vec4 data type:

    {-# LANGUAGE BangPatterns #-}

    import Data.Vector.Storable
    import qualified Data.Vector.Storable as V
    import Foreign
    import Foreign.C.Types

    -- Define a 4 element vector type
    data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat

Make sure we can store it in an array

    instance Storable Vec4 where
      sizeOf _ = sizeOf (undefined :: CFloat) * 4
      alignment _ = alignment (undefined :: CFloat)

      {-# INLINE peek #-}
      peek p = do
                 a <- peekElemOff q 0
                 b <- peekElemOff q 1
                 c <- peekElemOff q 2
                 d <- peekElemOff q 3
                 return (Vec4 a b c d)
        where
          q = castPtr p

      {-# INLINE poke #-}
      poke p (Vec4 a b c d) = do
                 pokeElemOff q 0 a
                 pokeElemOff q 1 b
                 pokeElemOff q 2 c
                 pokeElemOff q 3 d
        where
          q = castPtr p
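A quick way to gain confidence in a hand-written Storable instance like this is a poke/peek round-trip check. The sketch below re-declares the type (with an Eq instance for the comparison) so it runs standalone; it is my own illustration, not part of the answer:

```haskell
import Foreign
import Foreign.C.Types

-- Re-declaration of Vec4 with a Storable instance of the same shape
-- as above, plus Eq/Show so the round-trip can be compared.
data Vec4 = Vec4 !CFloat !CFloat !CFloat !CFloat
  deriving (Eq, Show)

instance Storable Vec4 where
  sizeOf _    = 4 * sizeOf (undefined :: CFloat)
  alignment _ = alignment (undefined :: CFloat)
  peek p = do
    let q = castPtr p
    a <- peekElemOff q 0
    b <- peekElemOff q 1
    c <- peekElemOff q 2
    d <- peekElemOff q 3
    return (Vec4 a b c d)
  poke p (Vec4 a b c d) = do
    let q = castPtr p
    pokeElemOff q 0 a
    pokeElemOff q 1 b
    pokeElemOff q 2 c
    pokeElemOff q 3 d

-- poke then peek should give back exactly what we stored.
main :: IO ()
main = alloca $ \p -> do
  poke p (Vec4 0.2 0.1 0.6 1.0)
  v <- peek p
  print (v == Vec4 0.2 0.1 0.6 1.0)
```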

Values and operations on this type:

    a = Vec4 0.2 0.1 0.6 1.0
    m = Vec4 0.99 0.7 0.8 0.6

    add :: Vec4 -> Vec4 -> Vec4
    {-# INLINE add #-}
    add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

    mult :: Vec4 -> Vec4 -> Vec4
    {-# INLINE mult #-}
    mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

    vsum :: Vec4 -> CFloat
    {-# INLINE vsum #-}
    vsum (Vec4 a b c d) = a+b+c+d

    multList :: Int -> Vector Vec4 -> Vector Vec4
    multList !count !src
        | count <= 0 = src
        | otherwise  = multList (count-1) $ V.map (\v -> add (mult v m) a) src

    main = print
         $ Data.Vector.Storable.sum
         $ Data.Vector.Storable.map vsum
         $ multList repCount
         $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

    repCount, arraySize :: Int
    repCount = 10000
    arraySize = 20000

With ghc 6.12.1, -O2 -fasm:

  • 1.752s

With ghc HEAD (June 26), -O2 -fasm -msse2:

  • 1.708s

This seems like the most idiomatic way to write an array of Vec4, and it gets the best performance (11 times faster than your original). (And it could be a benchmark case for GHC's LLVM backend.)

Jun 28 '10 at 17:06

Good, that’s better. 3.5s instead of 14s.

    {-# LANGUAGE BangPatterns #-}
    {-
    -- multiply-add of four floats,
    Vec4f multiplier, addend;
    Vec4f vecList[];
    for (int i = 0; i < count; i++)
        vecList[i] = vecList[i] * multiplier + addend;
    -}

    import qualified Data.Vector.Storable as V
    import Data.Vector.Storable (Vector)
    import Data.Bits

    repCount, arraySize :: Int
    repCount  = 10000
    arraySize = 20000

    a, m :: Vector Float
    a = V.fromList [0.2, 0.1, 0.6, 1.0]
    m = V.fromList [0.99, 0.7, 0.8, 0.6]

    multAdd :: Int -> Float -> Float
    multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))

    go :: Int -> Vector Float -> Vector Float
    go n s | n <= 0    = s
           | otherwise = go (n-1) (f s)
      where
        f = V.imap multAdd

    main = print . V.sum $ go repCount v
      where
        v :: Vector Float
        v = V.replicate (arraySize * 4) 0
        -- ^ a flattened Vec4f[]

Which is better than it was:

    $ ghc -O2 --make A.hs
    [1 of 1] Compiling Main             ( A.hs, A.o )
    Linking A ...

    $ time ./A
    516748.13
    ./A  3.58s user 0.01s system 99% cpu 3.593 total

multAdd itself compiles to quite good Core:

    case readFloatOffAddr# rb_aVn
           (word2Int# (and# (int2Word# sc1_s1Yx) __word 3)) realWorld#
    of _ { (# s25_X1Tb, x4_X1Te #) ->
    case readFloatOffAddr# rb11_X118
           (word2Int# (and# (int2Word# sc1_s1Yx) __word 3)) realWorld#
    of _ { (# s26_X1WO, x5_X20B #) ->
    case writeFloatOffAddr# @ RealWorld a17_s1Oe sc3_s1Yz
           (plusFloat# (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)

However, your C code multiplies 4 elements at a time, so to match it we would need to do that directly, not fake it with a scalar loop and an index mask. GCC is probably unrolling the loop as well.

So to get identical performance we would need to vectorize the multiply (a bit trickier, perhaps via the LLVM backend) and unroll the loop (possibly with fusion). I'll defer to Roman here on whether there are other obvious things to try.
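For the unrolling half of that, here is one hedged sketch (my own, not from the answer): walk the flat buffer four floats at a stride, with explicit unsafe reads and writes, so the per-element index masking disappears from the inner computation. Whether this actually beats the V.imap version would need measuring.

```haskell
import qualified Data.Vector.Storable as V
import qualified Data.Vector.Storable.Mutable as M
import Control.Monad (forM_)

-- One multiply-add pass over a flattened Vec4f[] buffer, handling
-- four floats per step of the outer loop.  Assumes the buffer length
-- is a multiple of 4 and that m and a each hold the 4 components of
-- the multiplier and addend.
multAdd4 :: V.Vector Float -> V.Vector Float -> V.Vector Float -> V.Vector Float
multAdd4 m a = V.modify $ \mv ->
  forM_ [0, 4 .. M.length mv - 4] $ \j ->
    forM_ [0 .. 3] $ \k -> do
      x <- M.unsafeRead mv (j + k)
      M.unsafeWrite mv (j + k) (x * V.unsafeIndex m k + V.unsafeIndex a k)
```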

One idea might be to use a vector of Vec4 rather than the flattened float array.

Jun 25 '10 at 17:09


