I am writing a game in Haskell, and my current skip in the user interface involves a lot of procedural geometry generation. I am currently focusing on determining the performance of one specific operation (C-ish pseudo-code):
Vec4f multiplier, addend; Vec4f vecList[]; for (int i = 0; i < count; i++) vecList[i] = vecList[i] * multiplier + addend;
That is, the standard multiple addition of four floating bots is a view ripe for SIMD optimization.
As a result, we get the OpenGL vertex buffer, so in the end it should be flushed to a flat array C. For the same reason, calculations should probably be performed on C 'float' types.
I was looking for either a library or my own idiomatic solution to quickly do this in Haskell, but every solution I came up with seems to get about 2% performance (i.e., 50 times slower) compared to C from GCC with the correct flags . Of course, I started with Haskell a couple of weeks ago, so my experience is limited, and thatβs why I come to you guys. Can any of you offer suggestions for faster implementation of Haskell or pointers to documentation on how to write high-performance Haskell code?
Firstly, the latest Haskell solution (hours about 12 seconds). I tried the "bang-patterns" material from this SO post , but that did not affect AFAICT. Replacing "multAdd" with "(\ iv β v * 4)" meant that the execution time was reduced to 1.9 seconds, so the bitwise substance (and therefore problems with automatic optimization) did not seem to be too guilty.
{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-} import Data.Vector.Storable import qualified Data.Vector.Storable as V import Foreign.C.Types import Data.Bits repCount = 10000 arraySize = 20000 a = fromList $ [0.2::CFloat, 0.1, 0.6, 1.0] m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6] multAdd :: Int -> CFloat -> CFloat multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3)) multList :: Int -> Vector CFloat -> Vector CFloat multList !count !src | count <= 0 = src | otherwise = multList (count-1) $ V.imap multAdd src main = do print $ Data.Vector.Storable.sum $ multList repCount $ Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
Here's what I have in C. There are several #ifdefs in the code that don't allow you to compile it directly; scroll down for the test driver.
#include <stdio.h>
This script will compile and run tests using a number of gcc flag combinations. Best performance was achieved with cmath-64-native-O3-limit-vector-nocopy on my system, taking 0.22 seconds.
import System.Process import GHC.IOBase cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000") cOptions = [ [("32", "-m32"), ("64", "-m64")], [("generic", ""), ("native", "-march=native -msse4")], [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")], [("restrict", "-DMAYBE_RESTRICT=__restrict__"), ("norestrict", "-DMAYBE_RESTRICT=")], [("vector", "-DVECTOR"), ("scalar", "")], [("copy", "-DCOPY"), ("nocopy", "")] ] -- Fold over the Cartesian product of the double list. Probably a Prelude function -- or two that does this, but hey. The 'perm' referred to permutations until I realized -- that this wasn't actually doing permutations. ' permfold :: (a -> a -> a) -> a -> [[a]] -> [a] permfold fz [] = [z] permfold fz (x:xs) = concat $ map (\a -> (permfold f (fza) xs)) x prepCmd :: (String, String) -> (String, String) -> (String, String) prepCmd (name, cmd) (namea, cmda) = (name ++ "-" ++ namea, cmd ++ " " ++ cmda) runCCmd name compileCmd = do res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name) if res == ExitSuccess then do system ("./" ++ name) return () else putStrLn $ name ++ " did not compile" main = do mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions