The main problem is that
    runAffineTransform affTr (!x, !y) =
      ( get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2
      , get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5 )
creates a pair of thunks. The components are not evaluated when runAffineTransform is called; they remain thunks until some consumer demands their values.
    testAffineTransformSpeed affTr count = go count (0.5, 0.5)
      where
        go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)
is not that consumer: the bang on res evaluates it only to its outermost constructor, (,), so the components stay unevaluated and the result you get is
    runAffineTransform affTr (runAffineTransform affTr (runAffineTransform affTr (...)))
which is evaluated only at the very end, when the normal form is finally demanded.
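To see that a bang on a pair stops at the constructor, here is a minimal sketch. The names lazyStep and whnfOnly are made up for illustration, and lazyStep is only a stand-in for runAffineTransform with arbitrary coefficients. Both components are undefined, yet forcing the pair succeeds, because the bang never looks inside the (,).

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical stand-in for runAffineTransform: it returns the (,)
-- constructor right away and leaves both components as thunks.
lazyStep :: (Double, Double) -> (Double, Double)
lazyStep (x, y) = (2 * x + y, x - y)

-- The bang forces the argument to weak head normal form only,
-- i.e. up to the outermost (,) constructor.
whnfOnly :: (Double, Double) -> Bool
whnfOnly !_ = True

main :: IO ()
main =
  -- The components would crash if they were ever evaluated,
  -- but the bang never touches them, so this prints True.
  print (whnfOnly (lazyStep (undefined, undefined)))
```

Replacing the wildcard with a pattern such as (!a, !b) would make the same call throw, since the components would then be demanded.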
If you force the components of the result immediately,
    runAffineTransform affTr (!x, !y) =
      case ( get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2
           , get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5 ) of
        (!a, !b) -> (a, b)
and let it be inlined, the main difference from jtobin's version, which uses a custom strict pair of unboxed Double#s, is that the loop in testAffineTransformSpeed gets one initial iteration that takes boxed Double arguments, and at the end the components of the result are boxed again, which adds some constant overhead (about 5 nanoseconds per loop on my box). The main part of the loop takes an Int# and two Double# arguments in both cases, and the loop body is identical except for the boxing when n = 0 is reached.
Of course, it is nicer to force immediate evaluation of the components by using a strict pair type with unpacked fields.
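A rough sketch of what such a type could look like follows. The names P, Affine, applyAffine and iterateAffine are hypothetical, and Affine is a plain record of six coefficients used here in place of the unboxed vector so that the example needs nothing beyond base. It is not jtobin's actual code.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical strict pair. The strict fields mean that building a P
-- evaluates both components; UNPACK asks GHC to store the raw doubles
-- directly in the constructor instead of pointers to boxed Doubles.
data P = P {-# UNPACK #-} !Double {-# UNPACK #-} !Double
  deriving (Eq, Show)

-- Hypothetical coefficient record standing in for the unboxed vector.
data Affine = Affine !Double !Double !Double !Double !Double !Double

applyAffine :: Affine -> P -> P
applyAffine (Affine a b c d e f) (P x y) =
  P (a * x + b * y + c) (d * x + e * y + f)

-- Same loop shape as testAffineTransformSpeed; assumes count >= 0.
iterateAffine :: Affine -> Int -> P
iterateAffine t count = go count (P 0.5 0.5)
  where
    go :: Int -> P -> P
    go 0 !res = res
    go n !res = go (n - 1) (applyAffine t res)

main :: IO ()
main =
  -- Translating (0.5, 0.5) by (1, 2) three times prints: P 3.5 6.5
  print (iterateAffine (Affine 1 0 1 0 1 2) 3)
```

Here the bang on res forces the P constructor on every iteration, and the strict fields take care of the components, so no chain of thunks can build up.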