Using GHC 7.0.3, I can reproduce your heavy GC behavior:
$ time ./A +RTS -s %GC time 92.9% (92.9% elapsed) ./A +RTS -s 7.24s user 0.04s system 99% cpu 7.301 total
I spent 10 minutes going through the program. Here is what I did, in order:
- Set the flag GHC -H, increasing the limits in GC
- Check unpacking
- Investment improvement
- Adjust the selection area of ββthe first generation
The result is 10x acceleration and GC about 45% of the time.
In order to use the magic -H GHC magic flag, we can slightly reduce the execution time:
$ time ./A +RTS -s -H %GC time 74.3% (75.3% elapsed) ./A +RTS -s -H 2.34s user 0.04s system 99% cpu 2.392 total
Not bad!
UNPACK pragmas on Tree nodes will do nothing, so delete them.
Attaching update reduces execution time:
./A +RTS -s -H 1.84s user 0.04s system 99% cpu 1.883 total
like insert height
./A +RTS -s -H 1.74s user 0.03s system 99% cpu 1.777 total
So, although it's fast, the GC is still dominant - since we are testing the distribution, after all. One thing we can do is increase the size of the first generator:
$ time ./A +RTS -s -A200M %GC time 45.1% (40.5% elapsed) ./A +RTS -s -A200M 0.71s user 0.16s system 99% cpu 0.872 total
And increasing the rollout threshold, as JohnL suggested, helps a bit,
./A +RTS -s -A100M 0.74s user 0.09s system 99% cpu 0.826 total
which is 10 times faster than we started? Not bad.
Using ghc-gc-tune , you can see the runtime as a function of -A and -H ,

Interestingly, the best working times use very large -A values, for example.
$ time ./A +RTS -A500M ./A +RTS -A500M 0.49s user 0.28s system 99% cpu 0.776s