The problem was that the garbage collector had to traverse mutable arrays of pointers ("boxed arrays") looking for pointers to data that might be ready to be freed. Boxed, mutable arrays are the main mechanism for implementing a hash table, so that particular structure exhibited the GC traversal issue. This is common to many languages. The symptom is excessive garbage collection (up to 95% of time spent in the GC).
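For a concrete picture, here is a minimal sketch of the kind of structure that triggered the problem: a boxed, mutable array, where every slot is a pointer the collector must chase. (The IOArray and element types here are just illustrative choices, not anything from the original discussion.)

import Data.Array.IO

main :: IO ()
main = do
  -- Each slot of a boxed IOArray is a pointer to a heap object, so
  -- once the array had been mutated, the pre-fix GC had to rescan
  -- all of it on every collection.
  arr <- newArray (0, 999999) Nothing :: IO (IOArray Int (Maybe String))
  writeArray arr 0 (Just "hello")
  print =<< readArray arr 0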
The fix was to implement card marking in the GC for mutable arrays of pointers, which landed at the end of 2009. You should no longer see excessive GC when using mutable pointer arrays in Haskell. In simple benchmarks, hash table insertion for large hash tables improved by 10x.
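Roughly, card marking means the collector divides a mutable array into fixed-size "cards" and records, on each write, which card was touched, so a later GC rescans only the dirty cards instead of the whole array. Here is a toy model of that idea, with a made-up card size; GHC's real implementation lives in the RTS and works quite differently at the machine level.

import Data.Array.IO
import Data.IORef
import qualified Data.IntSet as IS

cardSize :: Int
cardSize = 128  -- elements per card; a hypothetical value for illustration

-- A boxed mutable array paired with the set of dirty card indices.
data Marked a = Marked (IOArray Int a) (IORef IS.IntSet)

writeMarked :: Marked a -> Int -> a -> IO ()
writeMarked (Marked arr dirty) i x = do
  writeArray arr i x
  -- Record which card was touched; a collector would rescan only
  -- the dirty cards for pointers, then clear the set.
  modifyIORef' dirty (IS.insert (i `div` cardSize))

main :: IO ()
main = do
  arr <- newArray (0, 1023) 0 :: IO (IOArray Int Int)
  d   <- newIORef IS.empty
  writeMarked (Marked arr d) 500 42
  print . IS.toList =<< readIORef d  -- prints [3]: only one card is dirty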
Note that the GC traversal issue does not affect purely functional structures, nor unboxed arrays (like most data parallel arrays, or vector-like arrays, in Haskell). Nor does it affect hash tables stored on the C heap (like judy). Meaning that it didn't affect day-to-day Haskellers not using imperative hash tables.
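To illustrate why unboxed arrays are exempt: they store raw values rather than pointers, so there is nothing in them for the collector to traverse. A minimal sketch, assuming the vector package:

import qualified Data.Vector.Unboxed.Mutable as UM

main :: IO ()
main = do
  -- An unboxed mutable vector stores raw Ints rather than pointers
  -- to heap objects, so the GC never scans its contents.
  v <- UM.new 1000000 :: IO (UM.IOVector Int)
  UM.write v 0 42
  print =<< UM.read v 0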
If you are using hash tables in Haskell, you shouldn't observe any issue now. Here, for example, is a simple hash table program that inserts 10 million ints into a hash. I'll do the benchmarking, since the original quote didn't present any code or benchmarks.
import Control.Monad
import qualified Data.HashTable as H
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  m <- H.new (==) H.hashInt
  forM_ [1..size] $ \n -> H.insert m n n
  v <- H.lookup m 100
  print v
With GHC 6.10.2, before the fix, inserting 10M ints:
$ time ./A 10000000 +RTS -s
...
47s.
With GHC 6.13, after the fix:
./A 10000000 +RTS -s
...
8s
Increasing the default heap area:
./A 10000000 +RTS -s -A2G
...
2.3s
Avoiding hash tables and using IntMap:
import Control.Monad
import Data.List
import qualified Data.IntMap as I
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  let k = foldl' (\m n -> I.insert n n m) I.empty [1..size]
  print $ I.lookup 100 k
And we get:
$ time ./A 10000000 +RTS -s
...
6s
Or, alternatively, using a judy array (which is a Haskell wrapper calling C code through the foreign function interface):
import Control.Monad
import Data.List
import System.Environment
import qualified Data.Judy as J

main = do
  [size] <- fmap (fmap read) getArgs
  j <- J.new :: IO (J.JudyL Int)
  forM_ [1..size] $ \n -> J.insert (fromIntegral n) n j
  print =<< J.lookup 100 j
Running it:
$ time ./A 10000000 +RTS -s
...
2.1s
So, as you can see, the GC issue with hash tables has been fixed, and there have always been other libraries and data structures that were perfectly suitable. In short, this is a non-issue.
Note: as of 2013, you should probably just use the hashtables package, which natively supports a range of mutable hash tables.
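For instance, a minimal sketch of the same 10M-insert test against the hashtables package might look like this (using its BasicHashTable; the package also ships cuckoo and linear-probing variants):

import qualified Data.HashTable.IO as H

main :: IO ()
main = do
  -- BasicHashTable is one of several implementations exported
  -- by Data.HashTable.IO in the hashtables package.
  m <- H.new :: IO (H.BasicHashTable Int Int)
  mapM_ (\n -> H.insert m n n) [1 .. 10000000 :: Int]
  print =<< H.lookup m 100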