Space explosion when folding over Producers/Parsers in Haskell

Suppose I have a module like this:

    module Explosion where

    import Pipes.Parse (foldAll, Parser, Producer)
    import Pipes.ByteString (ByteString, fromLazy)
    import Pipes.Aeson (DecodingError)
    import Pipes.Aeson.Unchecked (decoded)
    import Data.List (intercalate)
    import Data.ByteString.Lazy.Char8 (pack)
    import Lens.Family (view)
    import Lens.Family.State.Strict (zoom)

    produceString :: Producer ByteString IO ()
    produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]

    produceInts :: Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
    produceInts = view decoded produceString

    produceInts' :: Producer Int IO ()
    produceInts' = produceInts >> return ()

    parseBiggest :: Parser ByteString IO Int
    parseBiggest = zoom decoded (foldAll max 0 id)

The produceString function is a bytestring producer, and I want to run a parser over it to accumulate some kind of result.

The following two programs show different ways of solving the same problem: finding the maximum value in the bytestring by parsing it as a series of JSON ints.

Program 1:

    module Main where

    import Explosion (produceInts')
    import Pipes.Prelude (fold)

    main :: IO ()
    main = do
        biggest <- fold max 0 id produceInts'
        print $ show biggest

Program 2:

    module Main where

    import Explosion (parseBiggest, produceString)
    import Pipes.Parse (evalStateT)

    main :: IO ()
    main = do
        biggest <- evalStateT parseBiggest produceString
        print $ show biggest

Unfortunately, both programs peak at about 200 MB of memory when I profile them, a problem I had hoped using streaming parsers would avoid. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, and the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The memory usage graphs are shown below. Both programs spend about 70% of their time doing garbage collection.

Am I doing something wrong? Is the Prelude max function not strict enough? I can't tell whether the library functions are bad or whether I'm using the library incorrectly! (Probably the latter.)
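(A note on the strictness question: Pipes.Prelude.fold is already a strict left fold, but here is a minimal sketch of forcing the accumulator explicitly with a bang pattern, in case strictness were the issue. As the answer below shows, it isn't the fix.)

    {-# LANGUAGE BangPatterns #-}
    module Main where

    import Explosion (produceInts')
    import Pipes.Prelude (fold)

    -- max with the accumulator forced on every step
    strictMax :: Int -> Int -> Int
    strictMax !acc n = max acc n

    main :: IO ()
    main = do
        biggest <- fold strictMax 0 id produceInts'
        print biggest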

For completeness, here's a git repo that you can clone and run cabal install in if you want to see what I'm talking about first-hand, and here is the memory usage of the two programs:

[memory usage of program 1]
[memory usage of program 2]

1 answer

Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.
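For example, a minimal sketch of a chunked producer (produceChunks is an illustrative name, not part of the original repo):

    import Pipes (Producer, yield)
    import qualified Data.ByteString.Char8 as B

    -- Emit each number as its own small strict chunk instead of wrapping one
    -- huge bytestring in a single yield.
    produceChunks :: Monad m => Producer B.ByteString m ()
    produceChunks = mapM_ (\n -> yield (B.pack (show n ++ " "))) [1 :: Int .. 1000000]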

Edit: I found the error. pipes-aeson internally uses the consecutively function, defined as follows:

    consecutively parser = step
      where
        step p0 = do
            (mr, p1) <- lift $ S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
            case mr of
                Just r  -> return (Right r)
                Nothing -> do
                    (ea, p2) <- lift (S.runStateT parser p1)
                    case ea of
                        Left  e -> return (Left (e, p2))
                        Right a -> yield a >> step p2

The problematic line is the one with PB.dropWhile. This adds a quadratic blowup proportional to the number of parsed elements.

What happens is that the pipe threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element.
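Here is a small self-contained illustration of that effect (this is a sketch of the mechanism, not code from pipes-aeson):

    import Pipes
    import qualified Pipes.Prelude as P

    -- Each recursive step composes one more identity stage downstream, the way
    -- consecutively composes a new dropWhile stage after each parse. Every
    -- element must then be handed through all n stages, so draining k elements
    -- through n layers costs O(n * k) steps; with one new layer added per
    -- parsed element, that is O(k^2) overall.
    nestedCat :: Monad m => Int -> Producer a m () -> Producer a m ()
    nestedCat 0 p = p
    nestedCat n p = nestedCat (n - 1) (p >-> cat)

    main :: IO ()
    main = runEffect $ nestedCat 1000 (each [1 :: Int .. 1000]) >-> P.drain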

I've opened a GitHub issue to fix this. pipes-aeson is maintained by Renzo, and he has fixed this kind of problem before.

Edit: I've submitted a pull request to fix a second problem (you needed to use the lazy-bytestring intercalate). Now the program runs in a constant 5 KB of space for both versions; the memory graphs are below.
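A minimal sketch of that kind of fix, rebuilding the producer with the lazy-bytestring intercalate (the actual pull request isn't reproduced here):

    import Pipes (Producer)
    import Pipes.ByteString (ByteString, fromLazy)
    import qualified Data.ByteString.Lazy.Char8 as BL

    -- Build the input from many small lazy chunks rather than packing one
    -- giant String, so fromLazy can stream it chunk by chunk.
    produceString :: Producer ByteString IO ()
    produceString =
        fromLazy $ BL.intercalate (BL.pack " ") $ map (BL.pack . show) [1 :: Int .. 1000000]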

[memory usage of program 1 after the fix]
[memory usage of program 2 after the fix]
