I am trying to further optimize my parser and pipe-attoparsec storage, but the memory consumption problem is lower.
This account is parser.hs
{-# LANGUAGE RankNTypes #-} {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE NoImplicitPrelude #-} import Protolude hiding (for) import Data.Hashable import Data.IntMap.Strict (IntMap) import Data.Vector (Vector) import Pipes import Pipes.Parse import Pipes.Safe (MonadSafe, runSafeT) import qualified Data.Attoparsec.ByteString.Char8 as AB import qualified Data.IntMap.Strict as IM import qualified Data.Vector as Vector import qualified Pipes.Attoparsec as PA import qualified Pipes.ByteString as PB import qualified Pipes.Safe.Prelude as PSP -- accountid|account-name|contractid|code data AccountLine = AccountLine { _accountId :: !ByteString, _accountName :: !ByteString, _accountContractId :: !ByteString, _accountCode :: !Word32 } deriving (Show) type MapCodetoAccountIdIdx = IntMap Int data Accounts = Accounts { _accountIds :: !(Vector ByteString), _cache :: !(IntMap Int), _accountCodes :: !MapCodetoAccountIdIdx } deriving (Show) parseAccountLine :: AB.Parser AccountLine parseAccountLine = AccountLine <$> getSubfield <* delim <*> getSubfield <* delim <*> getSubfield <* delim <*> AB.decimal <* AB.endOfLine where getSubfield = AB.takeTill (== '|') delim = AB.char '|' -- aempty :: Accounts aempty = Accounts Vector.empty IM.empty IM.empty aappend :: Accounts -> AccountLine -> Accounts aappend (Accounts ids a2i cps) (AccountLine aid an cid cp) = case IM.lookup (hash aid) a2i of Nothing -> Accounts (Vector.snoc ids (toS aid)) (IM.insert (hash aid) (length ids) a2i) (IM.insert (fromIntegral cp) (length ids) cps) Just idx -> Accounts ids a2i (IM.insert (fromIntegral cp) idx cps) foldAccounts :: (Monad m) => Parser AccountLine m Accounts foldAccounts = foldAll aappend aempty identity readByteStringFile :: (MonadSafe m) => FilePath -> Producer' ByteString m () readByteStringFile file = PSP.withFile file ReadMode PB.fromHandle accountLines :: Text -> MonadSafe m => Producer AccountLine m (Either (PA.ParsingError, Producer ByteString m ()) ()) accountLines filename = PA.parsed parseAccountLine (readByteStringFile (toS filename)) main :: IO () main = do [filename] <- getArgs x <- runSafeT $ runEffect $ Pipes.Parse.evalStateT foldAccounts (accountLines (toS filename)) print $ sizes x sizes :: Accounts -> (Int, Int, Int) sizes (Accounts aid xxx acp) = (Vector.length aid, IM.size xxx, IM.size acp)
Compiled with GHC 8.0.2 ( stack ghc -- -O2 -rtsopts -threaded -Wall account-parser.hs )
I can not use the memory below. I have to quickly search up, therefore, IntMaps. The file is about 20 MB (and not efficient). Most of the data should be able to match 5 MB.
$ ./account-parser /tmp/accounts +RTS -s (5837,5837,373998) 1,631,040,680 bytes allocated in the heap 221,765,464 bytes copied during GC 41,709,048 bytes maximum residency (13 sample(s)) 2,512,560 bytes maximum slop 82 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 2754 colls, 0 par 0.105s 0.142s 0.0001s 0.0002s Gen 1 13 colls, 0 par 0.066s 0.074s 0.0057s 0.0216s TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.000s ( 0.001s elapsed) MUT time 0.324s ( 0.298s elapsed) GC time 0.171s ( 0.216s elapsed) EXIT time 0.000s ( 0.005s elapsed) Total time 0.495s ( 0.520s elapsed) Alloc rate 5,026,660,297 bytes per MUT second Productivity 65.5% of total user, 58.4% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0
And profile:
