Reading large lines from a huge file without buffering

I was wondering if there is an easy way to get lines one at a time from a file, without eventually loading the entire file into memory. I would like to fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine, and that blows up my memory. I later read that it eventually reads in the whole file anyway.

I also tried using pipes-text with folds and view lines:

    s <- Pipes.sum $
        folds (\i _ -> i + 1) 0 id (view Text.lines (Text.fromHandle handle))
    print s

just to count the number of lines, and it seems to do some strange things, failing with "hGetChunk: invalid argument (invalid byte sequence)", and it takes 11 minutes where wc -l takes 1 minute. I heard that pipes-text can have problems with gigantic lines? (Each line is about 1 GB.)

I am really open to any suggestions; I cannot find much by searching, except for beginner readLine how-tos.

Thanks!

+5
2 answers

The following code uses conduit and will:

  • UTF-8-decode standard input
  • Run the lineC combinator as long as more data is available
  • For each line, simply yield the value 1 and discard the line's contents, without ever reading the entire line into memory at once
  • Sum up the 1s and print the result

You can replace yield 1 code with something that will process single lines.

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit

    main :: IO ()
    main = (runConduit
         $ stdinC
        .| decodeUtf8C
        .| peekForeverE (lineC (yield (1 :: Int)))
        .| sumC) >>= print
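
For example, here is a minimal sketch of such a replacement (my own variation, not part of the answer): instead of yielding 1, fold each line with lengthCE to yield its character count, and report the longest line. lengthCE and maximumC are ordinary conduit-combinators functions; everything else is unchanged from the script above.

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit

    main :: IO ()
    main = do
      -- Each line is still consumed chunk by chunk: lengthCE folds the chunks
      -- into a character count, which is yielded in place of the constant 1.
      longest <- runConduit
               $ stdinC
              .| decodeUtf8C
              .| peekForeverE (lineC (lengthCE >>= yield))
              .| maximumC
      print (longest :: Maybe Int)
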
+7

This is probably the simplest way: a fold over the decoded text stream.

    {-# LANGUAGE BangPatterns #-}
    import Pipes
    import qualified Pipes.Prelude as P
    import qualified Pipes.ByteString as PB
    import qualified Pipes.Text.Encoding as PT
    import qualified Control.Foldl as L
    import qualified Control.Foldl.Text as LT

    main = do
      n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
      print n

It takes about 14% longer than wc -l on a file I created, which was just long lines of commas and digits. The IO should properly be done with Pipes.ByteString, as its documentation indicates; the rest are conveniences of various sorts.

You can map an attoparsec parser over each line, delimited with view lines, but keep in mind that an attoparsec parser can accumulate as much of the text as it pleases, and that might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (for example, numbers separated by whitespace), you can use Pipes.Attoparsec.parsed to stream them, as in the sketch below.
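
A minimal sketch of that last suggestion (mine, not from the answer), assuming the input is nothing but ASCII integers separated by whitespace: Pipes.Attoparsec.parsed applies the parser repeatedly to the byte stream and yields each result as it is produced, so only one parsed value needs to be in memory at a time.

    import Pipes
    import qualified Pipes.Prelude as P
    import qualified Pipes.ByteString as PB
    import qualified Pipes.Attoparsec as PA
    import qualified Data.Attoparsec.ByteString.Char8 as A

    -- One whitespace-separated integer; this is the assumed input format.
    number :: A.Parser Int
    number = A.skipSpace *> A.decimal

    main :: IO ()
    main = do
      -- parsed stops at the first input the parser cannot match (including
      -- end of input); void discards that leftover information here.
      total <- P.sum (void (PA.parsed number PB.stdin))
      print total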

+3
