Line separation using Conduit on Windows

I'm having trouble using the conduit library to split text into lines.

The raw data I work with is unfortunately inconsistent in its line endings: the same file contains both \r\n and \n sequences.

I found the function lines in Data.Conduit.Binary, but it splits on a single byte (\n, reasonably enough), which in some cases leaves me with a trailing \r.

I understand why the current implementation works the way it does, and I'm fairly sure I could hack some kind of solution together, but the only way I can think of doing it is something like:

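import Control.Monad (unless)
import Data.Conduit (ConduitM, await, yield)
import qualified Data.Text as T

-- Assumes each upstream chunk holds a single character.
lines' :: Monad m => ConduitM T.Text T.Text m ()
lines' = loop T.empty
  where
    loop acc = do
      chunk <- await
      case chunk of
        -- End of input: flush a final line that had no terminator.
        Nothing -> unless (T.null acc) (yield acc)
        Just x  ->
          case isOver (acc `T.append` x) of
            (True,  y) -> yield y >> loop T.empty  -- emit the line, start a new one
            (False, y) -> loop y
    -- Does the accumulated text end in a line terminator?
    isOver n
      | T.takeEnd 2 n == crlf = (True, T.dropEnd 2 n)
      | T.takeEnd 1 n == lf   = (True, T.dropEnd 1 n)
      | otherwise             = (False, n)
      where crlf = T.pack "\r\n"
            lf   = T.pack "\n"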

... which seems inelegant, kludgy and terribly slow.

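What bothers me is that I append and re-check the last two characters on every iteration; it would be better if I could somehow "peek" ahead only when I hit a \r, but I can't see how to do that.

Is there a better way to do this?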


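If you want to stay with a conduit-based solution, you can use conduit-combinators. If it is acceptable to drop every \r, you can simply filter them out. Otherwise, the following seems to work: it drops a \r only when it is immediately followed by \n.

It uses slidingWindowC to look at two characters at a time and, when the window is exactly "\r\n", emits nothing for it. With those \r characters gone, the stream is handed to linesUnboundedC.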

{-# LANGUAGE TypeFamilies, FlexibleContexts #-}

import Data.Text (Text, singleton, empty)
import Data.MonoTraversable (Element, MonoFoldable)
import Conduit

main = runConduitRes $ (sourceFile "file.txt" :: Producer (ResourceT IO) Text)
                    .| linesUnboundedC'
                    .| printC

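-- | Converts a chunked input of characters into lines delimited by \n or \r\n.
-- Each two-character window contributes only its first character, so the
-- final character of the input is never emitted and inputs shorter than two
-- characters are not handled; this assumes the input ends with a newline.
linesUnboundedC'
  :: (Element a ~ Char, MonoFoldable a, Monad m) => ConduitM a Text m ()
linesUnboundedC' = concatMapC id
                .| slidingWindowC 2
                .| mapC (\cs@[c,_] -> if cs == "\r\n" then empty else singleton c)
                .| linesUnboundedC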
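
As a possible alternative to the windowing approach, here is a minimal sketch that splits on \n first and then removes one trailing \r from each line, since that is the only place the \r of a \r\n pair can end up. It assumes conduit 1.3 or later (for ConduitT and the combinators exported from Conduit) and Text input; the names stripCR, linesCRLF and splitLines are made up for illustration. One difference in behaviour: a final unterminated line that ends in a lone \r also loses that \r.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T

-- | Drop a single trailing '\r', if there is one (hypothetical helper).
stripCR :: Text -> Text
stripCR t = fromMaybe t (T.stripSuffix "\r" t)

-- | Split chunked Text on '\n', then remove the '\r' left over from "\r\n".
-- linesUnboundedC takes care of lines that span chunk boundaries.
linesCRLF :: Monad m => ConduitT Text Text m ()
linesCRLF = linesUnboundedC .| mapC stripCR

-- | Pure wrapper for trying the conduit out on a list of chunks.
splitLines :: [Text] -> [Text]
splitLines chunks = runConduitPure (yieldMany chunks .| linesCRLF .| sinkList)
```

For example, splitLines ["a\r\nb", "\nc\r", "\nd"] should give ["a","b","c","d"], including the case where a \r\n pair is split across two chunks. Each line is inspected once at its end, so nothing is re-scanned per character.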

There is also foldLines in Data.Conduit.Text, which may be worth a look.


Source: https://habr.com/ru/post/1667028/
