What are pipes / conduits trying to solve?

I have seen people recommending the pipes / conduit libraries for various lazy-IO-related tasks. What problem do these libraries solve exactly?

Also, when I try to use some library from Hackage, it is very likely that there are three different versions of it. For example: attoparsec, pipes-attoparsec and attoparsec-conduit.

This confuses me. For my parsing tasks, should I use attoparsec or pipes-attoparsec / attoparsec-conduit? What benefit does the pipes / conduit version have compared to plain vanilla attoparsec?

+49
pipe haskell conduit haskell-pipes
Mar 30 '14 at 8:44
3 answers

Lazy IO

Lazy IO works like this:

readFile :: FilePath -> IO ByteString 

where the ByteString is guaranteed to be read only chunk by chunk. To do this, we could (almost) write

 -- given 'readChunk' which reads a chunk beginning at offset n
 readChunk :: FilePath -> Int -> IO (Int, ByteString)

 readFile fp = readChunks 0 where
   readChunks n = do
     (n', chunk) <- readChunk fp n
     chunks      <- readChunks n'
     return (chunk <> chunks)

but here we note that the IO action readChunks n' is performed before even the partial result available as chunk is returned. This means we are not lazy at all. To combat this, we use unsafeInterleaveIO:

 readFile fp = readChunks 0 where
   readChunks n = do
     (n', chunk) <- readChunk fp n
     chunks      <- unsafeInterleaveIO (readChunks n')
     return (chunk <> chunks)

which causes readChunks n' to return immediately, deferring its IO action so that it is only performed when the resulting thunk is forced.

This is the dangerous part: by using unsafeInterleaveIO we have deferred a bunch of IO actions to non-deterministic points in the future, points that depend on how we consume our ByteString chunks.
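A tiny self-contained illustration of that deferral (the strings are just markers I made up):

 import System.IO.Unsafe (unsafeInterleaveIO)

 main :: IO ()
 main = do
   n <- unsafeInterleaveIO (putStrLn "running the deferred IO!" >> return (42 :: Int))
   putStrLn "nothing has run yet"
   print n   -- forcing n here is what finally triggers the deferred action

The "running the deferred IO!" line only appears when print forces n, after "nothing has run yet" - exactly the kind of effect-at-an-unpredictable-time that makes lazy IO hard to reason about.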

Fixing the problem with coroutines

What we would like to do is interleave a chunk-processing step between the call to readChunk and the recursive call to readChunks.

 readFileCo :: Monoid a => FilePath -> (ByteString -> IO a) -> IO a
 readFileCo fp action = readChunks 0 where
   readChunks n = do
     (n', chunk) <- readChunk fp n
     a  <- action chunk
     as <- readChunks n'
     return (a <> as)

Now we have the ability to perform an arbitrary IO action after each small chunk is loaded. This lets us do much more work incrementally, without loading the whole ByteString into memory. Unfortunately, it is not terribly compositional either - we have to build our consumption action and pass it to our ByteString producer in order for it to run.
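For instance, one way readFileCo might be invoked (a usage sketch of mine, relying on the hypothetical readChunk above): print the size of each chunk as it is read, without ever holding the whole file.

 import qualified Data.ByteString as BS

 -- The consumption action has to be built up front and handed to the producer.
 main :: IO ()
 main = readFileCo "myfile" (\chunk -> print (BS.length chunk))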

IO-based pipes

This is essentially what pipes solves: it lets us easily compose efficient coroutines. For example, we can now write our file reader as a Producer, which can be thought of as "streaming" the chunks of the file when its effect finally gets run.

 produceFile :: FilePath -> Producer ByteString IO ()
 produceFile fp = produce 0 where
   produce n = do
     (n', chunk) <- liftIO (readChunk fp n)
     yield chunk
     produce n'

Note the similarities between this code and readFileCo above - we simply replace the call to the coroutine action with yielding the chunk we have produced so far. This call to yield builds a Producer type instead of a raw IO action, which we can compose with other Pipes types in order to build a nice consumption pipeline called an Effect IO ().

All of this pipeline building is done statically, without invoking any IO actions. This is how pipes lets you write your coroutines more easily. All of the effects get triggered at once when we call runEffect in our main IO action.

 runEffect :: Effect IO () -> IO () 
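For illustration, here is a small pipeline sketch of my own (it assumes the hypothetical produceFile and readChunk above, plus the Pipes.Prelude helpers): take the first three chunks of the file and print their sizes.

 import Pipes
 import qualified Pipes.Prelude as P
 import qualified Data.ByteString as BS

 -- Compose the Producer with further Pipes into an Effect IO (),
 -- then trigger all of the deferred IO at once with runEffect.
 main :: IO ()
 main = runEffect $
          produceFile "myfile"            -- Producer ByteString IO ()
      >-> P.take 3                        -- pass through at most three chunks
      >-> P.mapM_ (print . BS.length)     -- consume: print each chunk's size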

Attoparsec

So why would you want to hook attoparsec up to pipes? Well, attoparsec is optimized for lazy parsing. If you are producing the chunks fed to an attoparsec parser in an effectful way, you would be at an impasse. You could:

  1. Use strict IO and load the entire string into memory, only to use it lazily with your parser. This is simple and predictable, but inefficient.
  2. Use lazy IO and lose the ability to reason about when your producing IO effects actually fire, causing possible resource leaks or closed-handle exceptions depending on the consumption schedule of your parsed items. This is more efficient than (1), but can easily become unpredictable; or,
  3. Use pipes (or conduit) to build a system of coroutines that includes your lazy attoparsec parser, allowing it to operate on as little input as it needs while producing parsed values as lazily as possible across the entire stream (a rough sketch of this kind of incremental feeding follows this list).
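To make option 3 concrete, here is the kind of incremental feeding that pipes-attoparsec and attoparsec-conduit automate for you, sketched with nothing but attoparsec's own continuation-based interface (the nextChunk argument is a hypothetical effectful chunk source that returns an empty ByteString at end of input; attoparsec itself ships a parseWith helper along these lines):

 import Data.Attoparsec.ByteString (IResult (..), Parser, parse)
 import Data.ByteString (ByteString)

 -- Drive an attoparsec parser by pulling chunks from an effectful source
 -- only as the parser demands them.
 feedFrom :: IO ByteString -> Parser a -> IO (Either String a)
 feedFrom nextChunk p = nextChunk >>= go . parse p
   where
     go (Partial k)    = nextChunk >>= go . k   -- parser wants more input
     go (Done _ r)     = return (Right r)       -- enough input has been consumed
     go (Fail _ _ err) = return (Left err)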
+60
Mar 30 '14 at 16:35

If you want to use attoparsec, use attoparsec

For my parsing tasks, should I use attoparsec or pipes-attoparsec / attoparsec-conduit?

Both pipes-attoparsec and attoparsec-conduit transform a given attoparsec Parser into a sink / conduit or pipe. That means you have to use attoparsec either way.

What benefit does the pipes / conduit version have compared to plain vanilla attoparsec?

They work with pipes and conduit, where the vanilla one will not (at least not out of the box).

If you do not use conduit or pipes, and you are satisfied with the current performance of your lazy IO, there is no need to change your current flow - especially if you are not writing a big application or processing large files. You can simply use attoparsec.
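For instance, a minimal plain-attoparsec sketch over lazy IO (the file name and the "width,height" format are made up for illustration; it uses attoparsec's lazy ByteString interface):

 import Data.Attoparsec.ByteString.Char8 (Parser, char, decimal)
 import qualified Data.Attoparsec.ByteString.Lazy as AL
 import qualified Data.ByteString.Lazy as BL

 -- Hypothetical format: a single "width,height" pair such as "640,480".
 pair :: Parser (Int, Int)
 pair = (,) <$> decimal <* char ',' <*> decimal

 main :: IO ()
 main = do
   input <- BL.readFile "size.txt"               -- lazy IO: read on demand
   print (AL.eitherResult (AL.parse pair input))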

However, that assumes you know the drawbacks of lazy IO.

What about lazy IO? (A problem study with withFile)

Do not forget your first question:

What problem do these libraries solve?

They solve the streaming-data problem (see 1 and 3), which comes up in functional languages with lazy IO. Lazy IO sometimes gives you something other than what you want (see the example below), and sometimes it is hard to determine the actual system resources needed by a particular lazy operation (is the data read / written in chunks / bytes / buffered / on close / on open ...).

Example for laziness

 import System.IO

 main = withFile "myfile" ReadMode hGetContents
        >>= return . (take 5)
        >>= putStrLn

It does not print anything, since the data is only evaluated in putStrLn, but the handle has already been closed at that point.

Fighting laziness with strictness

While the following snippet fixes the problem by forcing the whole contents before the handle is closed, it has another nasty property:

 main = withFile "myfile" ReadMode $ \handle -> do
          content <- hGetContents handle
          -- force the entire contents before the handle is closed
          length content `seq` putStrLn (take 5 content)

In this case hGetContents reads the entire file, something you probably did not expect. If you just want to check the magic bytes of a file that could be several GB in size, this is not the way to go.
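As an aside, if checking a few magic bytes really is all you need, a small bounded strict read avoids the whole issue (my own sketch; the PNG signature check is just an illustration):

 import System.IO
 import qualified Data.ByteString as BS

 -- Read only the first four bytes, no matter how large the file is.
 isPng :: FilePath -> IO Bool
 isPng fp = withFile fp ReadMode $ \h -> do
   magic <- BS.hGet h 4
   return (magic == BS.pack [0x89, 0x50, 0x4E, 0x47])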

Use withFile

The solution is, obviously, to take the things within the withFile context:

 main = withFile "myfile" ReadMode $ \handle ->
          fmap (take 5) (hGetContents handle) >>= putStrLn

This, by the way, is also the solution mentioned by the author of pipes:

This [..] answers a question people sometimes ask me about pipes, which I will paraphrase here:

If resource management is not the main focus of pipes, why should I use pipes instead of lazy IO?

Many people who ask that question discovered stream programming through Oleg, who framed the lazy IO problem in terms of resource management. However, I never found that argument convincing in isolation; you can solve most resource management issues simply by separating resource acquisition from the lazy IO, like so: [see the last example above]

This brings us back to my previous statement:

You can simply use attoparsec [...] [with lazy IO, provided] that you know the drawbacks of lazy IO.

References

+18
Mar 30 '14 at 9:22

Here's a great podcast with the authors of both libraries:

http://www.haskellcast.com/episode/006-gabriel-gonzalez-and-michael-snoyman-on-pipes-and-conduit/

It will answer most of your questions.




In short, both of these libraries tackle the problem of streaming, which is very important when working with IO. In essence, they manage the transfer of data in chunks, which allows you, for example, to transfer a 1 GB file while using only 64 KB of RAM, on both the server and the client. Without streaming, you would have to allocate that much memory at both ends.
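As a rough sketch of what that looks like with pipes (assuming pipes-bytestring's fromHandle and toHandle; conduit has direct analogues), here is a file copy whose memory use stays bounded by the chunk size regardless of the file size:

 import Pipes
 import qualified Pipes.ByteString as PB
 import System.IO

 -- Stream one file into another chunk by chunk.
 main :: IO ()
 main =
   withFile "input.bin"  ReadMode  $ \hIn  ->
   withFile "output.bin" WriteMode $ \hOut ->
     runEffect $ PB.fromHandle hIn >-> PB.toHandle hOut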

An older alternative to these libraries is lazy IO, but it is riddled with problems and makes applications error-prone. Those issues are discussed in the podcast.

As for which of these libraries to use, it is largely a matter of taste. I prefer pipes. The detailed differences are discussed in the podcast as well.

+13
Mar 30 '14 at 9:22


