Introduction
I use Scalaz 7's iteratees in a number of projects, primarily for processing large files. I'd like to start switching to Scalaz streams, which are designed to replace the iteratee package (which, frankly, is missing a lot of pieces and is kind of a pain to use).
Streams are based on machines (another variation on the iteratee idea), which have also been implemented in Haskell. I've used the Haskell machines library a bit, but the relationship between machines and streams isn't completely obvious (to me, at least), and the documentation for the streams library is still a little sparse.
This question is about a simple parsing task that I'd like to see implemented using streams instead of iteratees. I'll answer the question myself if nobody else beats me to it, but I'm sure I'm not the only one who's making (or at least considering) this transition, and since I need to work through this exercise anyway, I figured I might as well do it in public.
Task
Suppose I have a file containing sentences that have been tokenized and tagged with parts of speech:
    no UH
    , ,
    it PRP
    was VBD
    n't RB
    monday NNP
    . .

    the DT
    equity NN
    market NN
    was VBD
    illiquid JJ
    . .
There is one token per line, words and parts of speech are separated by a single space, and blank lines represent sentence boundaries. I want to parse this file and return a list of sentences, which we might as well represent as lists of tuples of strings:
    List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
    List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))
As usual, we want to fail gracefully if we hit invalid input or file reading exceptions, we don't want to have to worry about closing resources manually, etc.
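To pin down the expected result independently of any streaming library, here is a minimal sketch of the grouping logic over an in-memory list of lines, using only the Scala standard library. It is neither the iteratee solution nor a streams solution, and it does no file handling or resource management. The names SentenceSpec, parseToken, and parseSentences are hypothetical, and skipping empty sentences produced by consecutive blank lines is an assumption of this sketch rather than something the task states.

```scala
object SentenceSpec {
  type Token = (String, String)

  // Parse one non-blank line into a (form, pos) pair, or report the bad line.
  def parseToken(line: String): Either[String, Token] =
    line.trim.split(" ") match {
      case Array(form, pos) => Right((form, pos))
      case _                => Left("Invalid line: " + line + "!")
    }

  // Group lines into sentences, treating blank lines as sentence boundaries.
  // Assumption: consecutive blank lines do not produce empty sentences.
  def parseSentences(lines: List[String]): Either[String, List[List[Token]]] = {
    @annotation.tailrec
    def loop(
      rest: List[String],
      current: List[Token],      // tokens of the sentence in progress, reversed
      done: List[List[Token]]    // completed sentences, reversed
    ): Either[String, List[List[Token]]] =
      rest match {
        case Nil =>
          val all = if (current.isEmpty) done else current.reverse :: done
          Right(all.reverse)
        case line :: tail if line.trim.isEmpty =>
          loop(tail, Nil, if (current.isEmpty) done else current.reverse :: done)
        case line :: tail =>
          parseToken(line) match {
            case Right(token) => loop(tail, token :: current, done)
            case Left(error)  => Left(error) // stop at the first invalid line
          }
      }

    loop(lines, Nil, Nil)
  }
}
```

For example, SentenceSpec.parseSentences(List("no UH", ", ,", "", "the DT", ". .")) yields a Right holding two sentences, while any line that does not split into exactly two space-separated pieces yields a Left carrying an error message.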
Iteratee solution
First, some general file reading code (which really ought to be part of the iteratee package, which currently doesn't provide anything remotely this high-level):
    import java.io.{ BufferedReader, File, FileReader }
    import scalaz._, Scalaz._, effect.IO
    import iteratee.{ Iteratee => I, _ }

    type ErrorOr[A] = EitherT[IO, Throwable, A]

    def tryIO[A, B](action: IO[B]) = I.iterateeT[A, ErrorOr, B](
      EitherT(action.catchLeft).map(I.sdone(_, I.emptyInput))
    )

    def enumBuffered(r: => BufferedReader) = new EnumeratorT[String, ErrorOr] {
      lazy val reader = r
      def apply[A] = (s: StepT[String, ErrorOr, A]) => s.mapCont(k =>
        tryIO(IO(Option(reader.readLine))).flatMap {
          case None       => s.pointI
          case Some(line) => k(I.elInput(line)) >>== apply[A]
        }
      )
    }

    def enumFile(f: File) = new EnumeratorT[String, ErrorOr] {
      def apply[A] = (s: StepT[String, ErrorOr, A]) => tryIO(
        IO(new BufferedReader(new FileReader(f)))
      ).flatMap(reader => I.iterateeT[String, ErrorOr, A](
        EitherT(
          enumBuffered(reader).apply(s).value.run.ensuring(IO(reader.close()))
        )
      ))
    }
And then our sentence reader:
    def sentence: IterateeT[String, ErrorOr, List[(String, String)]] = {
      import I._

      def loop(acc: List[(String, String)])(s: Input[String]):
        IterateeT[String, ErrorOr, List[(String, String)]] = s(
        el = _.trim.split(" ") match {
          case Array(form, pos) => cont(loop(acc :+ (form, pos)))
          case Array("")        => cont(done(acc, _))
          case pieces =>
            val throwable: Throwable = new Exception(
              "Invalid line: %s!".format(pieces.mkString(" "))
            )

            val error: ErrorOr[List[(String, String)]] = EitherT.left(
              throwable.point[IO]
            )

            IterateeT.IterateeTMonadTrans[String].liftM(error)
        },
        empty = cont(loop(acc)),
        eof = done(acc, eofInput)
      )
      cont(loop(Nil))
    }
And finally, our parsing action:
    val action =
      I.consume[List[(String, String)], ErrorOr, List] %=
      sentence.sequenceI &=
      enumFile(new File("example.txt"))
We can demonstrate that it works:
    scala> action.run.run.unsafePerformIO().foreach(_.foreach(println))
    List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
    List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))
And we're done.
What I want
More or less the same program, implemented using Scalaz streams instead of iteratees.