Using Scalaz Stream for a parsing task (replacing Scalaz Iteratees)

Introduction

I use Scalaz 7 iteratees in a number of projects, mainly for processing large files. I would like to start switching to Scalaz streams, which are meant to replace the iteratee package (which, frankly, is missing a lot of pieces and is kind of a pain to use).

Streams are based on machines (another variation on the iteratee idea), which have also been implemented in Haskell. I have used the Haskell machines library a little, but the relationship between machines and streams is not entirely obvious (at least to me), and the documentation for the streams library is still a bit sparse.

This question is about a simple parsing task that I would like to see implemented using streams instead of iteratees. I will answer the question myself if nobody beats me to it, but I'm sure I'm not the only one who is making (or at least considering) this transition, and since I need to work through this exercise anyway, I figured I might as well do it in public.

Task

Suppose I have a file containing sentences that have been tokenized and tagged with parts of speech:

    no UH
    , ,
    it PRP
    was VBD
    n't RB
    monday NNP
    . .

    the DT
    equity NN
    market NN
    was VBD
    illiquid JJ
    . .

There is one token per line, words and parts of speech are separated by a single space, and blank lines mark sentence boundaries. I want to parse this file and return a list of sentences, which we might as well represent as lists of tuples of strings:

    List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
    List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))

As usual, we want to fail gracefully if we hit invalid input or exceptions while reading the file, we don't want to have to worry about closing resources manually, etc.
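To pin down the expected behavior independently of any streaming library, here is a minimal plain-Scala sketch of just the parsing logic. It works on lines that are already in memory, so it deliberately ignores resource handling and incremental reading, which are the parts the iteratee and stream libraries are actually for. The names `PosParser`, `parseToken`, and `parseSentences` are made up for illustration, and using `Either[String, ...]` for errors is an assumption rather than anything the task requires.

```scala
object PosParser {
  type Token = (String, String)

  // Parse one non-blank line of the form "word POS".
  def parseToken(line: String): Either[String, Token] =
    line.trim.split(" ") match {
      case Array(form, pos) => Right((form, pos))
      case _                => Left(s"Invalid line: $line")
    }

  // Group lines into sentences at blank lines, stopping at the first bad line.
  // Assumption: runs of consecutive blank lines do not produce empty sentences.
  def parseSentences(lines: Seq[String]): Either[String, List[List[Token]]] = {
    val start: Either[String, (List[List[Token]], List[Token])] = Right((Nil, Nil))

    val folded = lines.foldLeft(start) {
      // A blank line closes the sentence collected so far (if any).
      case (Right((done, current)), line) if line.trim.isEmpty =>
        Right(if (current.isEmpty) (done, current) else (current.reverse :: done, Nil))
      // Any other line must parse as a token, or the whole result becomes a Left.
      case (Right((done, current)), line) =>
        parseToken(line).map(token => (done, token :: current))
      // Once we have failed, keep the first error and skip the remaining lines.
      case (failure, _) => failure
    }

    // Flush the last sentence if the input did not end with a blank line.
    folded.map { case (done, current) =>
      (if (current.isEmpty) done else current.reverse :: done).reverse
    }
  }
}
```

Calling `PosParser.parseSentences` on the lines of the example file should give a `Right` holding the two sentences shown above, and a line that does not split into exactly two pieces gives a `Left` with a message.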

Iteratee Solution

First, some general file-reading code (which really should be part of the iteratee package, which currently does not provide anything remotely this high-level):

    import java.io.{ BufferedReader, File, FileReader }
    import scalaz._, Scalaz._, effect.IO
    import iteratee.{ Iteratee => I, _ }

    type ErrorOr[A] = EitherT[IO, Throwable, A]

    def tryIO[A, B](action: IO[B]) = I.iterateeT[A, ErrorOr, B](
      EitherT(action.catchLeft).map(I.sdone(_, I.emptyInput))
    )

    def enumBuffered(r: => BufferedReader) = new EnumeratorT[String, ErrorOr] {
      lazy val reader = r
      def apply[A] = (s: StepT[String, ErrorOr, A]) => s.mapCont(k =>
        tryIO(IO(Option(reader.readLine))).flatMap {
          case None       => s.pointI
          case Some(line) => k(I.elInput(line)) >>== apply[A]
        }
      )
    }

    def enumFile(f: File) = new EnumeratorT[String, ErrorOr] {
      def apply[A] = (s: StepT[String, ErrorOr, A]) => tryIO(
        IO(new BufferedReader(new FileReader(f)))
      ).flatMap(reader => I.iterateeT[String, ErrorOr, A](
        EitherT(
          enumBuffered(reader).apply(s).value.run.ensuring(IO(reader.close()))
        )
      ))
    }

And then our sentence reader:

    def sentence: IterateeT[String, ErrorOr, List[(String, String)]] = {
      import I._

      def loop(acc: List[(String, String)])(s: Input[String]):
        IterateeT[String, ErrorOr, List[(String, String)]] = s(
        el = _.trim.split(" ") match {
          case Array(form, pos) => cont(loop(acc :+ (form, pos)))
          case Array("")        => cont(done(acc, _))
          case pieces =>
            val throwable: Throwable = new Exception(
              "Invalid line: %s!".format(pieces.mkString(" "))
            )

            val error: ErrorOr[List[(String, String)]] = EitherT.left(
              throwable.point[IO]
            )

            IterateeT.IterateeTMonadTrans[String].liftM(error)
        },
        empty = cont(loop(acc)),
        eof = done(acc, eofInput)
      )
      cont(loop(Nil))
    }

And finally, our parsing action:

    val action =
      I.consume[List[(String, String)], ErrorOr, List] %=
      sentence.sequenceI &=
      enumFile(new File("example.txt"))

We can demonstrate that it works:

    scala> action.run.run.unsafePerformIO().foreach(_.foreach(println))
    List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
    List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))

And we're done.

What I want

More or less the same program implemented using Scalaz streams rather than iteratees.

scala iteratee scalaz scalaz-stream transducer-machines
Aug 07 '13 at 19:33
1 answer

A scalaz-stream solution:

    import scalaz.std.vector._
    import scalaz.syntax.traverse._
    import scalaz.std.string._

    val action = linesR("example.txt").map(_.trim).
      splitOn("").flatMap(_.traverseU { s => s.split(" ") match {
        case Array(form, pos) => emit(form -> pos)
        case _ => fail(new Exception(s"Invalid input $s"))
      }})

We can demonstrate that it works:

    scala> action.collect.attempt.run.foreach(_.foreach(println))
    Vector((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
    Vector((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))

And we're done.

The traverseU function is a common Scalaz combinator. In this case, it is used to traverse, in the Process monad, the Vector of lines for each sentence generated by splitOn. It is equivalent to map followed by sequence.
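The "map followed by sequence" equivalence can be seen without Scalaz at all. The sketch below hand-writes `sequence` and `traverse` for a `Vector` with `Option` as the effect. `Option` is only a stand-in here for `Process`, and `TraverseSketch` and its members are illustrative names, not part of Scalaz or scalaz-stream.

```scala
object TraverseSketch {
  // Turn a Vector of Options inside out: None if any element is None.
  def sequence[A](xs: Vector[Option[A]]): Option[Vector[A]] =
    xs.foldLeft(Option(Vector.empty[A])) { (acc, x) =>
      for (as <- acc; a <- x) yield as :+ a
    }

  // Apply f to every element and collect the results in one pass.
  def traverse[A, B](xs: Vector[A])(f: A => Option[B]): Option[Vector[B]] =
    xs.foldLeft(Option(Vector.empty[B])) { (acc, x) =>
      for (bs <- acc; b <- f(x)) yield bs :+ b
    }

  // Same splitting rule as in the answer, with None standing in for `fail`.
  def parseToken(s: String): Option[(String, String)] =
    s.split(" ") match {
      case Array(form, pos) => Some((form, pos))
      case _                => None
    }
}
```

For any vector `lines`, `TraverseSketch.traverse(lines)(TraverseSketch.parseToken)` and `TraverseSketch.sequence(lines.map(TraverseSketch.parseToken))` give the same result. The traversal just avoids building the intermediate `Vector[Option[...]]`.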

Aug 07 '13 at 22:58


