I am parsing files in Scala and have two types of files to read:
A set of training sentences in this form:
String\tString\tInt
String\tString\tInt
String\tString\tInt
And a set of test sentences with this form:
String\tString\tString\tInt
String\tString\tString\tInt
String\tString\tString\tInt
So far I have used Either to distinguish between the formats:
def readDataSet(file: String): Option[Vector[LabeledSentence]] = {
  def getSentenceType(s: Array[String]) = s.length match {
    case 3 => Left((s(0), s(1), s(2).toInt))
    case 4 => Right((s(0), s(1), s(2), s(3).toInt))
    case _ => Right(("EOS", "EOS", "EOS", -1))
  }
  val filePath = getClass.getResource(file).getPath
  Manage(Source.fromFile(filePath)) { source =>
    val parsedTuples = source.getLines().map(s => s.split("\t"))
    for (s <- parsedTuples) {
      getSentenceType(s) match {
        case Right(("EOS", "EOS", "EOS", -1)) =>
          sentences += new LabeledSentence(lex.result(), po.result(), dep.result())
          lex.clear()
          po.clear()
          dep.clear()
        case Left(x) =>
          lex += x._1
          po += x._2
          dep += x._3
        case Right(x) =>
          lex += x._1
          po += x._2
          gold += x._3
          dep += x._4
      }
    }
    Some(sentences.result())
  }
}
Is there a better, more idiomatic way to write this code?
I removed some code that is not relevant to the question. If you want to see the full code, check out my GitHub page.
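As a point of comparison, the three kinds of line could be modelled with a small sealed trait instead of Either plus a sentinel tuple. This is only a sketch: the names Line, TrainToken, TestToken and EndOfSentence are hypothetical and do not appear in the code above, and the field names are guesses based on the builders (lex, po, gold, dep).

```scala
// Hypothetical ADT; none of these names come from the original code.
sealed trait Line
final case class TrainToken(lex: String, pos: String, dep: Int) extends Line
final case class TestToken(lex: String, pos: String, gold: String, dep: Int) extends Line
case object EndOfSentence extends Line

object Line {
  // Classify one raw line by its number of tab-separated fields.
  // As in the original, toInt throws NumberFormatException on a non-numeric last field.
  def parse(raw: String): Line = raw.split("\t") match {
    case Array(a, b, d)    => TrainToken(a, b, d.toInt)
    case Array(a, b, c, d) => TestToken(a, b, c, d.toInt)
    case _                 => EndOfSentence // anything else is treated as a sentence boundary
  }
}
```

With this shape the loop can match on TrainToken, TestToken and EndOfSentence directly, and the compiler can warn about a missing case because the trait is sealed.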
UPDATE: Following Dima's advice, I simplified my code with a Monoid. Here is the result:
val parsedTuples = source
  .getLines()
  .map(s => s.split("\t"))
  .map {
    case Array(a, b, c, d) => Tokens(a, b, c, d.toInt)
    case Array(a, b, d) => Tokens(a, b, "", d.toInt)
    case _ => Tokens()
  }
  .foldLeft((Tokens(), Vector.empty[LabeledSentence])) {
    case ((z, l), t) if t.isEmpty => (Tokens(), l :+ LabeledSentence(z))
    case ((z, l), t) => (z.append(z, t), l)
  }._2
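The definition of Tokens is not shown here, so for readers following along, this is a minimal sketch of a shape that would satisfy the calls in the fold above (Tokens(), Tokens(a, b, c, n), isEmpty and append(z, t)). The field names and the use of Vector are assumptions, not the actual implementation.

```scala
// Assumed shape of Tokens; the real definition is not shown in this post.
final case class Tokens(
    lex: Vector[String] = Vector.empty,
    pos: Vector[String] = Vector.empty,
    gold: Vector[String] = Vector.empty,
    dep: Vector[Int] = Vector.empty
) {
  // True for the identity element, which the fold also uses as a sentence boundary.
  def isEmpty: Boolean = lex.isEmpty && pos.isEmpty && gold.isEmpty && dep.isEmpty

  // Monoid-style combine with two arguments, matching the call z.append(z, t).
  def append(a: Tokens, b: Tokens): Tokens =
    Tokens(a.lex ++ b.lex, a.pos ++ b.pos, a.gold ++ b.gold, a.dep ++ b.dep)
}

object Tokens {
  // Single-token constructor matching Tokens(a, b, c, d.toInt).
  def apply(lex: String, pos: String, gold: String, dep: Int): Tokens =
    Tokens(Vector(lex), Vector(pos), Vector(gold), Vector(dep))
}
```

One thing to watch with the fold as written: the accumulated Tokens is only turned into a LabeledSentence when an empty element arrives, so if the file does not end with a separator line, the last sentence stays in the accumulator and is dropped by ._2.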