An idiomatic way to parse a file in Scala

I am parsing a file in Scala. I have two types of files to read:

A set of training sentences in this form:

String\tString\tInt
String\tString\tInt
// ...
String\tString\tInt

And a set of test sentences with this form:

String\tString\tString\tInt
String\tString\tString\tInt
// ...
String\tString\tString\tInt

So far I have used Either to distinguish between the formats:

  def readDataSet(file: String): Option[Vector[LabeledSentence]] = {

    def getSentenceType(s: Array[String]) = s.length match {
      case 3 => Left((s(0), s(1), s(2).toInt))
      case 4 => Right((s(0), s(1), s(2), s(3).toInt))
      case _ => Right(("EOS", "EOS", "EOS", -1))
    }

    val filePath = getClass.getResource(file).getPath

    Manage(Source.fromFile(filePath)) { source =>

      val parsedTuples = source getLines() map (s => s.split("\t"))

      // ..........

      // Go through each token in the file and construct a sentence
      for (s <- parsedTuples) {
        getSentenceType(s) match {
          // When reaching the end of the sentence, save it
          case Right(("EOS", "EOS", "EOS", -1)) =>
            sentences += new LabeledSentence(lex.result(), po.result(), dep.result())
            lex.clear()
            po.clear()
            dep.clear()
          //            if (isTrain) gold.clear()
          case Left(x) =>
            lex += x._1
            po += x._2
            dep += x._3
          case Right(x) =>
            lex += x._1
            po += x._2
            gold += x._3
            dep += x._4
        }
      }
      Some(sentences.result())
    }
  } 

Is there a better / idiomatic way to simplify this code?

I removed a piece of code that is not important for this question. If you want to see the full code, check out my GitHub page.

UPDATE: Following Dima's advice, I simplified my code with a Monoid. Here is the result:

val parsedTuples = source
    .getLines()
    .map(s => s.split("\t"))
    .map {
      case Array(a, b, c, d) => Tokens(a, b, c, d.toInt)
      case Array(a, b, d) => Tokens(a, b, "", d.toInt)
      case _ => Tokens() // Read end of sentence
    }.foldLeft((Tokens(), Vector.empty[LabeledSentence])) {
    // When reading an end of sentence, create a new Labeled sentence with tokens
    case ((z, l), t) if t.isEmpty => (Tokens(), l :+ LabeledSentence(z))
    // Accumulate tokens of the sentence
    case ((z, l), t) => (z append(z, t), l)
  }._2
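
For readers who want to run the fold above, here is a minimal self-contained sketch. The post does not show the definitions of `Tokens` and `LabeledSentence`, so the versions below (including the `single` helper, the field names, and the wrapper object `TokensSketch`) are assumptions made for illustration only. The final flush of a trailing sentence is also an addition that the snippet above does not have.

```scala
object TokensSketch {
  // Hypothetical accumulator: one Vector per column of the input file.
  final case class Tokens(
      lex: Vector[String] = Vector.empty,
      pos: Vector[String] = Vector.empty,
      gold: Vector[String] = Vector.empty,
      dep: Vector[Int] = Vector.empty
  ) {
    def isEmpty: Boolean = lex.isEmpty

    // Monoid-style combine; Tokens() acts as the identity element.
    def append(other: Tokens): Tokens =
      Tokens(lex ++ other.lex, pos ++ other.pos, gold ++ other.gold, dep ++ other.dep)
  }

  object Tokens {
    // Hypothetical helper: wrap one parsed line as a single-token value.
    def single(lex: String, pos: String, gold: String, dep: Int): Tokens =
      Tokens(Vector(lex), Vector(pos), Vector(gold), Vector(dep))
  }

  // Hypothetical stand-in for the asker's LabeledSentence.
  final case class LabeledSentence(tokens: Tokens)

  def parseSentences(lines: Iterator[String]): Vector[LabeledSentence] = {
    val (pending, done) = lines
      .map(_.split("\t"))
      .map {
        case Array(a, b, c, d) => Tokens.single(a, b, c, d.toInt)
        case Array(a, b, d)    => Tokens.single(a, b, "", d.toInt)
        case _                 => Tokens() // end-of-sentence marker
      }
      .foldLeft((Tokens(), Vector.empty[LabeledSentence])) {
        // End of sentence: emit the accumulated tokens and start over
        case ((acc, out), t) if t.isEmpty => (Tokens(), out :+ LabeledSentence(acc))
        // Otherwise keep accumulating tokens of the current sentence
        case ((acc, out), t) => (acc.append(t), out)
      }
    // Addition not in the original snippet: keep a trailing sentence
    // that is not followed by an end-of-sentence line.
    if (pending.isEmpty) done else done :+ LabeledSentence(pending)
  }
}
```

Calling `TokensSketch.parseSentences(source.getLines())` would then replace the inline pipeline. Three-column lines get an empty string in the `gold` column, as in the snippet above.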
1 answer

You do not need Either. Always use a 4-tuple:

  source
    .getLines()
    .map(_.split("\\t"))
    .map {
      case Array(a, b, c, d) => Some((a, b, c, d.toInt))
      case Array(a, b, d) => Some((a, b, "", d.toInt))
      case _ => None // end of sentence
    }.foldLeft((List.empty[LabeledSentence], List.empty[String], List.empty[String], List.empty[String], List.empty[Int])) {
      // End of sentence: emit it and reset the accumulators
      case ((l, lex, po, gold, dep), None) =>
         (new LabeledSentence(lex.reverse, po.reverse, gold.reverse, dep.reverse) :: l, List(), List(), List(), List())
      // Token line: prepend each field to its accumulator
      case ((l, lex, po, gold, dep), Some((a, b, c, d))) =>
         (l, a :: lex, b :: po, c :: gold, d :: dep)
    }._1.reverse

Or, better yet, combine lex, po, gold, dep into a single value (a case class, or perhaps LabeledSentence itself?).
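
One way to read the suggestion of grouping the four fields is sketched below: each line becomes one value, and the fold carries a single list of the current sentence's tokens instead of four parallel lists. The names `Token`, `Sentence`, `parseLine`, `sentences`, and the wrapper object `TokenFoldSketch` are hypothetical and chosen only for this illustration. `Sentence` stands in for `LabeledSentence`, whose real definition is not shown here.

```scala
object TokenFoldSketch {
  // Hypothetical row type: one parsed line of the file.
  final case class Token(lex: String, po: String, gold: String, dep: Int)

  // Hypothetical stand-in for LabeledSentence.
  final case class Sentence(tokens: List[Token])

  // Some(token) for a 3- or 4-column line, None for anything else
  // (treated as the end-of-sentence marker, as in the fold above).
  def parseLine(line: String): Option[Token] = line.split("\\t") match {
    case Array(a, b, c, d) => Some(Token(a, b, c, d.toInt))
    case Array(a, b, d)    => Some(Token(a, b, "", d.toInt))
    case _                 => None
  }

  def sentences(lines: Iterator[String]): List[Sentence] = {
    val (done, _) =
      lines.map(parseLine).foldLeft((List.empty[Sentence], List.empty[Token])) {
        // End of sentence: emit the tokens in reading order and reset
        case ((acc, cur), None) => (Sentence(cur.reverse) :: acc, Nil)
        // Token line: prepend in constant time, reverse once at the end
        case ((acc, cur), Some(token)) => (acc, token :: cur)
      }
    // As in the 4-tuple version, tokens after the last end-of-sentence
    // line are dropped.
    done.reverse
  }
}
```

With this shape the accumulator is a pair rather than a 5-tuple, and adding a column only touches `Token` and `parseLine`.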



Source: https://habr.com/ru/post/1657520/

