Scala, read the file, process the lines and write the output to a new file using the simultaneous (akka), asynchronous APIs (nio2)

1: I had a problem processing a large text file - 10Gigs +

The single thread solution is as follows:

val writer = new PrintWriter(new File(output.getOrElse("output.txt")));
for(line <- scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines())
{
  writer.println(DigestUtils.HMAC_SHA_256(line))
}
writer.close()

2: I tried parallel processing using

val futures = scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines
               .map{ s => Future{ DigestUtils.HMAC_SHA_256(s) } }.to
val results = futures.map{ Await.result(_, 10000 seconds) }

This causes the GC limit to be exceeded (see Appendix A for stacktrace)

3: I tried using Akka IO with AsynchronousFileChannel after https://github.com/drexin/akka-io-file I can read the file in byte blocks using FileSlurp, but could not find a solution to read the file line by line, which is requirement.

Any help would be greatly appreciated. Thank.

APPENDIX A

[error] (run-main) java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.nio.CharBuffer.wrap(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.s
cala:67)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:
48)
        at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:7
16)
        at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:6
92)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at com.test.Twitterhashconcurrentcli$.doConcurrent(Twitterhashconcu
rrentcli.scala:35)
        at com.test.Twitterhashconcurrentcli$delayedInit$body.apply(Twitter
hashconcurrentcli.scala:62)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:
12)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.generic.TraversableForwarder$class.foreach(Traversab
leForwarder.scala:32)
        at scala.App$class.main(App.scala:71)
+4
3

, . , , , , , , , OOM. - . , (, Iterator, scala.io.Source.fromX) , work-pulling . , , , (, , , , ).

. , :

import akka.actor._
import akka.routing.RoundRobinLike
import akka.routing.RoundRobinRouter
import scala.io.Source
import akka.routing.Broadcast

object FileReadMaster{
  case class ProcessFile(filePath:String)
  case class ProcessLines(lines:List[String], last:Boolean = false)
  case class LinesProcessed(lines:List[String], last:Boolean = false)

  case object WorkAvailable
  case object GimmeeWork
}

class FileReadMaster extends Actor{
  import FileReadMaster._

  val workChunkSize = 10
  val workersCount = 10

  def receive = waitingToProcess

  def waitingToProcess:Receive = {
    case ProcessFile(path) =>
      val workers = (for(i <- 1 to workersCount) yield context.actorOf(Props[FileReadWorker])).toList
      val workersPool = context.actorOf(Props.empty.withRouter(RoundRobinRouter(routees = workers)))
      val it = Source.fromFile(path).getLines
      workersPool ! Broadcast(WorkAvailable)
      context.become(processing(it, workersPool, workers.size))

      //Setup deathwatch on all
      workers foreach (context watch _)
  }

  def processing(it:Iterator[String], workers:ActorRef, workersRunning:Int):Receive = {
    case ProcessFile(path) => 
      sender ! Status.Failure(new Exception("already processing!!!"))


    case GimmeeWork if it.hasNext =>
      val lines = List.fill(workChunkSize){
        if (it.hasNext) Some(it.next)
        else None
      }.flatten

      sender ! ProcessLines(lines, it.hasNext)

      //If no more lines, broadcast poison pill
      if (!it.hasNext) workers ! Broadcast(PoisonPill)

    case GimmeeWork =>
      //get here if no more work left

    case LinesProcessed(lines, last) =>
      //Do something with the lines

    //Termination for last worker
    case Terminated(ref)  if workersRunning == 1 =>
      //Done with all work, do what you gotta do when done here

    //Terminared for non-last worker
    case Terminated(ref) =>
      context.become(processing(it, workers, workersRunning - 1))

  }
}

class FileReadWorker extends Actor{
  import FileReadMaster._

  def receive = {
    case ProcessLines(lines, last) => 
      sender ! LinesProcessed(lines.map(_.reverse), last)
      sender ! GimmeeWork

    case WorkAvailable =>
      sender ! GimmeeWork
  }
}

, . , . , . , , , , . , .

, , , , . , , .

+13

, , , ( List.to). , OOME.

, , . ( ): . , DigestUtils.HMAC_SHA_256(s) , . . , , - , : , . , (, 1000 ) ArrayBlockingQueue (, 1000). , , . take.

"output.txt", . , , .

+2

:)

- . , Akka, LineProcessor, :

val processor = system.actorOf(Props(new LineProcessor))

val src = scala.io.Source.fromFile(file.getOrElse("data.txt"))

src.getLines.foreach(line => processor ! line)  

LineProcessor :

class LineProcessor extends Actor {
  def receive {
    case line => // process the line
  }
}    

The trick is that with actors, you can scale horizontally quite easily. Just wrap the LineProcessor actor inside the router ...

// this will create 10 workers to process your lines simultaneously
val processor = system.actorOf(Props(new LineProcessor).withRouter(RoundRobinRouter(10))

One thing worth mentioning: if you need to write lines somewhere with a stored order, it gets a little more complicated. =) (when reading a line from a file, you also need to write its number, and when you write it, you need to coordinate the work of all employees)

+1
source

Source: https://habr.com/ru/post/1531607/


All Articles