Editing a large file in Scala

I am trying to modify large PostScript files in Scala (some up to 1 GB in size). A file is a sequence of batches, each of which contains a code that records the batch number, the number of pages, and so on.

I need:

  • Search the file for batch codes (which always begin at the same position on a line).
  • Count the number of pages before the next batch code.
  • Change each batch code to record how many pages are in that batch.
  • Save the new file elsewhere.

My current solution uses two iterators ( iterA and iterB ), both created from Source.fromFile("file.ps").getLines . The first iterator ( iterA ) advances in a while loop to the start of a batch code (calling iterB.next each time as well). iterB then keeps scanning ahead until the next batch code (or the end of the file), counting the pages it passes along the way. Then the batch code at iterA 's position is updated, and the process repeats.
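To make the counting requirement concrete, here is a toy sketch on in-memory data. The "BATCH" and "%%Page:" markers are made-up stand-ins, not the real codes from my files:

```scala
// Hypothetical sketch of the counting requirement on toy data.
// "BATCH" and "%%Page:" are invented markers, not real batch codes.
val lines = List("BATCH 1", "%%Page: 1", "%%Page: 2", "BATCH 2", "%%Page: 1")

// Pages following each batch code, in order of appearance
val counts: List[Int] = lines.foldLeft(List.empty[Int]) {
  case (acc, l) if l.startsWith("BATCH")   => 0 :: acc            // new batch
  case (acc, l) if l.startsWith("%%Page:") => (acc.head + 1) :: acc.tail
  case (acc, _)                            => acc                 // other lines
}.reverse
// counts == List(2, 1)
```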

This seems very un-Scala-like, and I still have not worked out a good way to save these changes to a new file.

What is a good approach to this problem? Should I abandon iterators entirely? I would prefer to do this without holding the entire input or output in memory at once.

Thanks!

3 answers

You could implement this using the Scala Stream class. I assume that you do not mind holding one "batch" in memory at a time.

    import scala.annotation.tailrec
    import scala.io._

    def isBatchLine(line: String): Boolean = ...
    def batchLine(size: Int): String = ...

    val it = Source.fromFile("in.ps").getLines
    // cannot use it.toStream here because of SI-4835
    def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)
    // Note: using `def` instead of `val` here means we don't hold
    // the entire stream in memory

    def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
      val (batch, remainder) = stream span { !isBatchLine(_) }
      if (batch.isEmpty && remainder.isEmpty) {
        Stream.empty
      } else {
        batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
      }
    }

    def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

    val ps = new java.io.PrintStream(new java.io.File("out.ps"))
    newLines foreach ps.println
    ps.close()
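To see the batching logic in action, here is a self-contained toy run. The "batch"/"p" line markers are invented for the demo (and on current Scala versions you would use LazyList in place of the deprecated Stream):

```scala
// Toy, self-contained run of the span/#::: batching above.
// "batch" marks a batch line and every other line belongs to the
// current batch -- both conventions are invented for this demo.
def isBatchLine(line: String): Boolean = line == "batch"
def batchLine(size: Int): String = s"batch $size"

def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
  val (batch, remainder) = stream.span(line => !isBatchLine(line))
  if (batch.isEmpty && remainder.isEmpty) Stream.empty
  else batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
}

val in = Stream("batch", "p", "p", "batch", "p")
val out = batchedLinesFrom(in.dropWhile(isBatchLine)).toList
// out == List("batch 2", "p", "p", "batch 1", "p")
```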

If you have not yet reached functional Scala enlightenment, I would recommend a more imperative style using java.util.Scanner#findWithinHorizon . My example is fairly naive, iterating over the input twice.

    val scanner = new Scanner(inFile)
    val writer = new BufferedWriter(...)

    def loop() = {
      // you might want to limit the horizon to prevent an OutOfMemoryError
      Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
        case Some(batch) =>
          val pageCount = countPages(batch)
          writePageCount(writer, pageCount)
          writer.write(batch)
          loop()
        case None =>
      }
    }

    loop()
    scanner.close()
    writer.close()
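A quick look at how findWithinHorizon behaves on an in-memory string may help. The BATCH-n marker is invented, and the reluctant .*? (a small change from the greedy .* above) makes each call stop at the first marker instead of swallowing everything up to the last one:

```scala
import java.util.Scanner

// Minimal demo of findWithinHorizon on an in-memory string.
// The "BATCH-n" marker is invented; horizon 0 means "no limit".
val sc = new Scanner("header\nBATCH-1\np\np\nBATCH-2\np\n")

// (?s) lets '.' match newlines; each call resumes where the last stopped
val first  = Option(sc.findWithinHorizon("(?s).*?BATCH-\\d", 0))
// first == Some("header\nBATCH-1")
val second = Option(sc.findWithinHorizon("(?s).*?BATCH-\\d", 0))
// second == Some("\np\np\nBATCH-2")
val third  = Option(sc.findWithinHorizon("(?s).*?BATCH-\\d", 0))
// third == None: no marker remains in the leftover "\np\n"
sc.close()
```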

Maybe you can use span and duplicate efficiently. Assuming the iterator is positioned at the start of a batch, you take the span before the next batch, duplicate it so that you can count the pages, write the modified batch line, and then write the pages using the duplicated iterator. Then process the next batch recursively...

    def batch(i: Iterator[String]): Unit = {
      if (i.hasNext) {
        assert(i.next() == "batch")
        val (current, next) = i.span(_ != "batch")
        val (forCounting, forWriting) = current.duplicate
        val count = forCounting.filter(_ == "p").size
        println("batch " + count)
        forWriting.foreach(println)
        batch(next)
      }
    }

Assuming the following input:

 val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n") 

You position the iterator at the beginning of the first batch, and then process the batches:

    val (head, next) = src.getLines.span(_ != "batch")
    head.foreach(println)
    batch(next)

This prints:

    head
    batch 2
    p
    p
    batch 1
    p
    batch 3
    p
    p
    p
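The same example can be made fully self-contained by collecting the output into a buffer instead of printing, which makes the result easy to check; only the output collection differs from the code above:

```scala
import scala.io.Source

// Same span/duplicate batching as above, but collecting lines into a
// buffer instead of printing, so the result can be inspected.
val out = scala.collection.mutable.ListBuffer.empty[String]

def batch(i: Iterator[String]): Unit = {
  if (i.hasNext) {
    assert(i.next() == "batch")         // consume the batch marker line
    val (current, next) = i.span(_ != "batch")
    val (forCounting, forWriting) = current.duplicate
    val count = forCounting.count(_ == "p")
    out += s"batch $count"              // batch line rewritten with page count
    forWriting.foreach(out += _)
    batch(next)
  }
}

val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")
val (head, next) = src.getLines().span(_ != "batch")
head.foreach(out += _)
batch(next)
// out.toList == List("head", "batch 2", "p", "p",
//                    "batch 1", "p", "batch 3", "p", "p", "p")
```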

Source: https://habr.com/ru/post/908638/

