Why is ReversedLinesFileReader so slow?

I have a 21.6 GB file that I want to read from end to beginning, rather than from beginning to end as usual.

Reading every line of the file from beginning to end with the following code takes 1 minute 12 seconds.

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
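
The snippet above calls diff.timeFormat(), which is not part of the Kotlin standard library and is not shown in the question. A minimal sketch of what such a helper might look like, assuming it is an extension on Long that formats elapsed milliseconds as minutes and seconds (the name comes from the question, the body is a guess):

```kotlin
// Hypothetical helper: the question does not show its own timeFormat().
// Assumes the receiver is an elapsed time in milliseconds.
fun Long.timeFormat(): String {
    val totalSeconds = this / 1000      // drop the millisecond remainder
    val minutes = totalSeconds / 60
    val seconds = totalSeconds % 60
    return "$minutes min $seconds s"
}
```

With this definition, 72,000 ms formats as "1 min 12 s", which matches the forward-read timing quoted above.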

I have read that to read a file in reverse order I should use ReversedLinesFileReader from Apache Commons. To do this, I created the following extension function:

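import org.apache.commons.io.input.ReversedLinesFileReader
import java.io.File
import java.nio.charset.Charset

fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    // use {} closes the reader even if action throws
    ReversedLinesFileReader(this, Charset.defaultCharset()).use { reader ->
        // readLine() returns null once the start of the file is reached
        var line = reader.readLine()
        while (line != null) {
            action(line)
            line = reader.readLine()
        }
    }
}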

I then call it as follows. The code is the same as before, except that it calls forEachLineFromTheEndOfFile:

val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())

It took 17 minutes and 50 seconds!

  • Am I using ReversedLinesFileReader correctly?
  • I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
  • , ?

. , ( , ), XML , UTF-8, , .

, , . ( : ?) (?) , ( ?). , , , , (XML-, , ByteArrays ). , , .

, , , - , . , , , .

, , , 1 . , , , ( ), "", .

, , . , .

UPDATE:

, , 27G , 10 JSON wikidata .

MacBook Pro 2015 ( dev , 5 , , , ):

reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order:  77,564 ms =  77 secs = 1 min 17 secs

temp file count:   201
approx char count: 29,483,478,770 (line content not including line endings)
total line count:  10,050,000

, 50000 , . , , . . , . , , .

, , , , . CPU - gzipped , , gzipping . , , , , , ..

, :

package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }

            println("  flushed at $lineCount lines")
        }

        // read and break into backward-sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
                .lineSequence()
                .takeWhile { lineCount <= maxLines }.forEach { line ->
                    lineBuffer.add(line)
                    if (lineBuffer.size >= maxBufferSize) flush()
                }
        flush()

        // read backward-sorted chunks backwards
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                .forEach { line ->
                    approxCharCount += line.length
                    // a line has been read here
                }
            println("  file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed =  System.currentTimeMillis() - startTime

    println("temp file count:   $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count:  $lineCount")
    println()
    println("Elapsed:  ${elapsed}ms  ${elapsed / 1000}secs  ${elapsed / 1000 / 60}min  ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { againLineCount <= maxLines }
            .forEach { againLineCount++ }
    val againElapsed =  System.currentTimeMillis() - againStartTime
    println("Elapsed:  ${againElapsed}ms  ${againElapsed / 1000}secs  ${againElapsed / 1000 / 60}min  ")
}
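
The program above hard-codes its paths and only counts characters. The same chunk-and-reverse idea can be sketched as a reusable function. This is a minimal sketch under stated assumptions, not the answer author's code: the function name forEachLineReversedViaChunks and its parameters tempDir and chunkLines are hypothetical, and the file is assumed to be UTF-8 (the default for useLines and bufferedWriter).

```kotlin
import java.io.File

// Hypothetical helper, not part of the program above.
// Calls action on every line of source, last line first, by writing
// reversed chunks to temporary files and replaying them in reverse order.
fun forEachLineReversedViaChunks(
    source: File,
    tempDir: File,
    chunkLines: Int = 50_000,
    action: (String) -> Unit
) {
    require(chunkLines > 0) { "chunkLines must be positive" }
    val chunks = ArrayList<File>()
    val buffer = ArrayList<String>(chunkLines)

    // Write the buffered lines to a new temp file, last line first.
    fun flush() {
        if (buffer.isEmpty()) return
        val chunk = File.createTempFile("chunk-", ".tmp", tempDir)
        chunks.add(chunk) // register first so it is deleted even if writing fails
        chunk.bufferedWriter().use { writer ->
            for (i in buffer.indices.reversed()) {
                writer.write(buffer[i])
                writer.newLine()
            }
        }
        buffer.clear()
    }

    try {
        // Pass 1: forward read, split into reversed chunks.
        source.useLines { lines ->
            for (line in lines) {
                buffer.add(line)
                if (buffer.size >= chunkLines) flush()
            }
        }
        flush()

        // Pass 2: replay the chunks from last to first.
        for (chunk in chunks.asReversed()) {
            chunk.useLines { lines -> lines.forEach(action) }
        }
    } finally {
        chunks.forEach { it.delete() }
    }
}
```

A call might look like forEachLineReversedViaChunks(File("very-large-file.xml"), File("/tmp")) { line -> ... }. Memory use is bounded by chunkLines lines at a time, but the approach needs roughly as much free disk space as the original file and reads the data twice.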

:

  • Java.
  • , , .
  • , , .

Q: Am I using ReversedLinesFileReader correctly?

. (, , . , . , , 1 .)

Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?

. , , -, . SSD.

Q: , ?

. . .


, , , . () .

, , O(N^2), . ( ) , "" FilePart. , "" .


Source: https://habr.com/ru/post/1655124/

