Slow performance when using BufferedReader

I am processing several text files line by line using BufferedReader.readLine().

Two of the files are about 130 MB each, but one takes 40 seconds to process and the other takes 75 seconds.

I noticed that one file has 1.8 million lines and the other 2.1 million. But when I tried to process a file of the same size with 3.0 million lines, it took 30 minutes.

So my questions are:

  • Is this behavior caused by the time spent searching the buffer? (I would like to know how BufferedReader actually works, i.e. how it reads a file line by line.)

  • Is there any way to read a file line by line faster?

Here is some more detailed information.

I split each line into three parts using a regular expression and then, using SSTableSimpleUnsortedWriter (provided by Cassandra), write them to a file as key, column and value. After 16 MB of data has been processed, it is flushed to disk.

The processing logic is the same for all files, yet one file that is 330 MB in size but has fewer than about 1 million lines is processed in 30 seconds. What could be the reason?

deviceWriter = new SSTableSimpleUnsortedWriter(directory, keyspace, "Devices", UTF8Type.instance, null, 16);
Pattern pattern = Pattern.compile("[\\[,\\]]");

while ((line = br.readLine()) != null) {
    // split the line into row key, column and value
    long timestamp = System.currentTimeMillis() * 1000;
    deviceWriter.newRow(bytes(rowKey));
    deviceWriter.addColumn(bytes(colmName), bytes(value), timestamp);
}

I changed -Xmx256M to -Xmx1024M, but this does not help.

Update: From what I observe, since I am writing into a buffer (in physical memory), newer entries take longer as the number of entries in the buffer grows. (This is my guess.)

Any help is appreciated.

+6
4 answers

The only thing that BufferedReader does is read from the underlying Reader into an internal char[] buffer with a default size of 8K; all methods work on that buffer until it is exhausted, at which point another 8K (or whatever) is read from the underlying Reader. readLine() is more or less tacked on top of that.
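
To illustrate the point, here is a minimal sketch (the file name and buffer size are made up) that isolates readLine() itself: reading the whole file while doing nothing else should scale roughly linearly with the number of lines, and a buffer larger than the default 8K rarely changes that by much.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLineTiming {
    public static void main(String[] args) throws IOException {
        // 1 MB buffer instead of the default 8K; change "input.txt" to your file
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"), 1 << 20)) {
            long start = System.currentTimeMillis();
            long lines = 0;
            while (br.readLine() != null) {
                lines++;
            }
            System.out.println(lines + " lines in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}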

Proper use of BufferedReader should definitely not cause running time to rise from 40 seconds at 1.8 million lines to 30 minutes at 3 million lines. There must be something wrong with the rest of your code. Show it to us.

Another possibility is that your JVM does not have enough heap memory and spends most of the 30 minutes collecting garbage because its heap is 99% full, and with a larger input you would eventually get an OutOfMemoryError. What do you do with the lines after processing them? Are they kept in memory? Does running the program with the -Xmx1024M command-line option make a difference?
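
As a rough way to test the garbage-collection theory, a hypothetical diagnostic sketch (the 100,000-line interval is arbitrary) is to log heap usage periodically and see whether the heap fills up as processing slows down:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class HeapWatch {
    public static void main(String[] args) throws IOException {
        Runtime rt = Runtime.getRuntime();
        long lineNo = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                // ... your per-line processing would go here ...
                if (++lineNo % 100_000 == 0) {
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
                    System.err.println(lineNo + " lines read, heap in use: " + usedMb + " MB");
                }
            }
        }
    }
}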

+6

BufferedReader does not seek; it simply caches characters until it finds a newline and returns the line as a String, discarding (reusing) the buffer after each line. That is why you can use it with any stream or any other reader, even ones that do not support seeking.

So the number of lines alone should not make such a big difference at the reader level. A very long line, however, could produce a very large String and require a lot of RAM, but that does not seem to be your case (in that case it would most likely throw an OutOfMemoryError for exceeding the GC overhead limit or similar).

From what I can see in your code, you are not doing anything wrong. I suppose you are hitting some kind of limit, and since it does not look like RAM, maybe it is some hard limit on the Cassandra side? Have you tried commenting out the part that writes to Cassandra, just to find out whether the problem is on your side or Cassandra's?
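
For example, a hypothetical A/B test along those lines: guard the Cassandra writes behind a flag, time the loop with the flag off (read + split only) and then with it on. The package names and the static bytes() import are assumed to match the older Cassandra bulk-loading API used in the question, and the array indices for key, column and value are only placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

class WriteToggleTest {
    static final boolean WRITE_TO_CASSANDRA = false;   // flip to true for the second run

    static void processFile(BufferedReader br, SSTableSimpleUnsortedWriter deviceWriter) throws IOException {
        Pattern pattern = Pattern.compile("[\\[,\\]]");
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = pattern.split(line);       // rowKey / column / value, as in the question
            if (WRITE_TO_CASSANDRA) {
                long timestamp = System.currentTimeMillis() * 1000;
                deviceWriter.newRow(bytes(parts[0]));
                deviceWriter.addColumn(bytes(parts[1]), bytes(parts[2]), timestamp);
            }
        }
    }
}

If the 3-million-line file is still slow with the writes disabled, the problem is in the reading/splitting code; if it becomes fast, the time is going into the SSTable writer.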

+1

Take a look at NIO buffers, as they are more optimized than BufferedReader.

Here is a code snippet from another forum: http://www.velocityreviews.com/forums/t719006-bufferedreader-vs-nio-buffer.html

FileChannel fc = new FileInputStream("File.txt").getChannel();
ByteBuffer buffer = ByteBuffer.allocate(1024);
fc.read(buffer);
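
Note that the snippet above only issues a single read of at most 1024 bytes. A self-contained sketch that actually drains the channel might look like the following (the chunk size and the newline counting are only for illustration):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class NioLineCount {
    public static void main(String[] args) throws IOException {
        long lines = 0;
        try (FileChannel fc = new FileInputStream(args[0]).getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);   // read in 64 KB chunks
            while (fc.read(buffer) != -1) {
                buffer.flip();
                while (buffer.hasRemaining()) {
                    if (buffer.get() == '\n') {
                        lines++;                                   // count line breaks
                    }
                }
                buffer.clear();
            }
        }
        System.out.println(lines + " lines");
    }
}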

Edit: Also see this thread: Read large files in Java.

+1

BufferedReader is probably not the root of your performance problem.

Based on the numbers you quote, it looks like your code has quadratic complexity. For example, for each line you read, you re-examine every line you have read before. I am just speculating here, but a typical example of such a problem would be using a list data structure and checking whether the new line matches any of the previous lines.
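
This is only speculation about code we have not seen, but a sketch of the anti-pattern being described, together with the usual fix, looks like this (the slow variant scans the whole list on every call, so 3 million lines mean on the order of 10^12 comparisons, while a HashSet keeps the whole run roughly linear):

import java.util.List;
import java.util.Set;

public class DuplicateCheck {
    // Quadratic overall: contains() walks the entire list for every new line.
    static boolean seenBeforeSlow(List<String> seen, String line) {
        boolean duplicate = seen.contains(line);
        seen.add(line);
        return duplicate;
    }

    // Roughly linear overall: HashSet.add() is O(1) on average and
    // returns false when the element was already present.
    static boolean seenBeforeFast(Set<String> seen, String line) {
        return !seen.add(line);
    }
}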

+1

Source: https://habr.com/ru/post/895775/
