Character decoding in Java: why is this faster with the reader than with buffers?

I am trying several ways to decode file bytes into characters.

Using a java.io.Reader created with Channels.newReader(...):

    public static void decodeWithReader() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
        Reader reader = Channels.newReader(channel, decoder, -1);
        final char[] buffer = new char[4096];
        for (;;) {
            if (-1 == reader.read(buffer)) {
                break;
            }
        }
        fis.close();
    }

Using buffers and the decoder manually:

    public static void readWithBuffers() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
        final long fileLength = channel.size();
        long position = 0;
        final int bufferSize = 1024 * 1024; // 1 MB
        CharBuffer cbuf = CharBuffer.allocate(4096);
        while (position < fileLength) {
            MappedByteBuffer bbuf = channel.map(MapMode.READ_ONLY, position,
                    Math.min(bufferSize, fileLength - position));
            for (;;) {
                CoderResult res = decoder.decode(bbuf, cbuf, false);
                if (CoderResult.OVERFLOW == res) {
                    cbuf.clear();
                } else if (CoderResult.UNDERFLOW == res) {
                    break;
                }
            }
            position += bbuf.position();
        }
        fis.close();
    }

For a 200 MB text file, the first approach consistently takes about 300 ms and the second consistently takes about 700 ms. Do you have any idea why the reader approach is so much faster?

Can it work even faster with another implementation?

The test runs on Windows 7 with JDK 7 update 7.
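A timing loop along the following lines is enough to reproduce this kind of measurement (just a sketch, not necessarily the exact harness used; the repeat count is arbitrary and FILE is the test file):

    public static void main(String[] args) throws Exception {
        // Minimal timing sketch: run the method a few times so the JIT warms up
        // and look at the later iterations.
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            decodeWithReader();            // or readWithBuffers()
            long elapsed = (System.nanoTime() - start) / 1000000;
            System.out.println("decodeWithReader: " + elapsed + " ms");
        }
    }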

+4
2 answers

Here is a third implementation that does not use mapped buffers. Under the same conditions as before, it consistently runs in 220 ms. The default encoding on my machine is windows-1252; if I use the simpler ISO-8859-1 encoding instead, decoding is even faster (about 150 ms).

It seems that native features such as mapped buffers actually hurt performance here (at least for this very practical case). Also interesting: if I allocate direct buffers instead of heap buffers (see the commented lines), performance drops (the run then takes about 400 ms).

So far the answer seems to be: to decode characters as fast as possible in Java (given that you cannot rely on a single fixed encoding), use the decoder manually, write the decoding loop with heap buffers, and do not use mapped or native (direct) buffers. I must admit that I do not really know why this is so.

    public static void readWithBuffers() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
        // CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
        ByteBuffer bbuf = ByteBuffer.allocate(4096);
        // ByteBuffer bbuf = ByteBuffer.allocateDirect(4096);
        CharBuffer cbuf = CharBuffer.allocate(4096);
        // CharBuffer cbuf = ByteBuffer.allocateDirect(2 * 4096).asCharBuffer();
        for (;;) {
            int n = channel.read(bbuf);
            bbuf.flip();
            for (;;) {
                CoderResult res = decoder.decode(bbuf, cbuf, n == -1);
                if (CoderResult.OVERFLOW == res) {
                    cbuf.clear();        // decoded chars are simply discarded
                } else {                 // UNDERFLOW: need more input
                    break;
                }
            }
            if (n == -1) {
                decoder.flush(cbuf);
                break;
            }
            bbuf.compact();              // keep any undecoded trailing bytes
        }
        fis.close();
    }
+2

For comparison, you can try this:

    public static void readWithBuffersISO_8859_1() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        MappedByteBuffer bbuf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        while (bbuf.remaining() > 0) {
            char ch = (char) (bbuf.get() & 0xFF);
        }
        fis.close();
    }

This assumes ISO-8859-1. If you need maximum speed, treating the text as a binary format can help, if that is an option.
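A variation on the same idea (hypothetical, not benchmarked here) is to copy the mapped bytes out in bulk and convert a chunk at a time, which avoids one get() call per byte; it still assumes single-byte ISO-8859-1 text:

    public static void readWithBulkGetISO_8859_1() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        MappedByteBuffer bbuf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        byte[] bytes = new byte[8192];
        char[] chars = new char[8192];
        while (bbuf.remaining() > 0) {
            int n = Math.min(bytes.length, bbuf.remaining());
            bbuf.get(bytes, 0, n);                    // bulk copy instead of one get() per byte
            for (int i = 0; i < n; i++) {
                chars[i] = (char) (bytes[i] & 0xFF);  // ISO-8859-1: byte value == char value
            }
        }
        fis.close();
    }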

As @EJP points out, you are changing several things at once, so you need to start with the simplest comparable example and see how much difference each element adds.
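For example, a baseline that performs the same reads but skips decoding entirely would separate the raw I/O cost from the decoding cost (a sketch; the method name is made up):

    public static void readWithoutDecoding() throws Exception {
        FileInputStream fis = new FileInputStream(FILE);
        FileChannel channel = fis.getChannel();
        ByteBuffer bbuf = ByteBuffer.allocate(4096);
        while (channel.read(bbuf) != -1) {
            bbuf.clear();   // throw the bytes away; only the read cost is measured
        }
        fis.close();
    }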

+2
