Parallel Reading Using Java MappedByteBuffer

I am trying to use MappedByteBuffer to allow concurrent reads of a file by multiple threads, with the following constraints:

  • The file is too large to load into memory
  • Threads must be able to read asynchronously (this is a web application)
  • The file is never written to by any thread
  • Each thread will always know the exact offset and length of the bytes it needs to read (i.e., no "seeking" by the application itself)

According to the docs ( https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html ), Buffers are not thread-safe because they hold internal state (position, etc.). Is there a way to have concurrent random access to a file without loading it all into memory?

Although FileChannel is technically thread-safe, from the docs:

Where the file channel is obtained from an existing stream or random access file, the state of the channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa.

Thus, it would seem that access is effectively just synchronized. If I were to call new RandomAccessFile().getChannel().map() in each thread [edit: on each read], wouldn't that incur the I/O overhead on every read that MappedByteBuffers are supposed to avoid?
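
To make this concrete, here is roughly the per-read mapping I have in mind — a minimal sketch where the file name and offsets are just placeholders; each call opens the file and maps only the requested region:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapPerRead {
        // Sketch: map only the requested region on each read.
        // "data.bin" and the offsets below are placeholders for illustration.
        static byte[] readRegion(String path, long offset, int length) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel channel = raf.getChannel()) {
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
                byte[] bytes = new byte[length];
                buffer.get(bytes);   // each call has its own buffer, so no shared position
                return bytes;
            }
        }

        public static void main(String[] args) throws Exception {
            byte[] chunk = readRegion("data.bin", 1024, 256);
            System.out.println("read " + chunk.length + " bytes");
        }
    }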

+5
2 answers

Instead of using multiple threads for parallel reads, I would go with the following approach (based on an example with a huge CSV file whose lines had to be sent concurrently via HTTP):

Reading a single file at multiple positions concurrently won't make things any faster (and may well slow you down considerably).

Instead of reading the file from multiple threads, read it from a single thread and parallelize the processing of the lines. One thread should read your CSV lines and put each line on a queue. Several worker threads should then take the next line from the queue, parse it, turn it into a request, and process the request concurrently as needed. The splitting of the work is then done by a single thread, which guarantees that no lines are missed and none overlap.

If you can read the file line by line, the LineIterator from Commons IO is a memory-efficient way to do it. If you have to work with chunks, your MappedByteBuffer seems like a reasonable approach. For the queue, I would use a blocking queue with a fixed capacity, such as ArrayBlockingQueue, to keep memory usage under control (lines/chunks in the queue + lines/chunks among the workers = lines/chunks in memory).
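
Here is a minimal sketch of that single-reader / multiple-workers layout, assuming Commons IO is on the classpath; the file name, queue capacity, worker count and the processLine step are placeholders for your own setup:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    public class SingleReaderPipeline {
        private static final String POISON_PILL = "\u0000EOF";   // sentinel that tells workers to stop

        public static void main(String[] args) throws Exception {
            int workers = 4;
            // Bounded queue: at most 1000 lines held in memory at any time.
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
            ExecutorService pool = Executors.newFixedThreadPool(workers);

            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try {
                        for (String line = queue.take(); !POISON_PILL.equals(line); line = queue.take()) {
                            processLine(line);          // parse, build the request, send it, etc.
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Single reader thread: LineIterator streams the file without loading it all.
            LineIterator it = FileUtils.lineIterator(new File("huge.csv"), StandardCharsets.UTF_8.name());
            try {
                while (it.hasNext()) {
                    queue.put(it.nextLine());           // blocks when the queue is full
                }
            } finally {
                it.close();
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON_PILL);                 // one pill per worker to shut them all down
            }
            pool.shutdown();
        }

        private static void processLine(String line) {
            // placeholder for the real work (e.g. turning the line into an HTTP request)
            System.out.println(Thread.currentThread().getName() + " handled: " + line);
        }
    }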

0

FileChannel supports a positional read operation without synchronization. On Linux it natively uses pread:

 public abstract int read(ByteBuffer dst, long position) throws IOException 

From the FileChannel documentation:

... Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.

This is fairly low-level, returning the number of bytes read (see the details here). But I think you can still use it, given your assumption that "each thread will always know the exact offset and length of the bytes that it needs to read".
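
A minimal sketch of sharing that positional read between threads — the file name, offsets and lengths are made up for illustration, and the helper loops because read may return fewer bytes than requested:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class PositionalReads {
        // Reads exactly 'length' bytes starting at 'offset', looping because
        // read(ByteBuffer, long) may return fewer bytes than requested.
        static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, position);
                if (n < 0) {
                    throw new IOException("Unexpected end of file at position " + position);
                }
                position += n;
            }
            return buffer.array();
        }

        public static void main(String[] args) throws Exception {
            // One shared, read-only channel; the positional read does not touch the
            // channel's own position, so threads can call it concurrently.
            try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
                Thread t1 = new Thread(() -> dump(channel, 0, 128));
                Thread t2 = new Thread(() -> dump(channel, 4096, 256));
                t1.start();
                t2.start();
                t1.join();
                t2.join();
            }
        }

        private static void dump(FileChannel channel, long offset, int length) {
            try {
                byte[] bytes = readAt(channel, offset, length);
                System.out.println(Thread.currentThread().getName() + " read " + bytes.length + " bytes");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }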

0

Source: https://habr.com/ru/post/1268131/

