Java: a quick way to do random reads on huge files (disks)

Question

I have a moderately large data set, about 800 MB or so, which is basically a pre-computed table that I need in order to speed up some computations by several orders of magnitude (creating this file took several computers running an optimized, multi-threaded algorithm... I really need this file).

Now that it has been computed once, this 800 MB of data is read-only.

I cannot hold it in memory.

It is currently one big 800 MB file, but splitting it into smaller files is not a problem if that helps.

I need to read about 32 bits of data here and there in this file, many times over. I do not know in advance where I will need to read: the read locations are uniformly distributed.

What would be the fastest way in Java to do random reads on such a file or files? Ideally I would do these reads from several unrelated threads (but, if necessary, I could queue the reads in a single thread).

Is Java NIO the way to go?

I am not familiar with memory-mapped files: I think I do not want to map the whole 800 MB into memory.

All I want is the fastest random reads I can get to access this 800 MB of disk data.

By the way, in case people are wondering, this is not at all the same as the question I asked recently:

Java: fast disk-based hash

java nio

cocotwo Feb 27 '10 at 9:18

4 answers

Stu Thompson · Answer 1 · 2010-04-24T07:16:58+0000

800 MB is not much to load and hold in memory. If you can afford to have several machines grinding away at a data set for days, surely you can afford an extra GB or two of RAM, no?

That said, read up on java.nio.MappedByteBuffer. Your remark that you do not want to map 800 MB into memory suggests that the concept is not clear to you.

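In a nutshell, a mapped byte buffer lets you access the data programmatically as if it were in memory, although it may actually be on disk or in memory. That is for the OS to decide, since Java's MBB is built on the OS's virtual memory subsystem. It is also nice and fast. You can also access one MBB from multiple threads, provided each thread uses absolute reads such as getInt(int index) or works on its own duplicate() of the buffer, because the buffer's position is shared mutable state.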

Here are the steps I recommend you take:

Create a MappedByteBuffer instance that maps your data file. Creating it is somewhat expensive, so keep it around.
In your lookup method ...
- create a byte[4] array
- position the buffer at the location you want, then call .get(byte[] dst, int offset, int length) (offset is an index into dst, not into the file); an absolute getInt(int index) does the same in one call
- the byte array will now hold your data, which you can turn into a value

And presto! You have data!
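The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: the class name MappedTable and the method readInt are hypothetical, the table is assumed to hold big-endian 32-bit values (the ByteBuffer default), and the absolute getInt(int index) is used in place of the byte[4] copy so that the buffer's position is never modified.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; the name and layout are illustrative only.
final class MappedTable {
    private final MappedByteBuffer buffer;

    MappedTable(Path file) throws IOException {
        // Map the whole file read-only, once. The mapping remains valid
        // after the channel is closed.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Absolute read: leaves the buffer's position untouched.
    int readInt(int byteOffset) {
        return buffer.getInt(byteOffset);
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in for the real 800 MB table.
        Path file = Files.createTempFile("table", ".bin");
        ByteBuffer data = ByteBuffer.allocate(16);
        for (int i = 0; i < 4; i++) {
            data.putInt(i * 100);
        }
        Files.write(file, data.array());

        MappedTable table = new MappedTable(file);
        System.out.println(table.readInt(8)); // value stored at byte offset 8
    }
}
```

A single map() call covers at most Integer.MAX_VALUE bytes, so an 800 MB file fits in one mapping; a much larger file would need several mappings.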

I am a big fan of MBBs and have used them successfully for this kind of task in the past.

Konrad Garus · Answer 2 · 2010-02-27T09:26:58+0000

RandomAccessFile (blocking) can help: http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html

You can also use FileChannel.map() to map a region of the file into memory, then read from the resulting MappedByteBuffer.

See also: http://java.sun.com/docs/books/tutorial/essential/io/rafs.html
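As a rough sketch of the RandomAccessFile route, assuming the table stores big-endian 32-bit values (which is what readInt() expects): the class name RafTable is hypothetical. Because seek() and readInt() share one file pointer, the pair is guarded with a lock here; giving each thread its own RandomAccessFile would be another option.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the name is illustrative only.
final class RafTable implements AutoCloseable {
    private final RandomAccessFile file;

    RafTable(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "r"); // read-only mode
    }

    // seek + readInt use the shared file pointer, so serialize callers.
    synchronized int readInt(long byteOffset) throws IOException {
        file.seek(byteOffset);
        return file.readInt(); // reads 4 bytes, big-endian
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        // Small stand-in file for demonstration.
        Path path = Files.createTempFile("table", ".bin");
        Files.write(path, ByteBuffer.allocate(8).putInt(11).putInt(22).array());
        try (RafTable table = new RafTable(path)) {
            System.out.println(table.readInt(4));
        }
    }
}
```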

Peter Lawrey · Answer 3 · 2010-02-27T12:56:52+0000

Actually, 800 MB is not very big. If you have 2 GB of memory or more, the file can reside in the disk cache, if not in your application itself.

Ross Judson · Answer 4 · 2012-02-15T06:40:44+0000

For the write case, with Java 7, you should look at AsynchronousFileChannel.
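A minimal sketch of positional writes through AsynchronousFileChannel, assuming fixed-size records: the class name AsyncRecordWriter and the helper writeRecord are hypothetical, and this is not the benchmark code. Each write is submitted without blocking, and the caller waits on the returned Future objects afterwards.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Future;

// Hypothetical helper; names and record size are illustrative only.
final class AsyncRecordWriter {
    static final int RECORD_SIZE = 160;

    // Submits one record write at its slot and returns immediately.
    static Future<Integer> writeRecord(AsynchronousFileChannel channel,
                                       long recordIndex, byte[] record) {
        return channel.write(ByteBuffer.wrap(record), recordIndex * RECORD_SIZE);
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("records", ".bin");
        try (AsynchronousFileChannel channel =
                 AsynchronousFileChannel.open(path, StandardOpenOption.WRITE)) {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                byte[] record = new byte[RECORD_SIZE];
                Arrays.fill(record, (byte) i);
                pending.add(writeRecord(channel, i, record));
            }
            // Wait for every outstanding write to complete.
            for (Future<Integer> f : pending) {
                f.get();
            }
        }
        System.out.println(Files.size(path));
    }
}
```

A production version would also check the byte count returned by each Future and resubmit the remainder if a write completed only partially.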

When performing random, record-oriented writes across large files (exceeding physical memory, so caching does not help with everything) on NTFS, I found that AsynchronousFileChannel performs twice as many operations in single-threaded mode as a regular FileChannel (10 GB file, 160-byte records, completely random writes, some random content, several hundred iterations of the benchmarking loop to reach a steady state, approximately 5,300 writes per second).

My best guess is that, since asynchronous I/O boils down to overlapped I/O on Windows 7, the NTFS file system driver can update its internal structures faster when it does not need to create a synchronization point after each call.

I micro-benchmarked against RandomAccessFile to see how it would perform (the results are very close to FileChannel, and about half of AsynchronousFileChannel's performance).

I am not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic disk, and another order of magnitude faster for smaller files that fit into memory).

It will be interesting to see whether the same ratios hold on Linux.
