Processing a large (GB) file quickly and repeatedly (Java)

What options exist for processing large files quickly, several times?

I have a single file (at least 1.5 GB, but possibly 10-15 GB or more) that needs to be read many times - on the order of hundreds to thousands of passes. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).

The file is read sequentially and is read-only. It is encrypted on disk (sensitive data), and I use MessagePack to deserialize it into objects during the reading process.
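
For reference, a single sequential pass with msgpack-java could look roughly like the sketch below. The record layout (an int id followed by an array of doubles), the RecordConsumer callback, and the use of CipherInputStream for the on-disk decryption are illustrative assumptions, not details taken from the question.

    import org.msgpack.core.MessagePack;
    import org.msgpack.core.MessageUnpacker;

    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public final class StreamingPass {

        interface RecordConsumer {
            void accept(int id, double[] features);
        }

        // One sequential pass over the encrypted, MessagePack-encoded file.
        // Records are handed to the consumer one at a time, so only the
        // current record ever lives on the heap.
        static void forEachRecord(String path, Cipher decryptCipher,
                                  RecordConsumer consumer) throws IOException {
            try (InputStream raw = new FileInputStream(path);
                 InputStream decrypted = new CipherInputStream(raw, decryptCipher);
                 InputStream buffered = new BufferedInputStream(decrypted, 1 << 20);
                 MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(buffered)) {

                while (unpacker.hasNext()) {
                    int id = unpacker.unpackInt();        // hypothetical layout: an id ...
                    int n = unpacker.unpackArrayHeader(); // ... followed by a double[] of features
                    double[] features = new double[n];
                    for (int i = 0; i < n; i++) {
                        features[i] = unpacker.unpackDouble();
                    }
                    consumer.accept(id, features);
                }
            }
        }
    }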

I can't keep the objects created from the file in memory - the expansion is too large (a 1.5 GB file turns into roughly 35 GB of objects on the heap). Nor can the file be held as a single byte array, because Java arrays are limited to 2^31 - 1 elements.

My initial thought was to use a memory-mapped file, but that has its own set of limitations.
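
The biggest of those limitations is that a single FileChannel#map() call cannot cover a file of this size, so the file has to be mapped in several windows. A minimal sketch, with an arbitrary 1 GiB window size and the decryption step deliberately ignored (the mapped bytes are still the encrypted on-disk bytes):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;

    public final class ChunkedMapping {

        // A single FileChannel#map() call is capped well below 10-15 GB,
        // so a file that big has to be covered by several mappings.
        static final long WINDOW_SIZE = 1L << 30; // 1 GiB per window (arbitrary choice)

        static List<MappedByteBuffer> mapWholeFile(Path file) throws IOException {
            List<MappedByteBuffer> windows = new ArrayList<>();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = channel.size();
                for (long pos = 0; pos < size; pos += WINDOW_SIZE) {
                    long len = Math.min(WINDOW_SIZE, size - pos);
                    windows.add(channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
                }
            }
            // The mappings stay valid after the channel is closed.
            return windows;
        }
    }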

The idea is to get the file off disk and into memory for processing.

The end goal is to feed a large amount of data to a machine learning algorithm that needs multiple passes over it. During each pass, the algorithm itself uses a significant amount of heap, which is unavoidable - hence the requirement to re-read the file rather than cache the objects.

+4
3 answers

First, whatever else you do: use mmap(). The mmap() syscall itself can map up to 2^64 bytes, but FileChannel#map() cannot map more than 2^30 bytes in a single call.

Given your file sizes, you will therefore need several FileChannel "windows" over the file and a way to combine them.

"" ​​, : largetext. , , , , . , JDK , .

At its core it uses a Guava RangeMap<Long, MappedByteBuffer>.
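
A minimal sketch of that idea, assuming fixed-size windows keyed by the absolute file offsets they cover (the class name, window handling and the still-missing decryption step are all placeholders here):

    import com.google.common.collect.Range;
    import com.google.common.collect.RangeMap;
    import com.google.common.collect.TreeRangeMap;

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public final class MappedWindows {

        private final RangeMap<Long, MappedByteBuffer> windows = TreeRangeMap.create();

        MappedWindows(Path file, long windowSize) throws IOException {
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = channel.size();
                for (long pos = 0; pos < size; pos += windowSize) {
                    long len = Math.min(windowSize, size - pos);
                    MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    // Key each mapping by the absolute offsets it covers.
                    windows.put(Range.closedOpen(pos, pos + len), buf);
                }
            }
        }

        // Read one byte at an absolute file offset by delegating to the window covering it.
        byte byteAt(long offset) {
            Map.Entry<Range<Long>, MappedByteBuffer> e = windows.getEntry(offset);
            if (e == null) {
                throw new IndexOutOfBoundsException("offset not mapped: " + offset);
            }
            return e.getValue().get((int) (offset - e.getKey().lowerEndpoint()));
        }
    }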

You don't need a CharSequence in your case; what you want is something like a LargeByteMapping interface, behind which you map the file in chunks and serve bytes from a given offset, so the rest of your code only ever talks to that interface. Note that a CharSequence is indexed by int and therefore limited to 2^31 - 1 characters anyway, which is another reason not to reuse it here.

Of course, your real difficulty will be the decryption; but that is "just" a matter of plugging the decryption step into the same place where largetext decodes its chunks into characters - only for bytes instead of characters!

So: a LargeByteMapping interface, plus a factory that takes a Path and hands back an instance backed by the chunked mappings; in short: open, map, wrap.
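
The interface itself could be as small as the following; apart from the name LargeByteMapping, which comes from the answer, the method names and factory signature are guesses at what such an abstraction might look like:

    import java.io.IOException;
    import java.nio.file.Path;

    // Hypothetical shape of the abstraction described above: callers only ever ask
    // for "length bytes starting at offset"; the implementation hides the chunked
    // mappings and the decryption step.
    public interface LargeByteMapping extends AutoCloseable {

        long size();

        // Copy length bytes starting at absolute offset into dest.
        void copyTo(long offset, byte[] dest, int destOffset, int length);

        @Override
        void close() throws IOException;

        // Factory: open the file, map it in windows, wrap the result.
        static LargeByteMapping open(Path file) throws IOException {
            // return new MappedLargeByteMapping(file);  // implementation not shown
            throw new UnsupportedOperationException("left as an exercise");
        }
    }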

It would be a fun thing to write. Good luck.


EDIT: one thing I almost forgot to mention... A MappedByteBuffer does NOT consume HEAP!!

It lives in off-heap (direct) memory, the same kind of memory as ByteBuffer.allocateDirect(), except that it is backed by the file itself.

Don't forget that; with files this size, read this many times, it can make all the difference!
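
A quick way to convince yourself of this: map a window of a file, allocate a direct buffer, and compare; both report isDirect() == true and neither allocation shows up in heap usage. (The file path comes from the command line; nothing here is specific to the question's setup.)

    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public final class OffHeapDemo {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get(args[0]);
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(ch.size(), 1L << 20));
                ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);

                // Both buffers live outside the Java heap.
                System.out.println("mapped.isDirect() = " + mapped.isDirect());
                System.out.println("direct.isDirect() = " + direct.isDirect());
                long usedHeap = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
                System.out.println("heap in use (MB)  = " + usedHeap / (1 << 20));
            }
        }
    }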

+4

In my opinion, a flat file is the wrong storage for this workload. Consider loading the data once into a NoSQL store (Wide-Column, Graph, etc.) and reading it from there. Such stores keep frequently accessed data cached in memory and handle concurrent readers, so you avoid decrypting and deserializing the whole file on every pass. Since this is input for a machine learning algorithm over a lot of data, it is essentially a bigdata problem anyway.

0

How about creating a "dictionary" as a bridge between your program and the target file? Your program would query the dictionary, and the dictionary would look things up in the large file.
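
One way to read that suggestion (an interpretation; the answer does not spell it out) is an index from record id to file offset, built during the first pass and used to seek directly on later passes. A sketch that ignores the decryption step, which in practice would require a cipher layout supporting random access:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public final class OffsetDictionary {

        private final Map<Long, Long> idToOffset = new HashMap<>();
        private final RandomAccessFile file;

        OffsetDictionary(String path) throws IOException {
            this.file = new RandomAccessFile(path, "r");
        }

        // Record where a given id starts; called while scanning the file the first time.
        void put(long recordId, long offset) {
            idToOffset.put(recordId, offset);
        }

        // Jump straight to a record on later passes and return up to length raw bytes.
        byte[] readRecord(long recordId, int length) throws IOException {
            Long offset = idToOffset.get(recordId);
            if (offset == null) {
                throw new IllegalArgumentException("unknown record id: " + recordId);
            }
            file.seek(offset);
            byte[] buf = new byte[length];
            int read = file.read(buf);
            return read == length ? buf : Arrays.copyOf(buf, Math.max(read, 0));
        }
    }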

0

Source: https://habr.com/ru/post/1539074/

