How to read a file using multiple threads in Java when a high-throughput file system (3 GB/s) is available

I understand that reading a file using multiple threads is inefficient on a normal spinning-disk system.

This is a different case: I have a high-throughput file system available that provides read speeds of up to 3 GB/s, along with 196 CPU cores and 2 TB of RAM.

A single-threaded Java program reads a file at 85-100 MB/s at most, so there is potential to do much better than a single thread. I have to read files of up to 1 TB, and I have enough memory to load them.

I am currently using the following or something similar, but I need to write something multi-threaded to get the best throughput:

Java 7 Files: 50 MB/s

List<String> lines = Files.readAllLines(Paths.get(path), encoding); 

Apache commons-io: 48 MB/s

 List<String> lines = FileUtils.readLines(new File("/path/to/file.txt"), "utf-8"); 

The same with Guava: 45 MB/s

 List<String> lines = Files.readLines(new File("/path/to/file.txt"), Charset.forName("utf-8")); 

Java Scanner class: very slow

    Scanner s = new Scanner(new File("filepath"));
    ArrayList<String> list = new ArrayList<String>();
    // Read line by line so the list matches the other snippets
    while (s.hasNextLine()) {
        list.add(s.nextLine());
    }
    s.close();

I want to be able to load a file and build the same ArrayList, in the correct order, as quickly as possible.

There is another question that reads similarly but is actually different: that question discusses systems on which multi-threaded file I/O physically cannot be efficient. Thanks to technical advances, we now have systems designed to support high-throughput I/O, so the CPU/software becomes the limiting factor, and that can be overcome with multi-threaded I/O.

The other question also does not answer how to write code for multi-threaded I/O.

2 answers

Here is a solution for reading a single file with multiple threads.

Divide the file into N chunks, read each chunk in its own thread, and then combine them in order. Beware of lines that cross chunk boundaries. This is the basic idea suggested by user slaks.
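
One way to deal with lines that cross chunk boundaries is to move each boundary forward to the next line start before handing chunks to threads. The following is a minimal sketch of that idea, not part of the original answer: the class name `ChunkBoundaries` and its methods are hypothetical, and it assumes `\n` line endings in an ASCII-compatible encoding such as UTF-8 (where the byte `0x0A` never occurs inside a multi-byte character).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper: computes chunk offsets that fall on line starts.
final class ChunkBoundaries {

    // Returns pieces + 1 offsets; chunk i covers [offsets[i], offsets[i + 1]).
    // Every interior offset sits just after a '\n', so no line is split.
    static long[] split(FileChannel channel, int pieces) throws IOException {
        long size = channel.size();
        long[] offsets = new long[pieces + 1];
        offsets[pieces] = size;
        for (int i = 1; i < pieces; i++) {
            // Start from the even split point, never behind the previous boundary.
            long guess = Math.max(size / pieces * i, offsets[i - 1]);
            offsets[i] = nextLineStart(channel, guess, size);
        }
        return offsets;
    }

    // Position just after the first '\n' found at or beyond 'from',
    // or 'size' if there is no further line break.
    static long nextLineStart(FileChannel channel, long from, long size) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        long pos = from;
        while (pos < size) {
            buf.clear();
            int n = channel.read(buf, pos); // positional read, does not move the channel
            if (n <= 0) {
                break;
            }
            for (int j = 0; j < n; j++) {
                if (buf.get(j) == '\n') {
                    return pos + j + 1;
                }
            }
            pos += n;
        }
        return size;
    }

    // args[0]: file path, args[1]: number of chunks
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            System.out.println(Arrays.toString(split(ch, Integer.parseInt(args[1]))));
        }
    }
}
```

Chunks produced this way can differ slightly in size, and a chunk may be empty when a single line spans more than one even split.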

Benchmarks of the multi-threaded implementation below on a single 20 GB file:

1 thread: 50 seconds: 400 MB/s

2 threads: 30 seconds: 666 MB/s

4 threads: 20 seconds: 1 GB/s

8 threads: 60 seconds: 333 MB/s

Equivalent Java 7 readAllLines(): 400 seconds: 50 MB/s

Note: this only works on systems designed to support high-throughput I/O, not on ordinary personal computers.

    package filereadtests;

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import static java.lang.Math.toIntExact;

    public class FileRead implements Runnable {

        private final FileChannel _channel;
        private final long _startLocation;
        private final int _size;
        final int _sequence_number;

        public FileRead(long loc, int size, FileChannel chnl, int sequence) {
            _startLocation = loc;
            _size = size;
            _channel = chnl;
            _sequence_number = sequence;
        }

        @Override
        public void run() {
            try {
                System.out.println("Reading the channel: " + _startLocation + ":" + _size);

                // Allocate memory for this chunk
                ByteBuffer buff = ByteBuffer.allocate(_size);

                // Read the file chunk into RAM. A positional read may return fewer
                // bytes than requested, so loop until the buffer is full or EOF.
                long position = _startLocation;
                while (buff.hasRemaining()) {
                    int n = _channel.read(buff, position);
                    if (n < 0) {
                        break;
                    }
                    position += n;
                }

                // Chunk to String. A multi-byte character split across two chunks
                // will not decode correctly; the chunk is not yet split into lines.
                String string_chunk = new String(buff.array(), 0, buff.position(), StandardCharsets.UTF_8);

                System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // args[0] is the path of the file to read
        // args[1] is the size of the thread pool; try different values to find the sweet spot
        public static void main(String[] args) throws Exception {
            FileInputStream fileInputStream = new FileInputStream(args[0]);
            FileChannel channel = fileInputStream.getChannel();
            long remaining_size = channel.size(); // total number of bytes in the file
            // file_size / threads; at least 1 so the loop below always advances
            long chunk_size = Math.max(1, remaining_size / Integer.parseInt(args[1]));

            // Max allocation size allowed is ~2 GB
            if (chunk_size > (Integer.MAX_VALUE - 5)) {
                chunk_size = (Integer.MAX_VALUE - 5);
            }

            // Thread pool
            ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt(args[1]));

            long start_loc = 0; // file pointer
            int i = 0;          // loop counter
            while (remaining_size >= chunk_size) {
                // Queue one task per chunk
                executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i));
                remaining_size = remaining_size - chunk_size;
                start_loc = start_loc + chunk_size;
                i++;
            }

            // Load the last remaining piece
            executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i));

            // Tear down: block until all tasks finish instead of busy-waiting
            executor.shutdown();
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);

            System.out.println("Finished all threads");
            fileInputStream.close();
        }
    }
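
The code above reads the chunks but does not yet split them into lines or combine them in order. A minimal sketch of that last step follows, under some assumptions: the class `ParallelLineReader` and the `chunkBytes` parameter are hypothetical, lines end in `\n`, the file is UTF-8, and every chunk (plus the line it is extended by) fits in a single array of under 2 GB. Each task returns its own list, and the lists are concatenated in submission order, which is also file order.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parallel chunk reads merged back into one ordered list.
final class ParallelLineReader {

    static List<String> readLines(Path path, int threads, int chunkBytes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            List<Future<List<String>>> parts = new ArrayList<>();
            long start = 0;
            while (start < size) {
                // Extend the chunk to the end of the line it would otherwise cut.
                long end = lineEnd(ch, Math.min(start + chunkBytes, size), size);
                final long from = start, to = end;
                parts.add(pool.submit(() -> readChunk(ch, from, to)));
                start = end;
            }
            List<String> lines = new ArrayList<>();
            for (Future<List<String>> part : parts) {
                lines.addAll(part.get()); // futures are consumed in file order
            }
            return lines;
        } finally {
            pool.shutdown();
        }
    }

    // Position just after the first '\n' at or beyond 'from', or 'size' if none.
    static long lineEnd(FileChannel ch, long from, long size) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        long pos = from;
        while (pos < size) {
            buf.clear();
            int n = ch.read(buf, pos);
            if (n <= 0) {
                break;
            }
            for (int j = 0; j < n; j++) {
                if (buf.get(j) == '\n') {
                    return pos + j + 1;
                }
            }
            pos += n;
        }
        return size;
    }

    // Reads bytes [from, to) with positional reads and splits them on '\n'.
    static List<String> readChunk(FileChannel ch, long from, long to) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Math.toIntExact(to - from));
        long pos = from;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos);
            if (n < 0) {
                break;
            }
            pos += n;
        }
        String text = new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        int lineStart = 0;
        for (int nl; (nl = text.indexOf('\n', lineStart)) >= 0; lineStart = nl + 1) {
            lines.add(text.substring(lineStart, nl));
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart)); // last line without a trailing '\n'
        }
        return lines;
    }
}
```

Decoupling the chunk size from the thread count keeps each buffer under the array size limit even for very large files, for example `ParallelLineReader.readLines(path, 4, 1 << 30)`; the best values for both parameters would have to be found by measurement.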

First you should try Java 7 Files.readAllLines:

 List<String> lines = Files.readAllLines(Paths.get(path), encoding); 

Using a multi-threaded approach is probably not a good option, as it will force the file system to perform random reads (which is never good for a file system).
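
Whether this holds on a given file system is something that can be measured. Below is a minimal sketch for timing the single-threaded `Files.readAllLines` baseline so it can be compared with a multi-threaded reader on the same file; the class `ReadBaseline` and its method names are hypothetical, and throughput is reported in decimal megabytes per second.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical timing harness for the single-threaded baseline.
final class ReadBaseline {

    // Throughput in decimal megabytes per second.
    static double mbPerSecond(long bytes, long nanos) {
        return (bytes / 1e6) / (nanos / 1e9);
    }

    // Times one Files.readAllLines pass over the file and returns MB/s.
    static double timeReadAllLines(Path path) throws IOException {
        long bytes = Files.size(path);
        long t0 = System.nanoTime();
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        long elapsed = System.nanoTime() - t0;
        System.out.println(lines.size() + " lines in " + elapsed / 1_000_000 + " ms");
        return mbPerSecond(bytes, elapsed);
    }

    // args[0]: path of the file to read
    public static void main(String[] args) throws IOException {
        System.out.printf("%.1f MB/s%n", timeReadAllLines(Paths.get(args[0])));
    }
}
```

A second run over the same file may be served from the operating system's page cache, so repeated measurements are not necessarily comparable with the first one.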


Source: https://habr.com/ru/post/915935/