What is the fastest way to write large amounts of data from memory to a file?

I have a program that generates a lot of data and puts it on a queue to be written, but the problem is that it generates data faster than I can currently write it (which maxes out memory and starts to slow everything down). The order does not matter, as I plan to analyze the file later.

I looked around a bit and found some questions that helped me design my current process (but I still find it slow). Here is my code:

    //...background multi-threaded process keeps building the queue..
    FileWriter writer = new FileWriter("foo.txt", true);
    BufferedWriter bufferWritter = new BufferedWriter(writer);
    while (!queue_of_stuff_to_write.isEmpty()) {
        String data = solutions.poll().data;
        bufferWritter.newLine();
        bufferWritter.write(data);
    }
    bufferWritter.close();

I am new to programming, so I may be assessing this wrongly (maybe it is a hardware problem, since I use EC2), but is there a way to dump the results of the queue to a file very quickly, or, if my approach is OK, can I improve it somehow? Since the order does not matter, does it make sense to write multiple files on multiple disks? Will splitting the output make it faster? I'm not sure what the best approach is, and any suggestions would be wonderful. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and to keep memory consumption as low as possible for my application (I am not 100% sure, but the queue fills up to 15 GB, so I assume this will be a 15 GB+ file).

Related questions I found:
The fastest way to write huge data to a Java text file (made me realize I have to use a buffered writer)
Writing a file in Java on Windows in parallel (made me realize that maybe multithreaded writing was not a great idea)

+6
4 answers

Looking at this code, one thing that comes to mind is character encoding. You write strings, but ultimately it is bytes that go to the streams. A writer does the character-to-byte encoding under the hood, and it does so on the same thread that handles the writing. This may mean that time spent on encoding delays the writes, which can reduce the rate at which data is written.

A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads that push onto the queue, and have the I/O code use a BufferedOutputStream instead of a BufferedWriter.

This can also reduce memory consumption if the encoded text takes up less than two bytes per character on average. For Latin text and UTF-8 encoding, this is usually true.
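A minimal sketch of that change, under some assumptions: the class and method names (ByteQueueWriter, enqueue, drain) are made up for illustration, UTF-8 is assumed as the target encoding, and a ConcurrentLinkedQueue stands in for whatever queue the original program uses.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class ByteQueueWriter {
    // Producer side: encode on the generating thread, so the writer thread only moves bytes.
    static void enqueue(Queue<byte[]> queue, String data) {
        queue.add((data + "\n").getBytes(StandardCharsets.UTF_8));
    }

    // Writer side: append everything currently queued to the file, return the byte count.
    static long drain(Queue<byte[]> queue, Path file) throws IOException {
        long written = 0;
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(
                file, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
            byte[] chunk;
            while ((chunk = queue.poll()) != null) {
                out.write(chunk);
                written += chunk.length;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Queue<byte[]> queue = new ConcurrentLinkedQueue<>();
        enqueue(queue, "first result");
        enqueue(queue, "second result");
        Path file = Files.createTempFile("foo", ".txt"); // temp file, for the demo only
        System.out.println(drain(queue, file) + " bytes written to " + file);
    }
}
```

In a real program enqueue would be called from the generating threads and drain from the single writer thread; whether this helps depends on how much of the write time is actually spent encoding.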

However, I suspect you are simply generating data faster than your I/O subsystem can handle it. You will need to make your I/O subsystem faster, either by using a faster one (if you are on EC2, perhaps by renting a faster instance or writing to a different backend: SQS vs EBS vs local disk, etc.) or by somehow ganging several I/O subsystems together in parallel.

+2

Yes, writing multiple files on multiple disks should help, and if nothing else is writing to those disks at the same time, performance should scale linearly with the number of disks until I/O is no longer the bottleneck. You can also try a couple of other optimizations to boost performance even further.
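As a rough sketch of the multiple-files idea: one writer thread per target file, all pulling from the same queue. The class and method names are hypothetical, and the demo writes to a temp directory; on a real machine each target path would sit on a different disk or volume. Note that each writer stops as soon as it finds the queue empty, so this version only drains data that has already been queued.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class MultiFileWriter {
    // Starts one writer thread per target file. Every thread polls the same queue,
    // so the records end up spread over the files in no particular order.
    static void drainToFiles(Queue<String> queue, List<Path> targets) throws InterruptedException {
        List<Thread> writers = new ArrayList<>();
        for (Path target : targets) {
            Thread t = new Thread(() -> {
                try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                    String data;
                    while ((data = queue.poll()) != null) {
                        out.write(data);
                        out.newLine();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            t.start();
            writers.add(t);
        }
        for (Thread t : writers) {
            t.join(); // wait until every file has been written and closed
        }
    }

    public static void main(String[] args) throws Exception {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1_000; i++) queue.add("result " + i);
        Path dir = Files.createTempDirectory("out"); // stand-in for separate disks
        drainToFiles(queue, Arrays.asList(dir.resolve("part-0.txt"), dir.resolve("part-1.txt")));
        System.out.println("written to " + dir);
    }
}
```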

If you create huge files and the disk just can't keep up, you can use GZIPOutputStream to compress the output, which in turn reduces the amount of disk I/O. For non-random text, you can usually expect a compression ratio of somewhere between 2x and 10x.

    //...background multi-threaded process keeps building the queue..
    OutputStream out = new FileOutputStream("foo.txt", true);
    OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
    BufferedWriter bufferWriter = new BufferedWriter(writer);
    while (!queue_of_stuff_to_write.isEmpty()) {
        String data = solutions.poll().data;
        bufferWriter.newLine();
        bufferWriter.write(data);
    }
    bufferWriter.close();

If you output regular (i.e., repetitive) data, you can also consider switching to another output format, for example a binary encoding of the data. Depending on the structure of your data, it may be more efficient to store it in a database. If you are outputting XML and really want to stick with XML, you should look into a binary XML format such as EXI or Fast Infoset.

+1

I guess that as long as you produce your data from calculations rather than loading it from another data source, writing will always be slower than generating the data.

You can try to write your data to several files (not to the same file, because of synchronization problems) from several threads (but I think this will not solve your problem).

Is it possible for you to wait for the writing part of your application to finish its work and only then continue computing?

Another question: do you actually empty your queue? Does solutions.poll() shrink your solutions queue?
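One possible way to get both effects (the generating threads wait for the writer, and the queue really does shrink) is a bounded blocking queue. This is not something the answer spells out; it is a sketch that assumes java.util.concurrent.ArrayBlockingQueue, with a hypothetical class name and a hypothetical DONE marker to signal the end of the data.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedQueueWriter {
    // Hypothetical end-of-data marker; compared by identity, so real data never matches it.
    static final String DONE = new String("DONE");

    // Consumer: take() blocks while the queue is empty; the loop ends at the marker.
    static int writeUntilDone(BlockingQueue<String> queue, Path file)
            throws IOException, InterruptedException {
        int lines = 0;
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String data = queue.take(); data != DONE; data = queue.take()) {
                out.write(data);
                out.newLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // The capacity caps how many pending records can pile up in memory.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000);
        Path file = Files.createTempFile("foo", ".txt"); // temp file, for the demo only
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10_000; i++) {
                    queue.put("result " + i); // blocks while the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        System.out.println(writeUntilDone(queue, file) + " lines written to " + file);
        producer.join();
    }
}
```

With this design the computation slows down to the speed of the disk instead of filling memory. With several producer threads, the marker would have to be queued only after all of them have finished.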

0

Writing to different files using multiple threads is a good idea. In addition, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if this helps.
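For illustration, the buffer size is the second argument of the BufferedWriter constructor. The class name and constant below are made up; note that the size is counted in chars rather than bytes, so this value takes roughly 20 MB of heap.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class BigBufferWriter {
    // Buffer size in chars (2 bytes each in memory), not in bytes.
    static final int BUFFER_CHARS = 10 * 1024 * 1024;

    // Opens the file in append mode, as in the question, with an enlarged buffer.
    static BufferedWriter open(String fileName) throws IOException {
        return new BufferedWriter(new FileWriter(fileName, true), BUFFER_CHARS);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("foo", ".txt"); // temp file, for the demo only
        try (BufferedWriter out = open(file.toString())) {
            out.write("some result");
            out.newLine();
        }
        System.out.println("written to " + file);
    }
}
```

Whether a larger buffer helps is something to measure; if the disk itself is the bottleneck, it will mostly just delay the point at which the writer blocks.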

0

Source: https://habr.com/ru/post/912768/

