How to write an obscene amount of data to a file?

I am developing an application that reads lines from huge text files (~2.5 GB), manipulates each line into a specific format, and then writes each line to an output text file. Once the output file is closed, the program bulk-inserts (SQL Server BULK INSERT) the data into my database. It works, it's just slow.

I use StreamReader and StreamWriter.

I'm more or less stuck reading one line at a time because I need to manipulate the text; however, I think that if I built up a collection of lines and wrote out the collection every 1,000 lines or so, it would speed things up at least a little. The problem (and this may be pure ignorance on my part) is that I cannot write a string[] with StreamWriter. After exploring Stack Overflow and the rest of the Internet, I came across File.WriteAllLines, which lets me write a string[] to a file, but I don't think my computer's memory can handle 2.5 GB of data held all at once. In addition, the file is created, filled, and closed in one go, so I would need to make a ton of smaller files to break up the 2.5 GB text file just to insert it into the database. So I would rather stay away from that option.
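(For what it's worth, nothing stops you from writing a string[] with StreamWriter via a plain loop. A minimal sketch; the Batch class and WriteBatch name are my own illustration, not from any library:)

```csharp
using System.IO;

static class Batch
{
    // StreamWriter has no string[] overload, but a plain loop over the
    // array is equivalent, and only one batch is held in memory at a time.
    public static void WriteBatch(TextWriter writer, string[] batch)
    {
        foreach (string line in batch)
            writer.WriteLine(line);
    }
}
```

You could call this every 1,000 lines with the accumulated batch and then clear the collection.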

One hack I can think of is creating a StringBuilder and using its AppendLine method to append each line, building up one giant string. Then I could convert the StringBuilder to a string and write it to the file.

But enough of my speculation. The method I have already implemented works, but I wonder whether anyone can suggest a better way to write chunks of data to a file?

+4
3 answers

Two things will increase the output speed of StreamWriter.

First, make sure the output file is on a different physical disk than the input file. If the input and output are on the same disk, then very often reads have to wait for writes and writes have to wait for reads; a drive can only do one thing at a time. Obviously, not every read or write has to wait, because StreamReader reads into a buffer and parses lines from it, and StreamWriter writes to a buffer and then pushes it to disk when the buffer is full. With the input and output files on separate disks, your reads and writes overlap.

What do I mean by overlap? The operating system typically reads ahead for you, so it can be buffering your file while you are processing. And when you write, the OS typically buffers the data and writes it to disk lazily. So a certain amount of asynchronous processing happens.

The second is to increase the size of the buffer. The default buffer size for StreamReader and StreamWriter is 4 kilobytes, so every 4 KB read or written means a call to the operating system and, quite likely, a disk operation.

If you increase the buffer size to 64 KB, you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Switching to a 64 KB buffer can cut I/O time by more than 25%, and it's dead simple:

 const int BufferSize = 64 * 1024;
 var reader = new StreamReader(inputFilename, Encoding.UTF8, true, BufferSize);
 var writer = new StreamWriter(outputFilename, false, Encoding.UTF8, BufferSize);

These two things will do the most to speed up your I/O. Trying to build buffers in memory with a StringBuilder is just unnecessary work that poorly duplicates what you get by increasing the buffer size, and getting it wrong can make your program slower.

I would caution against buffer sizes over 64 KB. On some systems you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance, on the order of 50% slower. I have never seen a system perform better with buffers larger than 256 KB than it does with 64 KB buffers. In my experience, 64 KB is the sweet spot.

Another thing you can do is use three threads: a reader, a processor, and a writer, communicating via queues. This can reduce the total time from (input-time + process-time + output-time) to something close to max(input-time, process-time, output-time). And with .NET it is very easy to set up. See my blog posts: Simple Multithreading, Part 1 and Simple Multithreading, Part 2.
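(The blog posts aren't reproduced here, but a minimal sketch of such a three-thread pipeline using BlockingCollection<T> might look like the following. The Pipeline class name, the queue capacities, and the processLine delegate are illustrative assumptions, not taken from the answer:)

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class Pipeline
{
    // Reader -> processor -> writer, connected by bounded queues so a slow
    // stage applies backpressure instead of letting the queues fill memory.
    public static void Run(string inputFile, string outputFile,
                           Func<string, string> processLine)
    {
        var rawLines = new BlockingCollection<string>(boundedCapacity: 10000);
        var processed = new BlockingCollection<string>(boundedCapacity: 10000);

        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(inputFile))
                rawLines.Add(line);
            rawLines.CompleteAdding();   // tell the processor there is no more input
        });

        var processor = Task.Run(() =>
        {
            foreach (var line in rawLines.GetConsumingEnumerable())
                processed.Add(processLine(line));
            processed.CompleteAdding();  // tell the writer we are done
        });

        var writer = Task.Run(() =>
        {
            using (var w = new StreamWriter(outputFile))
            {
                foreach (var line in processed.GetConsumingEnumerable())
                    w.WriteLine(line);
            }
        });

        Task.WaitAll(reader, processor, writer);
    }
}
```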

+10

According to the docs, StreamWriter does not automatically flush after each write by default, so it is buffered.
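(To illustrate that buffering: by default, writes sit in the buffer until it fills or the writer is flushed or disposed, and AutoFlush opts out of that. A small sketch; the FlushDemo class and file path are placeholders of my own:)

```csharp
using System.IO;

static class FlushDemo
{
    public static void Write(string path)
    {
        using (var writer = new StreamWriter(path))
        {
            // Buffered by default: this line may not have hit the disk yet.
            writer.WriteLine("buffered line");

            // AutoFlush = true forces a flush after every Write/WriteLine,
            // which is safer for logs but much slower for bulk output.
            writer.AutoFlush = true;
            writer.WriteLine("flushed immediately");
        } // Dispose flushes any remaining buffered data.
    }
}
```

For bulk output like yours, leaving AutoFlush off (the default) is exactly what you want.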

You can also use some of the lazy methods in the File class, for example:

 File.WriteAllLines("output.txt", File.ReadLines("filename.txt").Select(ProcessLine)); 

where ProcessLine is declared as follows:

 private string ProcessLine(string input)
 {
     // do some calculation on input
     string result = input;
     return result;
 }

Since ReadLines is lazy and WriteAllLines has an overload that takes an IEnumerable<string>, this streams through the file rather than trying to read it all into memory.

+9

How about batching up the lines before writing them?

Something like:

 int cnt = 0;
 var s = new StringBuilder();
 string line;
 while ((line = reader.ReadLine()) != null)
 {
     cnt++;
     string x = /* manipulate line */ line;
     s.AppendLine(x);
     if (cnt % 10000 == 0)
     {
         writer.Write(s);
         s.Clear();
     }
 }
 writer.Write(s); // don't forget the final partial batch

Edited because the comment below is right: I should have used StringBuilder.

+1

Source: https://habr.com/ru/post/1493539/

