C# file write performance

Overview of my situation:

My task is to read lines from a file, reformat them into a more convenient format, and then write the result to an output file.

Here is an example of what needs to be done. Example file line:

ANO=2010;CPF=17834368168;YEARS=2010;2009;2008;2007;2006 <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY> 

Each line of the input file carries two important pieces of information: the CPF, which is the document number I work with, and the XML (which represents the query result for that document in the database).

What I need to achieve:

Each document in this old format has XML containing the query results for all years (2006 to 2010). After reformatting, each input line is converted into 5 output lines:

 CPF=17834368168;YEARS=2010; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
 CPF=17834368168;YEARS=2009; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
 CPF=17834368168;YEARS=2008; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
 CPF=17834368168;YEARS=2007; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
 CPF=17834368168;YEARS=2006; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

That is, one line per year, each containing that document's information for a single year. The output files are therefore five times longer than the input files.

Performance issue:

Each file has 400,000 lines, and I have 133 files to process.

This is currently my application's flow (sketched in code right after the list):

  1. Open a file
  2. Read a line
  3. Parse it into the new format
  4. Write the converted string(s) to the output file
  5. Go to 2 while lines remain
  6. Go to 1 while files remain
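
In schematic form the loop looks roughly like this - this is not my exact code, just its shape, and ConvertLine is a placeholder for the real reformatting logic:

    using System.Collections.Generic;
    using System.IO;

    class Converter
    {
        static void Main()
        {
            foreach (string inputPath in Directory.EnumerateFiles("input"))  // step 1 (and 6)
            {
                using (var reader = new StreamReader(inputPath))
                using (var writer = new StreamWriter(inputPath + ".out"))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)               // steps 2 and 5
                    {
                        foreach (string outLine in ConvertLine(line))        // step 3
                            writer.WriteLine(outLine);                       // step 4
                    }
                }
            }
        }

        // Placeholder for the reformatting step: in the real code this splits
        // one input line into its five per-year output lines.
        static IEnumerable<string> ConvertLine(string line)
        {
            yield return line;
        }
    }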

Each input file is about 700 MB, and the program always reads one file and writes the converted version to another file. With this process, a 400K file takes ~30 seconds.

Additional Information:

My machine runs on an Intel i5 processor with 8 GB of RAM.

I avoid creating tons of objects so as not to leak memory, and I use a using block when opening the input file.

What can I do to make it run faster?

+4
4 answers

I don't know what your code looks like, but here is an example that, on my box (admittedly with an SSD and an i7), processes a 400K file in about 50 ms.

I haven't even tried to optimize it - I just wrote it in the cleanest way I could think of. (Note that it is all lazily evaluated; File.ReadLines and File.WriteAllLines take care of opening and closing the files.)

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;

    class Test
    {
        public static void Main()
        {
            Stopwatch stopwatch = Stopwatch.StartNew();
            var lines = from line in File.ReadLines("input.txt")
                        let cpf = ParseCpf(line)
                        let xml = ParseXml(line)
                        from year in ParseYears(line)
                        select cpf + year + xml;
            File.WriteAllLines("output.txt", lines);
            stopwatch.Stop();
            Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
        }

        // Returns the CPF, in the form "CPF=xxxxxx;"
        static string ParseCpf(string line)
        {
            int start = line.IndexOf("CPF=");
            int end = line.IndexOf(";", start);
            // TODO: Validation
            return line.Substring(start, end + 1 - start);
        }

        // Returns a sequence of year values, in the form "YEARS=2010;"
        static IEnumerable<string> ParseYears(string line)
        {
            // First year.
            int start = line.IndexOf("YEARS=") + 6;
            int end = line.IndexOf(" ", start);
            // TODO: Validation
            string years = line.Substring(start, end - start);
            foreach (string year in years.Split(';'))
            {
                yield return "YEARS=" + year + ";";
            }
        }

        // Returns all the XML from the leading space onwards
        static string ParseXml(string line)
        {
            int start = line.IndexOf(" <?xml");
            // TODO: Validation
            return line.Substring(start);
        }
    }
+11

This looks like a great candidate for pipeline processing.

The basic idea is to have 3 concurrent Tasks, one for each "stage" of the pipeline, communicating with each other via queues (BlockingCollection):

  • The first task reads the input file line by line and puts the lines into a queue.
  • The second task takes lines from that queue, formats them, and puts the results into another queue.
  • The third task takes the formatted results from the second queue and writes them to the output file.

Ideally, task 1 should not wait for task 2 to complete before moving on to the next file.

You could even go crazy and put the pipeline for each individual file into its own parallel task, but that would thrash your HDD head so badly it would probably hurt more than help. For an SSD, on the other hand, it might be justified - in either case, measure before deciding.
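
Purely as an illustration, file-level parallelism could look like the sketch below (the "input" directory and ProcessOneFile are hypothetical; the body would be whatever per-file conversion you settle on):

    using System.IO;
    using System.Threading.Tasks;

    class FileLevelParallelism
    {
        static void Main()
        {
            // Cap the number of files in flight so a spinning disk is not
            // overwhelmed by competing readers and writers.
            var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
            Parallel.ForEach(Directory.EnumerateFiles("input"), options,
                path => ProcessOneFile(path, path + ".out"));
        }

        static void ProcessOneFile(string inputPath, string outputPath)
        {
            // Placeholder: run the sequential (or pipelined) conversion for
            // a single file here. A plain copy keeps the sketch runnable.
            File.WriteAllLines(outputPath, File.ReadLines(inputPath));
        }
    }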

--- EDIT ---

Using Jon Skeet's single-threaded implementation as a basis, here is what the pipeline version looks like (a working example):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class Test
    {
        struct Queue2Element
        {
            public string CPF;
            public List<string> Years;
            public string XML;
        }

        public static void Main()
        {
            Stopwatch stopwatch = Stopwatch.StartNew();

            var queue1 = new BlockingCollection<string>();
            var task1 = new Task(() =>
            {
                foreach (var line in File.ReadLines("input.txt"))
                    queue1.Add(line);
                queue1.CompleteAdding();
            });

            var queue2 = new BlockingCollection<Queue2Element>();
            var task2 = new Task(() =>
            {
                foreach (var line in queue1.GetConsumingEnumerable())
                    queue2.Add(new Queue2Element
                    {
                        CPF = ParseCpf(line),
                        XML = ParseXml(line),
                        Years = ParseYears(line).ToList()
                    });
                queue2.CompleteAdding();
            });

            var task3 = new Task(() =>
            {
                var lines = from element in queue2.GetConsumingEnumerable()
                            from year in element.Years
                            select element.CPF + year + element.XML;
                File.WriteAllLines("output.txt", lines);
            });

            task1.Start();
            task2.Start();
            task3.Start();
            Task.WaitAll(task1, task2, task3);

            stopwatch.Stop();
            Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
        }

        // Returns the CPF, in the form "CPF=xxxxxx;"
        static string ParseCpf(string line)
        {
            int start = line.IndexOf("CPF=");
            int end = line.IndexOf(";", start);
            // TODO: Validation
            return line.Substring(start, end + 1 - start);
        }

        // Returns a sequence of year values, in the form "YEARS=2010;"
        static IEnumerable<string> ParseYears(string line)
        {
            // First year.
            int start = line.IndexOf("YEARS=") + 6;
            int end = line.IndexOf(" ", start);
            // TODO: Validation
            string years = line.Substring(start, end - start);
            foreach (string year in years.Split(';'))
            {
                yield return "YEARS=" + year + ";";
            }
        }

        // Returns all the XML from the leading space onwards
        static string ParseXml(string line)
        {
            int start = line.IndexOf(" <?xml");
            // TODO: Validation
            return line.Substring(start);
        }
    }

As it turns out, the parallel version is only slightly faster than the serial version. Apparently the task is more I/O-bound than anything else, so pipelining does not help much. If you increase the amount of processing (for example, add robust validation), that might change things in favor of parallelism, but for now you are probably best off concentrating on sequential improvements (as Jon Skeet himself noted, his code is not as fast as it could be).

(Also, I tested with warm file caches - I wonder whether there is a way to clear the Windows file cache and see if the I/O depth of 2 in the pipelined version lets the hard drive optimize head movements better than the I/O depth of 1 in the serial version.)

+5

This is definitely not an I/O problem - check your processing, and use a profiler to find out what is allocating all those temporary strings, and where.

Show us your processing code - maybe you are using some inefficient string operations...

+2

There are a few basic things you can do right away...

  • Run multiple threads to process multiple files at once.
  • Use StringBuilder instead of string concatenation (see the sketch after this list).
  • If you are using XmlDocument to parse the XML, replace it with XmlTextReader and XmlTextWriter.
  • Do not convert strings to numbers and back to strings unless you actually need to.
  • Remove any redundant string operations. For example, do not call str.Contains just so you can call str.IndexOf on the next line. Instead, call str.IndexOf once, store the result in a local variable, and check whether it is >= 0 (also shown in the sketch below).
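
To make the StringBuilder and IndexOf points concrete, here is a small before/after sketch (the method names and search string are made up for illustration):

    using System.Text;

    class StringTips
    {
        // Inefficient: Contains followed by IndexOf scans the line twice,
        // and += allocates a brand-new string on every iteration.
        static string Slow(string[] parts, string line)
        {
            string result = "";
            if (line.Contains("CPF="))
            {
                int index = line.IndexOf("CPF=");
                result += line.Substring(index);
            }
            foreach (string part in parts)
                result += part;
            return result;
        }

        // Better: call IndexOf once and test its result, and build the
        // output in a StringBuilder so there is a single final allocation.
        static string Fast(string[] parts, string line)
        {
            var builder = new StringBuilder();
            int index = line.IndexOf("CPF=");
            if (index >= 0)
                builder.Append(line, index, line.Length - index);
            foreach (string part in parts)
                builder.Append(part);
            return builder.ToString();
        }
    }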

Time the different parts of your algorithm separately; a sketch of such a harness follows below. Start by just reading the whole file line by line, and measure that. Then write the same lines back out to a new file, and measure that. Then split the prefix information from the XML, and measure that. Then parse the XML... That way you will know where the bottleneck is and can focus on that part.
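
A minimal sketch of that kind of measurement harness (file names are placeholders; add one Measure call per stage you want to isolate):

    using System;
    using System.Diagnostics;
    using System.IO;

    class Timings
    {
        static void Main()
        {
            Measure("read only", () =>
            {
                foreach (var line in File.ReadLines("input.txt")) { }
            });

            Measure("read + write unchanged", () =>
            {
                File.WriteAllLines("copy.txt", File.ReadLines("input.txt"));
            });

            // Further stages to add: split the prefix from the XML, parse the
            // years, build the output lines, and so on.
        }

        static void Measure(string label, Action action)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            Console.WriteLine("{0}: {1} ms", label, stopwatch.ElapsedMilliseconds);
        }
    }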

+1

Source: https://habr.com/ru/post/1398288/

