How can I make file IO with LINQ more efficient when processing many small XML files?

I have a batch of about 13 thousand XML files (potentially growing by hundreds per day) that I need to process with LINQ, filtering and converting the data to what I need and aggregating each of the seven possible event types into one file per event type (see below). So, 13k files into 7 files. The event types are well defined in the XML, so the filtering and aggregation are relatively simple. The aggregated files will then be used to generate MySQL insert statements for our database using a script I have already written, which also works well.

I have functional code and it does process the files, but it has been running for 23+ hours so far and appears to be only about halfway done (?). I neglected to add a file counter, so I don't really know, and I don't want to restart it. I can make a rough estimate by comparing the total size of the source files (360 MB or so) with the size of the processed output so far (180 MB or so). I expect to need to run this about a half dozen more times before we retire this data collection method (using XML files as a database) and move to using MySQL exclusively, so I hope I can find a more efficient way of processing the files. I don't want to spend a potential two days per run if I don't have to.

It runs locally on my machine, on a single hard drive (a 10k RPM Barracuda, I think). Would it be faster to read from one disk and write to a separate disk? I'm fairly sure my bottleneck is file IO: I open and close files literally thousands of times. Maybe I can reorganize things so each file is read only once and everything else happens in memory? I know that would be faster, but then I risk losing the whole data set if something goes wrong. Either way, I still have to open each of the 13k files to read it, process it, and write it out as XElements.

Here is the code I'm running. I use LINQPad and run the code as C# statements, but if necessary I can turn it into a proper executable. LINQPad just makes prototyping so easy! Let me know if sample XML would make this easier to follow; at first glance it didn't seem essential. The source files range from 2 KB to 285 KB, but only 300 or so exceed 100 KB; most are in the 25-50 KB range.

    string sourceDir = @"C:\splitXML\results\XML\";   // source for the 13k files
    string xmlDestDir = @"C:\results\XMLSorted\";     // destination for the resultant 7 files
    List<string> sourceList = new List<string>();
    sourceList = Directory.EnumerateFiles(sourceDir, "*.xml", SearchOption.AllDirectories).ToList();
    string destFile = null;
    string[] events = { "Creation", "Assignment", "Modification", "Repair", "RepairReview", "Termination", "Test" };

    foreach (string eventItem in events)
    {
        try
        {
            // this should only happen once, the first time through, and
            // shouldn't be a continuing problem
            destFile = Path.Combine(xmlDestDir, eventItem + "Uber.xml");
            if (!File.Exists(destFile))
            {
                XmlTextWriter writer = new XmlTextWriter(destFile, null);
                writer.WriteStartElement("PCBDatabase");
                writer.WriteEndElement();
                writer.Close();
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
    }

    foreach (var file in sourceList) // roughly 13k files
    {
        XDocument xd = XDocument.Load(file);
        var actionEvents =
            from e in xd.Descendants("PCBDatabase").Elements()
            select e;

        foreach (XElement actionEvent in actionEvents)
        {
            // this is where I think it's bogging down: constant file IO
            var eventName =
                from e in actionEvents.Elements()
                select e.Name;
            var eventType = eventName.First();
            destFile = Path.Combine(xmlDestDir, eventType + "Uber.xml");

            // another bottleneck: opening each destination file thousands of times
            XElement xeDoc = XElement.Load(destFile);
            xeDoc.Add(actionEvent);
            // and the last bottleneck: saving and closing it thousands of times
            xeDoc.Save(destFile);
        }
    }
3 answers

You are spending a huge amount of time re-opening the uber XML files and re-parsing them into XDocument objects. Since these uber files will be quite large, you want to open them once and write forward-only. The code below is a sample of how to do this. I also moved the eventType lookup out of the inner loop (since it does not depend on the loop variable).

Note that this sample recreates the uber files from scratch on each run. If that is not what you want, then instead of reading them back into an XDocument, I would suggest using the code below to write to temp files, and then using two XmlReader instances to read the existing uber file and the temp file and merge their contents with an XmlWriter (see the sketch after the code below).

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    public static void Main(string[] args)
    {
        string sourceDir = @"C:\splitXML\results\XML\";
        string xmlDestDir = @"C:\results\XMLSorted\";
        string[] events = { "Creation", "Assignment", "Modification", "Repair", "RepairReview", "Termination", "Test" };

        // one forward-only writer per event type, opened exactly once
        Dictionary<string, XmlWriter> writers = events.ToDictionary(
            e => e,
            e => XmlWriter.Create(Path.Combine(xmlDestDir, e + "Uber.xml")));

        foreach (var writer in writers.Values)
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("PCBDatabase");
        }

        foreach (var file in Directory.EnumerateFiles(sourceDir, "*.xml", SearchOption.AllDirectories)) // roughly 13k files
        {
            XDocument xd = XDocument.Load(file);
            var actionEvents =
                from e in xd.Descendants("PCBDatabase").Elements()
                select e;
            string eventType =
                (from e in actionEvents.Elements()
                 select e.Name.ToString()).First();

            foreach (XElement actionEvent in actionEvents)
            {
                actionEvent.WriteTo(writers[eventType]);
            }
        }

        foreach (var writer in writers.Values)
        {
            writer.WriteEndElement();
            writer.WriteEndDocument();
            writer.Close();
        }
    }
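For the merge step mentioned above, a rough sketch of my own (untested, with placeholder file paths, not part of the original answer) could look like the following: it copies the children of the PCBDatabase root from both the existing uber file and the temp file into a new merged file, without loading either document fully into memory.

    // Hypothetical helper: merge the children of <PCBDatabase> from two files
    // into a third file using streaming readers and a single writer.
    static void MergeUberFiles(string existingPath, string tempPath, string mergedPath)
    {
        using (XmlWriter writer = XmlWriter.Create(mergedPath))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("PCBDatabase");

            foreach (string path in new[] { existingPath, tempPath })
            {
                using (XmlReader reader = XmlReader.Create(path))
                {
                    reader.MoveToContent();   // position on the <PCBDatabase> root
                    reader.Read();            // step to its first child
                    while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                    {
                        if (reader.NodeType == XmlNodeType.Element)
                        {
                            // copies the whole element subtree and advances the reader
                            writer.WriteNode(reader, true);
                        }
                        else
                        {
                            reader.Read();    // skip whitespace, comments, etc.
                        }
                    }
                }
            }

            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }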

You have run into the classic antipattern: Schlemiel the Painter's algorithm.

With each source file, you re-read one of the uber XML files, modify it, and re-write it completely... so the more files you have already processed, the slower each new file becomes. Given the total size of your data, it might be better to keep the uber documents in memory and only write them out at the end of the process.
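A minimal sketch of that in-memory approach (my own illustration, not the poster's code; it reuses sourceDir, xmlDestDir, and events from the question and the same event-type detection):

    // one in-memory root element per event type
    var uberDocs = events.ToDictionary(
        e => e,
        e => new XElement("PCBDatabase"));

    foreach (var file in Directory.EnumerateFiles(sourceDir, "*.xml", SearchOption.AllDirectories))
    {
        XDocument xd = XDocument.Load(file);
        foreach (XElement actionEvent in xd.Descendants("PCBDatabase").Elements())
        {
            string eventType = actionEvent.Elements().First().Name.ToString();
            uberDocs[eventType].Add(actionEvent);
        }
    }

    // nothing touches the destination files until all 13k sources are processed
    foreach (var pair in uberDocs)
    {
        pair.Value.Save(Path.Combine(xmlDestDir, pair.Key + "Uber.xml"));
    }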

Another possible solution is to keep separate XmlWriters open, one for each of the uber files, and write to them as you go. They are stream-based, so you can keep appending new elements, and if they are backed by a FileStream, those writes go straight to the files.


Writing to the result file (and, more importantly, re-loading it every time you want to add an item) is what is really killing you. Keeping all the data you want to write in memory is also problematic, if for no other reason than that you may not have enough memory to do it. You need a middle ground, and that means batching. Read in a few hundred elements, hold them in an in-memory structure, and when it gets big enough (play with the batch size to see what works best), write them all out to the output file(s).

So we'll start with this Batch extension method, which chunks an IEnumerable:

    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
    {
        List<T> buffer = new List<T>(batchSize);
        foreach (T item in source)
        {
            buffer.Add(item);
            if (buffer.Count >= batchSize)
            {
                yield return buffer;
                buffer = new List<T>(batchSize);
            }
        }
        if (buffer.Count > 0)   // flush the final, partially filled batch (but not an empty one)
        {
            yield return buffer;
        }
    }
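For example (my own illustration, assuming Batch is defined as an extension method in scope), batching ten numbers into groups of three yields three full batches and one final partial batch:

    var batches = Enumerable.Range(1, 10).Batch(3);
    foreach (var batch in batches)
    {
        Console.WriteLine(string.Join(", ", batch));
    }
    // Output:
    // 1, 2, 3
    // 4, 5, 6
    // 7, 8, 9
    // 10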

Next, the query you're using can be reorganized to use LINQ more effectively. You have several sub-queries in there that don't actually do anything, and you can use SelectMany instead of explicit foreach loops to pull it all into a single query.

    var batchesToWrite = sourceList
        .SelectMany(file => XDocument.Load(file).Descendants("PCBDatabase").Elements())
        .Select((element, index) => new
        {
            element,
            index,
            file = Path.Combine(xmlDestDir, element.Elements().First().Name + "Uber.xml"),
        })
        .Batch(batchsize)   // batchsize is whatever size you settle on
        .Select(batch => batch.GroupBy(element => element.file));

Then simply write out each of the batches:

    foreach (var batch in batchesToWrite)
    {
        foreach (var group in batch)
        {
            WriteElementsToFile(group.Select(element => element.element), group.Key);
        }
    }

As for actually writing the elements to a file, I pulled that out into its own method because there are probably different ways you could write your output. You can start with the implementation you are already using, just to see how it performs:

    private static void WriteElementsToFile(IEnumerable<XElement> elements, string path)
    {
        XElement xeDoc = XElement.Load(path);
        foreach (var element in elements)
            xeDoc.Add(element);
        xeDoc.Save(path);
    }

But you still have the problem that you read in the entire output file just to add items to the end. Batching alone may reduce this enough for your purposes, but if it doesn't, you can work on this method on its own, perhaps using something other than LINQ to XML to write the results so that you don't have to load the whole output file into memory just to append to it.
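For example, one way to append without loading the whole document (a sketch of my own, not from the answer; it assumes the uber file uses a UTF-8/ASCII-compatible encoding and ends with the literal closing </PCBDatabase> tag and nothing after it) is to overwrite the closing tag, write the new elements, and then write the closing tag again:

    // Hypothetical replacement for WriteElementsToFile that never parses the
    // existing document: seek to the closing root tag and write over it.
    private static void AppendElementsToFile(IEnumerable<XElement> elements, string path)
    {
        const string closingTag = "</PCBDatabase>";
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // assumes the file ends with exactly the closing tag
            stream.Seek(-closingTag.Length, SeekOrigin.End);
            using (var writer = new StreamWriter(stream))
            {
                foreach (var element in elements)
                {
                    writer.WriteLine(element.ToString());
                }
                writer.WriteLine(closingTag);
            }
        }
    }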

