Split a large single-pass IEnumerable<T> in two using a condition

Question


Let's say we have a class Foo:

    using System;
    using System.Collections.Generic;
    using System.IO;

    public class Foo
    {
        public DateTime Timestamp { get; set; }
        public double Value { get; set; }
        // some other properties

        public static Foo CreateFromXml(Stream str)
        {
            Foo f = new Foo();
            // do the parsing
            return f;
        }

        // Lazily yields one Foo per matching file; each file is opened only when enumerated.
        public static IEnumerable<Foo> GetAllTheFoos(DirectoryInfo dir)
        {
            foreach (FileInfo fi in dir.EnumerateFiles("foo*.xml", SearchOption.TopDirectoryOnly))
            {
                using (FileStream fs = fi.OpenRead())
                    yield return Foo.CreateFromXml(fs);
            }
        }
    }

To give some perspective: the data in these files was recorded over about 2 years, usually at a rate of several Foos per minute.

Now: we have a parameter, TimeSpan TrainingPeriod, which is about 15 days. What I would like to do is call:

 var allTheData = GetAllTheFoos(myDirectory);

and get IEnumerable<Foo> TrainingSet, TestSet, where TrainingSet consists of the Foos from the first 15 days of recording and TestSet of all the rest. Then we want to calculate some data from TrainingSet in constant memory (for example, the average Value, some linear regressions, etc.), and then start consuming TestSet using the calculated values. In other words, my code should be semantically equivalent to:

    TimeSpan TrainingPeriod = new TimeSpan(15, 0, 0, 0); // 15 days (days, hours, minutes, seconds)
    var allTheData = GetAllTheFoos(myDirectory);
    List<Foo> allTheDataList = allTheData.ToList();
    var threshold = allTheDataList[0].Timestamp + TrainingPeriod;
    List<Foo> TrainingSet = allTheDataList.Where(foo => foo.Timestamp < threshold).ToList();
    List<Foo> TestSet = allTheDataList.Where(foo => foo.Timestamp >= threshold).ToList();

By the way, the XML file naming convention guarantees that the Foos are returned in chronological order. Of course, I do not want to hold all of this in memory, which is what happens every time .ToList() is called. So I came up with another solution:

    TimeSpan TrainingPeriod = new TimeSpan(15, 0, 0, 0);
    var allTheData = GetAllTheFoos(myDirectory);
    var threshold = allTheData.First().Timestamp + TrainingPeriod; // a minor issue
    var grouped = from foo in allTheData
                  group foo by foo.Timestamp < threshold;
    var TrainingSet = grouped.First(g => g.Key);
    var TestSet = grouped.First(g => !g.Key); // the major one

However, this piece of code has a minor problem and a major one. The minor one is that the first file is read at least twice, which does not really matter. The major one is that TrainingSet and TestSet appear to access the directory independently: each reads every file and keeps only the Foos that satisfy its timestamp condition. I'm not too surprised by this; in fact, if it had worked, I would have been surprised and would have had to rethink my understanding of LINQ. But it causes file access problems, and every file is parsed twice, which is a complete waste of CPU time.

So my question is: can I achieve this effect using only simple LINQ / C# tools? I think I could do it the old-fashioned way by implementing GetEnumerator(), MoveNext(), etc. myself. Please do not bother typing that out; I can manage it on my own.

However, if there is some elegant, short and pleasant solution for this, it would be very useful.

Thanks!

Edit:

Finally, I arrived at the following:

    public static void Handle(DirectoryInfo dir)
    {
        var allTheData = Foo.GetAllTheFoos(dir);
        var it = allTheData.GetEnumerator();
        it.MoveNext();
        TimeSpan trainingRange = new TimeSpan(15, 0, 0, 0);
        DateTime threshold = it.Current.Timestamp + trainingRange;
        double sum = 0.0;
        int count = 0;
        while (it.Current.Timestamp <= threshold)
        {
            sum += it.Current.Value;
            count++;
            it.MoveNext();
        }
        double avg = sum / (double)count;
        // now I can continue on with the 'it' IEnumerator
    }

Of course, some minor problems remain; for instance, it is very important to check the result of MoveNext() (have we reached the end of the IEnumerable?). But the general idea is clear, I hope. BUT in the real code it is not just an average that I calculate, but various kinds of regression, etc. So I would like to somehow extract the first part and pass it as an IEnumerable to a class derived from my

    public abstract class AbstractAverageCounter
    {
        public abstract void Accept(IEnumerable<Foo> theData);
        public AverageCounterResult Result { get; protected set; }
    }

in order to separate the responsibility for extracting the training data from the responsibility for processing it. Also, the process described above leaves me with an IEnumerator<Foo>, but I think an IEnumerable<Foo> would be preferable for passing to my instance of TheRestOfTheDataHandler.


performance c# xml linq bigdata

Wojciech Kozaczewski Feb 16 '15 at 13:53


1 answer

George Polevoy · Accepted Answer · 2015-02-16T20:32:43+0000

You can try implementing a stateful iterator pattern over the IEnumerator obtained from the original IEnumerable.

    IEnumerable<T> StatefulTake<T>(IEnumerator<T> source, Func<bool> getDone, Action setDone);

This method simply checks the done flag, calls MoveNext, yields Current, and sets the done flag when MoveNext returns false.
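A minimal sketch of what such a method could look like, assuming the caller has already made the initial MoveNext call. The class names StatefulEnumeration and Demo and the integer test data are hypothetical and used only for illustration; this is not the answerer's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StatefulEnumeration
{
    // Hypothetical implementation of the StatefulTake signature.
    // Assumes the caller already called source.MoveNext() once and that
    // getDone returns true when that call (or a later one) returned false.
    public static IEnumerable<T> StatefulTake<T>(
        IEnumerator<T> source, Func<bool> getDone, Action setDone)
    {
        while (!getDone())
        {
            // Yield first, advance afterwards: if the consumer stops at this
            // element (for example, TakeWhile rejects it), the element stays
            // in source.Current for the next StatefulTake call.
            yield return source.Current;
            if (!source.MoveNext())
                setDone();
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        IEnumerable<int> data = Enumerable.Range(1, 10);
        using (IEnumerator<int> e = data.GetEnumerator())
        {
            bool done = !e.MoveNext(); // the initial MoveNext call

            List<int> head = StatefulEnumeration
                .StatefulTake(e, () => done, () => done = true)
                .TakeWhile(x => x <= 3)
                .ToList();

            // Continues from the element that TakeWhile rejected.
            List<int> tail = StatefulEnumeration
                .StatefulTake(e, () => done, () => done = true)
                .ToList();

            Console.WriteLine(string.Join(",", head)); // 1,2,3
            Console.WriteLine(string.Join(",", tail)); // 4,5,6,7,8,9,10
        }
    }
}
```

Because the element is yielded before the enumerator is advanced, the first element that fails the TakeWhile predicate is not lost; the source is still enumerated only once.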

Then you split your sequence with successive calls to this method and perform partial enumeration using methods such as TakeWhile, Any, First ... You can then run any operations on top of this, but each of them must be enumerated to the end.

    var source = GetThemAll();
    using (var e = source.GetEnumerator())
    {
        bool done = !e.MoveNext(); // initial MoveNext on the enumerator
        foreach (var i in StatefulTake(e, () => done, () => done = true).TakeWhile(x => x.Time < ...))
        {
            //...
        }
        var theRestAverage = StatefulTake(e, () => done, () => done = true).Average(x => x.Score);
        //...
    }

It's a pattern that I often use in my asynchronous toolbox.

Update: fixed the signature of the StatefulTake method; it cannot use a ref parameter. An initial call to MoveNext is also required. The three kinds of references to the done variable, and the method itself, should be encapsulated in a context class.
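One possible shape for such a context class is sketched below. SplitContext, Rest, Sample and the generated test data are all hypothetical names and stand-ins (Sample plays the role of Foo, and the generated sequence plays the role of GetAllTheFoos); the answer does not specify any of them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical context class: owns the enumerator and the done flag.
public sealed class SplitContext<T> : IDisposable
{
    private readonly IEnumerator<T> _source;
    private bool _done;

    public SplitContext(IEnumerable<T> source)
    {
        _source = source.GetEnumerator();
        _done = !_source.MoveNext(); // the required initial MoveNext
    }

    // Each call resumes at the current element of the shared enumerator.
    public IEnumerable<T> Rest()
    {
        while (!_done)
        {
            yield return _source.Current;
            _done = !_source.MoveNext();
        }
    }

    public void Dispose() => _source.Dispose();
}

// Stand-in for Foo, used only for this illustration.
public sealed class Sample
{
    public DateTime Timestamp { get; set; }
    public double Value { get; set; }
}

public static class Program
{
    public static void Main()
    {
        DateTime start = new DateTime(2015, 1, 1);
        // Stand-in for GetAllTheFoos: 30 lazily generated daily samples.
        IEnumerable<Sample> data = Enumerable.Range(0, 30)
            .Select(d => new Sample { Timestamp = start.AddDays(d), Value = d });
        DateTime threshold = start + TimeSpan.FromDays(15);

        using (var ctx = new SplitContext<Sample>(data))
        {
            // Training part: consumed in constant memory.
            double trainingAvg = ctx.Rest()
                .TakeWhile(s => s.Timestamp < threshold)
                .Average(s => s.Value);

            // Test part: starts at the first sample rejected by TakeWhile.
            int testCount = ctx.Rest().Count();

            Console.WriteLine(trainingAvg); // 7
            Console.WriteLine(testCount);   // 15
        }
    }
}
```

With this shape, both parts can be handed to consumers as IEnumerable<T>, as long as the first part is enumerated before the second one is started.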
