How are GroupBy objects by numerical values โ€‹โ€‹with a tolerance factor?

I have a list of C # objects with the following simplified data:

ID, Price 2, 80.0 8, 44.25 14, 43.5 30, 79.98 54, 44.24 74, 80.01 

I try GroupBy to use the lowest number considering tolerance. for example, in the case of tolerance = 0.02, my expected result should be:

 44.24 -> 8, 54 43.5 -> 14 79.98 -> 2, 30, 74 

How can I do this while achieving good performance for large datasets? Is there a LINQ way in this case?

+6
source share
2 answers

It seemed to me that if you have a large data set, you would want to avoid the simple decision of sorting the values, and then collecting them when repeating over a sorted list, since sorting a large collection can be expensive. The most effective solution that I could think of that does not do any sorting explicitly was to build a tree in which each node contains elements in which the key falls into the "adjacent" range (where all keys are within each other's tolerance ) - the range for each node expands each time an element is added, which goes beyond the range of less than tolerance . I implemented a solution that turned out to be more complex and interesting than I expected - and, based on my rough benchmarking, it seems that this is done about half the time than a simple solution.

Here's my implementation as an extension method (so you can chain it, although, like the regular Group method, it will iterate over source completely as soon as the IEnumerable result is executed).

 public static IEnumerable<IGrouping<double, TValue>> GroupWithTolerance<TValue>( this IEnumerable<TValue> source, double tolerance, Func<TValue, double> keySelector) { if(source == null) throw new ArgumentNullException("source"); return GroupWithToleranceHelper<TValue>.Group(source, tolerance, keySelector); } private static class GroupWithToleranceHelper<TValue> { public static IEnumerable<IGrouping<double, TValue>> Group( IEnumerable<TValue> source, double tolerance, Func<TValue, double> keySelector) { Node root = null, current = null; foreach (var item in source) { var key = keySelector(item); if(root == null) root = new Node(key); current = root; while(true){ if(key < current.Min - tolerance) { current = (current.Left ?? (current.Left = new Node(key))); } else if(key > current.Max + tolerance) {current = (current.Right ?? (current.Right = new Node(key)));} else { current.Values.Add(item); if(current.Max < key){ current.Max = key; current.Redistribute(tolerance); } if(current.Min > key) { current.Min = key; current.Redistribute(tolerance); } break; } } } foreach (var entry in InOrder(root)) { yield return entry; } } private static IEnumerable<IGrouping<double, TValue>> InOrder(Node node) { if(node.Left != null) foreach (var element in InOrder(node.Left)) yield return element; yield return node; if(node.Right != null) foreach (var element in InOrder(node.Right)) yield return element; } private class Node : IGrouping<double, TValue> { public double Min; public double Max; public readonly List<TValue> Values = new List<TValue>(); public Node Left; public Node Right; public Node(double key) { Min = key; Max = key; } public double Key { get { return Min; } } IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); } public IEnumerator<TValue> GetEnumerator() { return Values.GetEnumerator(); } public IEnumerable<TValue> GetLeftValues(){ return Left == null ? Values : Values.Concat(Left.GetLeftValues()); } public IEnumerable<TValue> GetRightValues(){ return Right == null ? Values : Values.Concat(Right.GetRightValues()); } public void Redistribute(double tolerance) { if(this.Left != null) { this.Left.Redistribute(tolerance); if(this.Left.Max + tolerance > this.Min){ this.Values.AddRange(this.Left.GetRightValues()); this.Min = this.Left.Min; this.Left = this.Left.Left; } } if(this.Right != null) { this.Right.Redistribute(tolerance); if(this.Right.Min - tolerance < this.Max){ this.Values.AddRange(this.Right.GetLeftValues()); this.Max = this.Right.Max; this.Right = this.Right.Right; } } } } } 

You can switch double to another type if you need (I would like C # to have a numeric generic constraint).

+4
source

Here's an implementation of a simpler sort and collect method that Steve avoided.

 public static class EnumerableExtensions { public static IEnumerable<IGrouping<double, T>> GroupByWithTolerance<T>(this IEnumerable<T> source, Func<T, double> keySelector, double tolerance) { var orderedSource = source .Select(e => new {Key = keySelector(e), Value = e}) .OrderBy(e => e.Key); if (!orderedSource.Any()) yield break; var prev = orderedSource.First(); var itemGroup = new Group<double, T>(prev.Key) {prev.Value}; foreach (var current in orderedSource.Skip(1)) { if (current.Key - prev.Key <= tolerance) { itemGroup.Add(current.Value); } else { yield return itemGroup; itemGroup = new Group<double, T>(current.Key) {current.Value}; } prev = current; } yield return itemGroup; } private class Group<TKey, TSource> : List<TSource>, IGrouping<TKey, TSource> { public Group(TKey key) { Key = key; } public TKey Key { get; } } } 

EDIT

Sample Usage:

 [Test] public void Test() { var items = new[] { new Item {Id = 2, Price = 80.0}, new Item {Id = 8, Price = 44.25}, new Item {Id = 14, Price = 43.5}, new Item {Id = 30, Price = 79.98}, new Item {Id = 54, Price = 44.24}, new Item {Id = 74, Price = 80.01} }; var groups = items.GroupByWithTolerance(i => i.Price, 0.02); foreach (var itemGroup in groups) { var groupString = string.Join(", ", itemGroup.Select(i => i.ToString())); System.Console.WriteLine($"{itemGroup.Key} -> {groupString}"); } } private class Item { public int Id { get; set; } public double Price { get; set; } public override string ToString() => $"[ID: {Id}, Price: {Price}]"; } 

Output:

 43.5 -> [ID: 14, Price: 43.5] 44.24 -> [ID: 54, Price: 44.24], [ID: 8, Price: 44.25] 79.98 -> [ID: 30, Price: 79.98], [ID: 2, Price: 80], [ID: 74, Price: 80.01] 
0
source

Source: https://habr.com/ru/post/974015/


All Articles