Algorithm for checking the inequality of ordered large collections

Ok, I need to check whether two IEnumerable<T> sequences are equal. The order of the elements matters, which means that:

 {1, 2, 4, 1, 3} and {1, 2, 1, 3, 4} should not be equal. 

I have seen several answers on this site explaining how to do this with LINQ (for example, here).

The problem is that I have to repeatedly check the equality of fairly large collections (thousands of items) that are very likely not to be equal, so performance matters. As I see it, all the LINQ methods suggested in that answer (Count or Except) have to iterate over the entire collection, if I am not mistaken, which is usually unnecessary because a mismatch tends to show up early.

I came up with this code, which (I think) works well and is fast enough. I was wondering whether I am missing some obvious built-in way to do this (I would rather not reinvent the wheel if possible).

    public static bool IsEqualTo<T>(this IEnumerable<T> inner, IEnumerable<T> other)
        where T : IEquatable<T>
    {
        if (inner == null)
            throw new ArgumentNullException();
        if (object.ReferenceEquals(inner, other))
            return true;
        if (object.ReferenceEquals(other, null))
            return false;

        using (var innerEnumerator = inner.GetEnumerator())
        using (var otherEnumerator = other.GetEnumerator())
        {
            while (innerEnumerator.MoveNext())
            {
                if (!otherEnumerator.MoveNext() || !innerEnumerator.Current.Equals(otherEnumerator.Current))
                    return false;
            }
            return !otherEnumerator.MoveNext();
        }
    }
+5
2 answers

Basically, you are looking for short-circuit evaluation: the comparison should stop as soon as a mismatching element is found.

Enumerable.SequenceEqual (an extension method on IEnumerable<T>; see MSDN) already does this. Its implementation can be found at: http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs (line 806)
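To see the short-circuit behaviour rather than take it on faith, here is a minimal sketch. The class `SequenceEqualDemo` and its helpers `Counted` and `ElementsReadBeforeResult` are hypothetical names made up for this illustration; only `Enumerable.SequenceEqual` is the real library call. The helper counts how many elements are pulled from the first sequence before `SequenceEqual` returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceEqualDemo
{
    // Passes a sequence through unchanged while counting how many
    // elements the consumer actually pulls from it.
    public static IEnumerable<T> Counted<T>(IEnumerable<T> source, int[] pulled)
    {
        foreach (var item in source)
        {
            pulled[0]++;
            yield return item;
        }
    }

    // Compares two sequences with SequenceEqual and returns how many
    // elements of `first` were read before the result was known.
    public static int ElementsReadBeforeResult(
        IEnumerable<int> first, IEnumerable<int> second, out bool equal)
    {
        var pulled = new int[1];
        equal = Counted(first, pulled).SequenceEqual(second);
        return pulled[0];
    }
}
```

Calling `ElementsReadBeforeResult(new[] { 1, 2, 4, 1, 3 }, new[] { 1, 2, 1, 3, 4 }, out var equal)` sets `equal` to false, and the returned count should be smaller than the sequence length because the comparison stops at the first mismatch.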

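If you would rather write it by hand, a simple while loop is enough when order matters:

    // Assumes a and b are IList<T> (or at least ICollection<T>), so that
    // Count() and ElementAt() do not have to walk the whole sequence.
    int i = 0;
    int aCount = a.Count(); // Use `IList` so you can use the property for efficiency
    int bCount = b.Count(); // Use `IList` so you can use the property for efficiency
    if (aCount != bCount) return false;
    var comparer = EqualityComparer<T>.Default;
    // The bounds check stops ElementAt from throwing when every element matches.
    while (i < aCount && comparer.Equals(a.ElementAt(i), b.ElementAt(i))) i++;
    return i == aCount;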

Your function does basically the same thing and will work fine.

+8

If you often need to compare sequences, I would suggest defining a type that encapsulates an immutable sequence and implements ICollection together with either IList<T> or ICollection<T> (you could define two types: one that wraps an IList<T> and implements ICollection and IList<T>, and one that wraps an IEnumerable<T> and implements ICollection and ICollection<T>). This type should override Equals() and GetHashCode(), and should have fields for the cached count, a pair of Int64 fields and an Int32 field for the aggregate hash codes, and possibly an Int64 sequence-number field.

If client code calls GetHashCode, or if the elements have to be enumerated anyway to determine how many the wrapped collection contains, your code should enumerate the collection, compute a hash value for each element, combine those values into a pair of 64-bit hash values for the collection as a whole, and finally fold them into a 32-bit value suitable for GetHashCode. Although GetHashCode() only requires a single 32-bit value, I would suggest computing and storing more than that, for the reasons described below.

When performing an equality test, start by checking whether both objects wrap the same collection. If so, they are equal. Otherwise, check whether the collections contain the same number of elements and whether the aggregate hash codes match. If either condition fails, they are not equal. Otherwise, compare the individual elements against each other. Note that if the hash codes have not been computed yet, it may or may not be worth computing (and checking) them before performing the equality test; some benchmarking may reveal whether doing so helps or hurts. If the collection is going to be hashed eventually, it is better to do it sooner rather than later. On the other hand, if equality checks on a million-element collection will consistently report "not equal" after looking at just the first element, and nothing else needs the hash value, computing it would be a waste.
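The equality test described above could look roughly like the following. This is a simplified sketch under stated assumptions, not the answer's full design: `HashedSequence<T>` is a hypothetical name, it snapshots the source into a list instead of wrapping it, it keeps a single 64-bit hash rather than a pair, it omits the sequence-number field, and it only compares hashes when both sides already have one cached (one of the two options the paragraph mentions).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper for illustration only; not part of any library.
// Assumes the elements themselves are not mutated after wrapping.
public sealed class HashedSequence<T> : IEquatable<HashedSequence<T>>
{
    private readonly List<T> _items;
    private long? _hash; // cached 64-bit aggregate hash, computed on demand

    public HashedSequence(IEnumerable<T> items)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        _items = items.ToList(); // snapshot: the count is known from here on
    }

    public int Count => _items.Count;

    private long AggregateHash
    {
        get
        {
            if (_hash == null)
            {
                long h = 17;
                foreach (var item in _items)
                    h = unchecked(h * 31 + (item == null ? 0 : item.GetHashCode()));
                _hash = h;
            }
            return _hash.Value;
        }
    }

    public bool Equals(HashedSequence<T> other)
    {
        if (other is null) return false;
        if (ReferenceEquals(this, other)) return true;       // same wrapper
        if (Count != other.Count) return false;               // cheap early exit
        // Only use hashes that are already cached; do not force a full pass.
        if (_hash.HasValue && other._hash.HasValue && _hash != other._hash)
            return false;
        return _items.SequenceEqual(other._items);            // element by element
    }

    public override bool Equals(object obj) => Equals(obj as HashedSequence<T>);

    public override int GetHashCode()
    {
        long h = AggregateHash;
        return unchecked((int)(h ^ (h >> 32))); // fold 64 bits into 32
    }
}
```

Whether to force the hash computation inside Equals, as opposed to using it only when it is already cached as done here, is exactly the kind of choice that benchmarking on real data should settle.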

If two objects are found to be equal, it may be appropriate to replace the collection wrapped by the newer object with the collection wrapped by the older object, and to make the newer object's sequence number match that of the older object. Doing so increases the likelihood that, if the wrappers are compared again, they can be recognised as equal without examining any elements. Note that there are various other techniques for speeding up future equality tests, each involving different memory trade-offs; unfortunately, the approach with the best typical behavior has very bad worst-case behavior. Also note that although any wrapper that caches hash values will fail if the wrapped collections are modified externally, tracking down the cause of such failures can be difficult if the replacements described above are made.

If you compare many unequal collections, the ability to exit early using hash codes can be a major performance win. When computing the hash codes, I suggest using a pair of "independent" methods for computing 64-bit hash codes. The reason is that, depending on how the hash codes of the individual elements are computed, the probability of systematic hash collisions with a single hashing method may be unacceptably high. The cost of computing your own hash values is likely to be small compared with the cost of obtaining the hash values of the elements, so computing two or three independent hash functions is a cheap way to guard against systematic hash collisions.
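As a rough illustration of "a pair of independent 64-bit hashes", the sketch below mixes each element's hash code in two different ways during a single pass. The class `DualSequenceHash` and its methods are hypothetical names, and the two mixing schemes (an FNV-1a-style xor-multiply and a rotate-and-add with a different multiplier) are assumptions chosen for the example, not something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;

public static class DualSequenceHash
{
    // Computes two differently mixed, order-sensitive 64-bit hashes in one pass.
    public static (ulong First, ulong Second) Compute<T>(IEnumerable<T> items)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        var comparer = EqualityComparer<T>.Default;
        ulong h1 = 14695981039346656037UL; // FNV-1a 64-bit offset basis
        ulong h2 = 0;
        unchecked
        {
            foreach (var item in items)
            {
                // Each element's own hash code is obtained only once.
                ulong e = (uint)(item == null ? 0 : comparer.GetHashCode(item));
                // Scheme 1: xor, then multiply by the FNV 64-bit prime.
                h1 = (h1 ^ e) * 1099511628211UL;
                // Scheme 2: rotate left by 5, then add a scaled element hash.
                h2 = ((h2 << 5) | (h2 >> 59)) + e * 0x9E3779B97F4A7C15UL;
            }
        }
        return (h1, h2);
    }

    // Folds both 64-bit values into the single 32-bit value GetHashCode needs.
    public static int Fold((ulong First, ulong Second) hashes)
    {
        ulong x = hashes.First ^ hashes.Second;
        return unchecked((int)(x ^ (x >> 32)));
    }
}
```

Two collections would then be treated as "possibly equal" only if both 64-bit values match, which makes a collision caused by a weakness in one mixing scheme much less likely to slip through.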

0

Source: https://habr.com/ru/post/1204852/

