Given the following list of strings:
string[] Itens = new string[] { "hi", " hi ", "HI", "hí", " Hî", "hi hi", " hí hí ", "olá", "OLÁ", " olá ", "", "ola", "hola", " holà ", "aaaa", "áâàa", " aâàa ", "áaàa", "áâaa ", "aaaa ", "áâaa", "áâaa", };
Distinct should ignore case, accents and leading/trailing whitespace, so the expected result is:
hi, hi hi, olá, , hola, aaaa
In C#, the Distinct operation available for IEnumerable accepts an IEqualityComparer as a parameter, so the comparison can be customized.
The following implementation gets the job done:
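using System.Collections.Generic;
using System.Globalization;

class LengthHash : IEqualityComparer<string>
{
    // Assumption: the snippet does not show where Culture is defined;
    // the invariant culture is used here.
    static readonly CultureInfo Culture = CultureInfo.InvariantCulture;

    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;
        var xt = x.Trim();
        var yt = y.Trim();
        // Same trimmed length, and yt is found in xt when case and accents are ignored.
        return xt.Length == yt.Length && Culture.CompareInfo.IndexOf(xt, yt, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) >= 0;
    }

    // Items that compare equal always share a trimmed length, so this hash is consistent with Equals.
    public int GetHashCode(string obj) => obj?.Trim().Length ?? 1;
}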
If GetHashCode returns different values for two items, Equals is never executed for that pair, so a good implementation matters: items that should be treated as equal must produce the same hash.
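To illustrate that point, here is a minimal sketch (not taken from the benchmark) that counts how often Equals runs during Distinct. CountingComparer and HashDemo are hypothetical names, and the comparison is ordinal to keep the example independent of culture settings. It assumes the usual hash-set based Distinct implementation, where Equals only runs after two hash codes match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: uses a pluggable hash function and counts Equals calls.
public sealed class CountingComparer : IEqualityComparer<string>
{
    private readonly Func<string, int> _hash;
    public int EqualsCalls { get; private set; }

    public CountingComparer(Func<string, int> hash) { _hash = hash; }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string obj) => _hash(obj);
}

public static class HashDemo
{
    // Runs Distinct over four different strings and returns the number of Equals calls.
    public static int CountEqualsCalls(Func<string, int> hash)
    {
        var comparer = new CountingComparer(hash);
        var items = new[] { "a", "bb", "ccc", "dddd" };
        items.Distinct(comparer).ToList(); // force enumeration
        return comparer.EqualsCalls;
    }

    public static void Main()
    {
        // Every item has a different hash: Equals should not be needed.
        Console.WriteLine(CountEqualsCalls(s => s.Length));
        // Every item has the same hash: Equals has to resolve each collision.
        Console.WriteLine(CountEqualsCalls(s => 1));
    }
}
```

The flip side is that a hash which differs for two items that Equals would accept silently keeps both of them in the result.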
I also tried replacing GetHashCode with two other approaches.
IgnoreHash:
public int GetHashCode(string obj) => 1;
NormalizedHash:
public int GetHashCode(string obj) => obj?.Trim().Normalize().ToUpperInvariant().GetHashCode() ?? 1;
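One thing worth noting about NormalizedHash: Normalize() without arguments uses NormalizationForm.FormC, which keeps accented letters, so "hi" and "hí" end up with different hashes even though Equals treats them as equal. A possible alternative, sketched below under the assumption that stripping combining marks is acceptable for this data, is to decompose with FormD and drop the non-spacing marks before hashing. AccentKey and Fold are hypothetical names, not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentKey
{
    // Hypothetical helper: trims, decomposes accented letters (FormD),
    // drops the combining marks, and upper-cases what is left.
    public static string Fold(string s)
    {
        if (s == null) return string.Empty; // null and "" share a key; Equals still tells them apart
        var decomposed = s.Trim().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().ToUpperInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(Fold(" Hî "));
        Console.WriteLine(Fold("áâàa"));
    }
}
```

A GetHashCode built on this key could look like `AccentKey.Fold(obj).GetHashCode()`. Whether it is faster than the length-based hash would have to be measured.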
Besides using a custom IEqualityComparer, I also tried trimming the items before calling Distinct with StringComparer.InvariantCultureIgnoreCase, but that produces the same result as the Normalize and ToUpperInvariant version.
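For reference, the trim-then-compare variant can be sketched roughly as follows. TrimDemo and TrimmedDistinct are hypothetical names, and the exact code used in the benchmark may differ. The sample input uses only case and whitespace differences, because this comparer does not ignore accents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimDemo
{
    // Trims every item first, then lets the built-in comparer ignore case.
    public static List<string> TrimmedDistinct(IEnumerable<string> items) =>
        items.Select(x => x.Trim())
             .Distinct(StringComparer.InvariantCultureIgnoreCase)
             .ToList();

    public static void Main()
    {
        var result = TrimmedDistinct(new[] { "hi", " hi ", "HI", "hola" });
        Console.WriteLine(string.Join(", ", result));
    }
}
```

On frameworks that provide the StringComparer.Create(CultureInfo, CompareOptions) overload (an assumption about the target runtime), passing CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace may be worth trying in place of InvariantCultureIgnoreCase, since it is meant to ignore accents as well.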
Benchmark of each Distinct variant:
Method | Mean | StdErr | StdDev | Median |
------------------------------------ |----------- |---------- |---------- |----------- |
RunDefault | 2.2224 us | 0.0242 us | 0.2391 us | 2.1414 us |
RunHashAsLength | 6.0765 us | 0.0515 us | 0.1857 us | 6.1235 us |
RunIgnoreHash | 6.4078 us | 0.0640 us | 0.6140 us | 6.1982 us |
RunNormalizedHash | 14.5941 us | 0.0742 us | 0.3556 us | 14.4983 us |
RunTrimAndCompareWithStringComparer | 14.4935 us | 0.0213 us | 0.0768 us | 14.5352 us |
Output of each variant (item count, then the distinct items):
21 Default: hi, hi , HI, hí, Hî, hi hi, hí hí , olá, OLÁ, olá , , ola, hola, holà , aaaa, áâàa, aâàa , áaàa, áâaa , aaaa , áâaa
6 HashAsLength: hi, hi hi, olá, , hola, aaaa
6 IgnoreHash: hi, hi hi, olá, , hola, aaaa
15 NormalizedHash: hi, hí, Hî, hi hi, hí hí , olá, , ola, hola, holà , aaaa, áâàa, aâàa , áaàa, áâaa
15 RunTrimAndCompareWithStringComparer: hi, hí, Hî, hi hi, hí hí, olá, , ola, hola, holà, aaaa, áâàa, aâàa, áaàa, áâaa
https://gist.github.com/Flash3001/d50a6b43bba7bc61e3d85734e40dbed9
Question: what is the right way to implement GetHashCode and Equals of the IEqualityComparer for this case?