The union of the intersections of 2-set sequence sequence combinations

Question

The union of the intersections of 2-set sequence sequence combinations

How can I find a set of elements that occur in 2 or more sequences in a sequence of sequences?

In other words, I want different values to occur in at least 2 of the transmitted sequences.

Note: This is not the intersection of all sequences, but rather the union of the intersection of all pairs of sequences.

Note 2: A pair does not include a pair or two combinations of a sequence with itself. That would be stupid.

I made an attempt myself

public static IEnumerable<T> UnionOfIntersects<T>( this IEnumerable<IEnumerable<T>> source) { var pairs = from s1 in source from s2 in source select new { s1 , s2 }; var intersects = pairs .Where(p => p.s1 != p.s2) .Select(p => p.s1.Intersect(p.s2)); return intersects.SelectMany(i => i).Distinct(); }

but I am concerned that this might not be optimal, I think that it includes the intersections of pairs A, B and pairs B, A, which seem ineffective. I also think that there may be a more efficient way to join sets as they repeat.

I include a few input and output examples below:

 { { 1, 1, 2, 3, 4, 5, 7 }, { 5, 6, 7 }, { 2, 6, 7, 9 } , { 4 } }

returns

 { 2, 4, 5, 6, 7 }

and

 { { 1, 2, 3} } or { {} } or { }

returns

{ }

I am looking for the best combination of readability and potential performance.

EDIT

I did the initial testing of current answers, my code is here . The conclusion is below.

 Original valid:True DoomerOneLine valid:True DoomerSqlLike valid:True Svinja valid:True Adricadar valid:True Schmelter valid:True Original 100000 iterations in 82ms DoomerOneLine 100000 iterations in 58ms DoomerSqlLike 100000 iterations in 82ms Svinja 100000 iterations in 1039ms Adricadar 100000 iterations in 879ms Schmelter 100000 iterations in 9ms

At the moment, it seems that Tim Schmelter’s answer works better, at least by an order of magnitude.

+6

c # linq

Jodrell May 08 '15 at 8:05

source share

5 answers

 // init sequences var sequences = new int[][] { new int[] { 1, 2, 3, 4, 5, 7 }, new int[] { 5, 6, 7 }, new int[] { 2, 6, 7, 9 }, new int[] { 4 } };

One line method:

 var result = sequences .SelectMany(e => e.Distinct()) .GroupBy(e => e) .Where(e => e.Count() > 1) .Select(e => e.Key); // result is { 2 4 5 7 6 }

Sql-like method (with ordering):

 var result = ( from e in sequences.SelectMany(e => e.Distinct()) group e by e into g where g.Count() > 1 orderby g.Key select g.Key); // result is { 2 4 5 6 7 }

May be the fastest code (but not readable), O (N) complexity:

 var dic = new Dictionary<int, int>(); var subHash = new HashSet<int>(); int length = array.Length; for (int i = 0; i < length; i++) { subHash.Clear(); int subLength = array[i].Length; for (int j = 0; j < subLength; j++) { int n = array[i][j]; if (!subHash.Contains(n)) { int counter; if (dic.TryGetValue(n, out counter)) { // duplicate dic[n] = counter + 1; } else { // first occurance dic[n] = 1; } } else { // exclude duplucate in sub array subHash.Add(n); } } }

+4

General doomer May 08 '15 at 8:13

source share

You can skip already Intesected sequences, thus it will be a little faster.

 public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source) { var result = new List<T>(); var sequences = source.ToList(); for (int sequenceIdx = 0; sequenceIdx < sequences.Count(); sequenceIdx++) { var sequence = sequences[sequenceIdx]; for (int targetSequenceIdx = sequenceIdx + 1; targetSequenceIdx < sequences.Count; targetSequenceIdx++) { var targetSequence = sequences[targetSequenceIdx]; var intersections = sequence.Intersect(targetSequence); result.AddRange(intersections); } } return result.Distinct(); }

How it works?

 Input: {/*0*/ { 1, 2, 3, 4, 5, 7 } ,/*1*/ { 5, 6, 7 },/*2*/ { 2, 6, 7, 9 } , /*3*/{ 4 } }

Step 0: Intersect 0 with 1..3

Step 1: Intersection 1 with 2..3 (0 with 1 has already been crossed)

Step 2: Intersection 2 with 3 (0 with 2 and 1 with 2 have already been crossed)

Return: Individual items.

 Result: { 2, 4, 5, 6, 7 }

You can check it with the code below

 var lists = new List<List<int>> { new List<int> {1, 2, 3, 4, 5, 7}, new List<int> {5, 6, 7}, new List<int> {2, 6, 7, 9}, new List<int> {4 } }; var result = lists.UnionOfIntersects();

+1

adricadar May 08 '15 at 8:31

source share

This should be very close to optimal - how “readable” depends on your taste. In my opinion, this is also the most readable solution.

  var seenElements = new HashSet<T>(); var repeatedElements = new HashSet<T>(); foreach (var list in source) { foreach (var element in list.Distinct()) { if (seenElements.Contains(element)) { repeatedElements.Add(element); } else { seenElements.Add(element); } } } return repeatedElements;

+1

svinja May 08 '15 at 8:33

source share

This should nail:

 int[][] test = { new int[] { 1, 2, 3, 4, 5, 7 }, new int[] { 5, 6, 7 }, new int[] { 2, 6, 7, 9 }, new int[] { 4 } }; var result = test.SelectMany(a => a.Distinct()).GroupBy(x => x).Where(g => g.Count() > 1).Select(y => y.Key).ToList();

First, make sure that there are no duplicates in each sequence. Then you attach all the sequences to one sequence and look for duplicates, for example. here .

0

azt May 08 '15 at 8:25

source share

Tim schmelter · Accepted Answer · 2015-05-08T08:53:18+0000

You can try this approach, it can be more efficient, and also allows you to specify the minimum number of intersections and the comparator used:

 public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source , int minIntersectionCount , IEqualityComparer<T> comparer = null) { if (comparer == null) comparer = EqualityComparer<T>.Default; foreach (T item in source.SelectMany(s => s).Distinct(comparer)) { int containedInHowManySequences = 0; foreach (IEnumerable<T> seq in source) { bool contained = seq.Contains(item, comparer); if (contained) containedInHowManySequences++; if (containedInHowManySequences == minIntersectionCount) { yield return item; break; } } } }

Some explanatory words:

It lists all unique elements in all sequences. Because Distinct uses a set, this should be quite effective. This can help speed things up with many duplicates in all sequences.
The inner loop simply scans each sequence if it contains a unique element. Therefore, it uses Enumerable.Contains , which stops execution as soon as one element has been found (so duplicates are not a problem).
If the intersection counter reaches the intersection count with a minimum value, this element will be assigned and the next (unique) element will be checked.

The union of the intersections of 2-set sequence sequence combinations

How it works?

More articles: