Delete vectors that are subsets of other vectors in the list

I have a list with vectors of different lengths, for example:

    a = c(12345, 12367, 91670, 87276, 92865)
    b = c(12345, 87276, 89250)
    c = c(12367, 91670)
    d = c(23753, 82575, 91475, 10957, 92865, 24311)
    mylist = list(a, b, c, d)
    mylist
    # [[1]]
    # [1] 12345 12367 91670 87276 92865
    #
    # [[2]]
    # [1] 12345 87276 89250
    #
    # [[3]]
    # [1] 12367 91670
    #
    # [[4]]
    # [1] 23753 82575 91475 10957 92865 24311

My question is: how can I remove the vectors of this list that are a subset of another vector in the same list? In this case, how can I delete the third element of the list, which is a subset of the first?

+5
5 answers

This gives a new list without the items that are subsets of other items:

    newlist <- mylist[!sapply(seq_along(mylist), function(i)
        max(sapply(mylist[-i], function(L) all(mylist[[i]] %in% L))))]
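One caveat worth checking: every vector is a subset of an identical copy of itself, so if the list contained duplicated vectors, the one-liner above would drop both copies. Below is a minimal sketch of a variant that guards against this, assuming the vectors are plain numeric or character vectors and that keeping one copy of each duplicate is acceptable. The helper name drop_subsets is hypothetical and not part of the answer.

```r
# Hypothetical helper: drop vectors that are subsets of another vector,
# keeping one copy of any duplicated vector.
drop_subsets <- function(lst) {
  # De-duplicate by set content (order-insensitive), so that identical
  # vectors do not eliminate each other in the subset test below.
  keys <- vapply(lst, function(v) paste(sort(unique(v)), collapse = ","),
                 character(1))
  lst <- lst[!duplicated(keys)]

  # For each vector, test whether it is contained in any *other* vector.
  is_sub <- vapply(seq_along(lst), function(i) {
    any(vapply(lst[-i], function(other) all(lst[[i]] %in% other), logical(1)))
  }, logical(1))

  lst[!is_sub]
}

mylist <- list(c(12345, 12367, 91670, 87276, 92865),
               c(12345, 87276, 89250),
               c(12367, 91670),
               c(23753, 82575, 91475, 10957, 92865, 24311))
newlist <- drop_subsets(mylist)
length(newlist)
```

Using vapply instead of sapply also keeps the result a logical vector when the list has a single element, where mylist[-i] is empty.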
+2

This may be quite inefficient, but if your list is not too big, it should work:

    find_nested <- function(mylist) {
        mm <- sapply(mylist, function(x) sapply(mylist, function(y) all(x %in% y)))
        diag(mm) <- FALSE
        apply(mm, 2, any)
    }

This tells you which vectors are subsets of other vectors. It does this by comparing each vector with every other vector.

    find_nested(mylist)
    # [1] FALSE FALSE  TRUE FALSE

So we see that the third element is contained in another vector of the list.
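To actually drop the nested vectors, the logical vector returned by find_nested can be negated and used as an index. A minimal sketch, reusing the question's data and the function above (only the name newlist is introduced here):

```r
mylist <- list(c(12345, 12367, 91670, 87276, 92865),
               c(12345, 87276, 89250),
               c(12367, 91670),
               c(23753, 82575, 91475, 10957, 92865, 24311))

find_nested <- function(mylist) {
  # mm[j, i] is TRUE when vector i is contained in vector j
  mm <- sapply(mylist, function(x) sapply(mylist, function(y) all(x %in% y)))
  diag(mm) <- FALSE          # ignore each vector's match with itself
  apply(mm, 2, any)          # TRUE if vector i is inside any other vector
}

# Keep only the vectors that are not nested in another one
newlist <- mylist[!find_nested(mylist)]
length(newlist)
```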

+3
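    which(t(sapply(seq_along(mylist), function(i)
        sapply(mylist[-i], function(a) all(unlist(mylist[i]) %in% a)))),
        arr.ind = TRUE)
    #      row col
    # [1,]   3   1
    # Suggests that the 3rd item is contained within the 1st item.
    # Note: col is a position in mylist[-i], so it matches the original
    # index only when col < row; otherwise the original index is col + 1.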
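Because the column reported above is a position within mylist[-i], it can be off by one relative to the original list. As a hedged alternative sketch, the comparison can be run over the full list so that both indices refer to original positions. The function name subset_pairs and the column labels are hypothetical, chosen for illustration.

```r
# Hypothetical helper: return a two-column matrix of (subset, of) pairs,
# where both columns are positions in the original list.
subset_pairs <- function(lst) {
  n <- length(lst)
  contained <- matrix(FALSE, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Skip the diagonal: every vector trivially contains itself
      if (i != j) contained[i, j] <- all(lst[[i]] %in% lst[[j]])
    }
  }
  pairs <- which(contained, arr.ind = TRUE)
  colnames(pairs) <- c("subset", "of")
  pairs
}

mylist <- list(c(12345, 12367, 91670, 87276, 92865),
               c(12345, 87276, 89250),
               c(12367, 91670),
               c(23753, 82575, 91475, 10957, 92865, 24311))
pairs <- subset_pairs(mylist)
pairs
```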
+2

Here is another method. It is also not very efficient, but it returns the positions of the nested vectors.

    # get ordered pairwise combinations of list positions
    combos <- combn(seq_along(mylist), 2)
    combos <- cbind(combos, combos[2:1, ])

Order matters because the subset comparison is not symmetric. Now loop over these combinations and compare each pair using intersect.

    combos[1, sapply(seq_len(ncol(combos)), function(i)
        setequal(intersect(mylist[[combos[1, i]]], mylist[[combos[2, i]]]),
                 mylist[[combos[1, i]]]))]
    # [1] 3

Rewriting the last line to use mapply rather than sapply can improve readability:

    combos[1, mapply(function(x, y)
        setequal(intersect(mylist[[x]], mylist[[y]]), mylist[[x]]),
        combos[1, ], combos[2, ])]
    # [1] 3
+1

An alternative to working on the list directly is to build a table of the elements "by value":

    table(values = unlist(mylist), elt = rep(seq_along(mylist), lengths(mylist)))
    #        elt
    # values  1 2 3 4
    #   10957 0 0 0 1
    #   12345 1 1 0 0
    #   12367 1 0 1 0
    #   23753 0 0 0 1
    #   24311 0 0 0 1
    #   82575 0 0 0 1
    #   87276 1 1 0 0
    #   89250 0 1 0 0
    #   91475 0 0 0 1
    #   91670 1 0 1 0
    #   92865 1 0 0 1
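The dense table can already answer the subset question: the cross-product of its columns counts the values shared by each pair of vectors, and vector i is a subset of vector j when that count equals the length of vector i. A minimal base-R sketch under the assumption that each value should count once per vector (hence the unique() guard); the names sets, shared and is_subset are illustrative only.

```r
mylist <- list(c(12345, 12367, 91670, 87276, 92865),
               c(12345, 87276, 89250),
               c(12367, 91670),
               c(23753, 82575, 91475, 10957, 92865, 24311))

# Count each value at most once per vector
sets <- lapply(mylist, unique)

# Value-by-element incidence table (0/1 entries)
tab <- table(values = unlist(sets),
             elt = rep(seq_along(sets), lengths(sets)))

# shared[i, j] = number of values that vectors i and j have in common
shared <- crossprod(unclass(tab))

# lengths(sets) is recycled down the columns, so row i is compared
# with the size of vector i: TRUE means "i is a subset of j"
is_subset <- shared == lengths(sets)
diag(is_subset) <- FALSE   # ignore the trivial self-containment

idx <- which(is_subset, arr.ind = TRUE)
idx
```

Here the row index is the contained vector and the column index is the vector that contains it.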

This can very easily consume a lot of memory, so we can pursue a sparse alternative:

    library(Matrix)  # provides sparseMatrix()

    l = unlist(mylist)
    ul = unique(l)
    tab = sparseMatrix(x = TRUE, i = match(l, ul),
                       j = rep(seq_along(mylist), lengths(mylist)),
                       dimnames = list(ul, sprintf("elt_%d", seq_along(mylist))))
    tab
    # 11 x 4 sparse Matrix of class "lgCMatrix"
    #       elt_1 elt_2 elt_3 elt_4
    # 12345     |     |     .     .
    # 12367     |     .     |     .
    # 91670     |     .     |     .
    # 87276     |     |     .     .
    # 92865     |     .     .     |
    # 89250     .     |     .     .
    # 23753     .     .     .     |
    # 82575     .     .     .     |
    # 91475     .     .     .     |
    # 10957     .     .     .     |
    # 24311     .     .     .     |

Then, to find which element is a subset of which:

    subsets = lengths(mylist) == crossprod(tab)
    subsets
    # 4 x 4 sparse Matrix of class "lgCMatrix"
    #       elt_1 elt_2 elt_3 elt_4
    # elt_1     |     .     .     .
    # elt_2     .     |     .     .
    # elt_3     |     .     |     .
    # elt_4     .     .     .     |

Here each element is a subset of itself (the diagonal), and element 3 is a subset of element 1. To get the information we need, we could use:

    subset(summary(subsets), i != j)[c("i", "j")]
    #   i j
    # 2 3 1

Or, to avoid building the diagonal entries and then subsetting them away, we could work directly with the slots of the existing sparse structure:

    dp = diff(subsets@p)
    j = rep(0:(length(dp) - 1), dp)
    wh = subsets@i != j
    cbind(subset = subsets@i[wh], of = j[wh]) + 1L
    #      subset of
    # [1,]      3  1

In both of the latter cases, the unique values of the first column show which elements are subsets of another element, and they can be used to index mylist with [.
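As a sketch of that last indexing step: the vector subset_idx below is a hypothetical name, hard-coded here to stand in for the unique values of the first column from either result. A logical index is used rather than a negative one, because mylist[-integer(0)] returns an empty list when no subsets are found.

```r
mylist <- list(c(12345, 12367, 91670, 87276, 92865),
               c(12345, 87276, 89250),
               c(12367, 91670),
               c(23753, 82575, 91475, 10957, 92865, 24311))

# Assumed input: positions of the vectors that are subsets of another one
subset_idx <- c(3L)

# Logical indexing keeps everything when subset_idx is empty
newlist <- mylist[!seq_along(mylist) %in% subset_idx]
length(newlist)

# The same expression with no subsets found leaves the list unchanged
unchanged <- mylist[!seq_along(mylist) %in% integer(0)]
length(unchanged)
```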

+1

Source: https://habr.com/ru/post/1266034/

