How to find equal columns in R?

Given the following:

    a <- c(1,2,3)
    b <- c(1,2,3)
    c <- c(4,5,6)
    A <- cbind(a,b,c)

I want to find which columns in A are equal to, for example, my vector a.

My first attempt:

    > which(a==A)
    [1] 1 2 3 4 5 6

That did not do what I wanted. (Honestly, I don't even understand what it did.) Second attempt:

    > a==A
            a    b     c
    [1,] TRUE TRUE FALSE
    [2,] TRUE TRUE FALSE
    [3,] TRUE TRUE FALSE

This is definitely a step in the right direction, but the result has been expanded into a matrix. I would prefer something like just one of its rows. How do I compare a vector with columns, and how do I find the columns in a matrix that are equal to a vector?

+4
4 answers

If you add an extra column:

    > A
         a b c  
    [1,] 1 1 4 4
    [2,] 2 2 5 2
    [3,] 3 3 6 1

Then you can see that this function is correct:

    > hasCol=function(A,a){colSums(a==A)==nrow(A)}
    > A[,hasCol(A,a)]
         a b
    [1,] 1 1
    [2,] 2 2
    [3,] 3 3

But the earlier version, which only requires at least one matching element, is not:

    > oopsCol=function(A,a){colSums(a==A)>0}
    > A[,oopsCol(A,a)]
         a b  
    [1,] 1 1 4
    [2,] 2 2 2
    [3,] 3 3 1

It also returns the 4,2,1 column, because the 2 in that column matches the 2 in 1,2,3.
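
To make the difference concrete, here is a small self-contained sketch that runs both checks and converts the logical results into column indices with `which()`. Building the fourth column with `cbind(a, b, c, c(4, 2, 1))` is an assumption about how the extra column was added; the answer does not show that step.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
# Assumed construction of the 4-column matrix printed above
A <- cbind(a, b, c, c(4, 2, 1))

# Strict check: every element of the column must match a
hasCol <- function(A, a) colSums(a == A) == nrow(A)
# Loose check: a single matching element is enough
oopsCol <- function(A, a) colSums(a == A) > 0

strict_idx <- which(hasCol(A, a))
loose_idx <- which(oopsCol(A, a))
print(unname(strict_idx))  # columns that fully equal a
print(unname(loose_idx))   # also includes the 4,2,1 column
```

Both checks assume that `length(a)` equals `nrow(A)`; otherwise the recycling in `a == A` no longer lines up with the columns.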

+7

Use identical. It is R's "scalar" comparison function: it returns a single logical value, not a vector.

    apply(A, 2, identical, a)
    #     a     b     c
    #  TRUE  TRUE FALSE

If A is a data frame in your real case, you'd better use sapply or vapply, because apply coerces its input to a matrix.

    d <- c("a", "b", "c")
    B <- data.frame(a, b, c, d)
    apply(B, 2, identical, a) # incorrect!
    #     a     b     c     d
    # FALSE FALSE FALSE FALSE
    sapply(B, identical, a) # correct
    #     a     b     c     d
    #  TRUE  TRUE FALSE FALSE

But note that data.frame coerces character inputs to factors unless you ask otherwise (this was the default before R 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE):

    sapply(B, identical, d) # incorrect
    #     a     b     c     d
    # FALSE FALSE FALSE FALSE
    C <- data.frame(a, b, c, d, stringsAsFactors = FALSE)
    sapply(C, identical, d) # correct
    #     a     b     c     d
    # FALSE FALSE FALSE  TRUE
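
vapply is mentioned above but not shown, so here is a minimal sketch of the same column check written with it. The `logical(1)` template tells `vapply` that each column must produce exactly one logical value; the anonymous wrapper function is only there for readability and is not part of the original answer.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
d <- c("a", "b", "c")
B <- data.frame(a, b, c, d)

# Each column must yield exactly one logical value, or vapply raises an error
res <- vapply(B, function(col) identical(col, a), logical(1))
print(res)
```

Unlike `sapply`, `vapply` never silently changes the shape or type of its result, which makes it the safer choice inside functions.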

identical is also significantly faster than using all + ==:

    library(microbenchmark)
    a <- 1:1000
    b <- c(1:999, 1001)
    microbenchmark(
      all(a == b),
      identical(a, b))
    # Unit: microseconds
    #              expr   min    lq median     uq    max
    # 1     all(a == b) 8.053 8.149 8.2195 8.3295 17.355
    # 2 identical(a, b) 1.082 1.182 1.2675 1.3435  3.635
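
One caveat is worth a short sketch: `identical` compares storage type and attributes as well as values, so an integer vector and a double vector holding the same numbers are not identical. In the benchmark above, `a` is integer while `b` is double (because `1001` is a double literal), so the two already differ in type, which may account for part of the speed difference. The variable names below are illustrative only.

```r
x_int <- 1:3          # integer vector
x_dbl <- c(1, 2, 3)   # double vector with the same values

strict <- identical(x_int, x_dbl)             # compares type as well as values
elementwise <- all(x_int == x_dbl)            # compares values only
tolerant <- isTRUE(all.equal(x_int, x_dbl))   # numeric comparison with tolerance
print(c(strict = strict, elementwise = elementwise, tolerant = tolerant))
```

If the matrix and the vector may differ in type, `all(a == b)` or `isTRUE(all.equal(a, b))` is probably closer to what is wanted than `identical`.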
+8

There is surely a better solution, but the following works:

    > a <- c(1,2,3)
    > b <- c(1,2,3)
    > c <- c(4,5,6)
    > A <- cbind(a,b,c)
    > sapply(1:ncol(A), function(i) all(a==A[,i]))
    [1]  TRUE  TRUE FALSE

And to get the indices:

    > which(sapply(1:ncol(A), function(i) all(a==A[,i])))
    [1] 1 2
+4
 colSums(a==A)==nrow(A) 

The recycling done by == effectively turns a into a matrix whose columns all equal a and whose dimensions equal those of A. colSums sums each column; since TRUE behaves like 1 and FALSE like 0, columns equal to a will have a sum equal to the number of rows. We use this observation to reduce the answer to a logical vector.
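
The recycling step can be spelled out explicitly. The sketch below builds the matrix that `==` implicitly compares against (called `a_mat` here, a name chosen only for illustration) and checks that it gives the same result as `a == A`. It assumes `length(a)` equals `nrow(A)`.

```r
a <- c(1, 2, 3)
A <- cbind(a = a, b = c(1, 2, 3), c = c(4, 5, 6))

# Spell out the recycling: repeat a down every column of a matrix shaped like A
a_mat <- matrix(a, nrow = nrow(A), ncol = ncol(A))

# The explicit comparison gives the same TRUE/FALSE values as the recycled one
same <- all((a_mat == A) == (a == A))
col_hits <- colSums(a == A)   # number of matching rows in each column
print(same)
print(col_hits)
print(col_hits == nrow(A))    # TRUE only for columns that match in every row
```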

EDIT:

    library(microbenchmark)
    A<-rep(1:14,1000);c(7,2000)->dim(A)
    1:7->a
    microbenchmark(
     apply(A,2,function(b) identical(a,b)),
     apply(A,2,function(b) all(a==b)),
     colSums(A==a)==nrow(A))
    # Unit: microseconds
    #                                       expr      min        lq    median
    # 1     apply(A, 2, function(b) all(a == b)) 9446.210 9825.6465 10278.335
    # 2 apply(A, 2, function(b) identical(a, b)) 9324.203 9915.7935 10314.833
    # 3               colSums(A == a) == nrow(A)  120.252  121.5885   140.185
    #           uq       max
    # 1 10648.7820 30588.765
    # 2 10868.5970 13905.095
    # 3   141.7035   162.858
-1

Source: https://habr.com/ru/post/1440730/

