Count the number of valid observations (no NA) pairwise in a data frame

Say I have a data frame like this:

    Df <- data.frame(
        V1 = c(1, 2, 3, NA, 5),
        V2 = c(1, 2, NA, 4, 5),
        V3 = c(NA, 2, NA, 4, NA)
    )

Now I want to calculate the number of valid observations for each combination of two variables. For this, I wrote a sharedcount function:

    sharedcount <- function(x, ...){
        nx <- names(x)
        alln <- combn(nx, 2)
        out <- apply(alln, 2,
            function(y) sum(complete.cases(x[y]))
        )
        data.frame(t(alln), out)
    }

This gives the result:

    > sharedcount(Df)
      X1 X2 out
    1 V1 V2   3
    2 V1 V3   1
    3 V2 V3   2

Everything is fine, but the function itself takes quite a lot of time on large data frames (600 variables and about 10,000 observations). I have the feeling I am missing a lighter approach, especially since cor(..., use = 'pairwise') works a whole lot faster while it has to do something similar:

    > require(rbenchmark)
    > benchmark(sharedcount(TestDf), cor(TestDf, use='pairwise'),
    +     columns=c('test','elapsed','relative'),
    +     replications=1
    + )
                               test elapsed relative
    2 cor(TestDf, use = "pairwise")    0.25      1.0
    1           sharedcount(TestDf)    1.90      7.6

Any advice is appreciated.


Note: using Vincent's trick, I wrote a function that returns the same data frame as my original one. The code is in my answer below.

3 answers

The following is slightly faster:

    x <- !is.na(Df)
    t(x) %*% x

    #        test elapsed relative
    #    cor(Df)  12.345 1.000000
    # t(x) %*% x  20.736 1.679708
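To see why this works on the Df from the question: !is.na(Df) is a logical indicator matrix, and entry (i, j) of t(x) %*% x counts the rows where both column i and column j are observed. A quick check on the example data (the output below is worked out by hand from the question's Df, not part of the original answer):

```r
x <- !is.na(Df)  # indicator matrix: TRUE where the value is observed
t(x) %*% x       # entry (i, j) = number of rows where both variables are non-NA
#    V1 V2 V3
# V1  4  3  1
# V2  3  4  2
# V3  1  2  2
```

The off-diagonal entries match the out column of sharedcount(Df), and the diagonal gives the per-variable counts of valid observations for free.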

I thought Vincent's solution looked very elegant, not to mention being faster than my humble loop, except that I appear to need the extraction step, which I added below. This is just one more example of the heavy overhead of the apply method when used with data frames.

    shrcnt <- function(Df) {
        Comb <- t(combn(1:ncol(Df), 2))
        shrd <- 1:nrow(Comb)
        for (i in seq_along(shrd)) {
            shrd[i] <- sum(complete.cases(Df[, Comb[i,1]], Df[, Comb[i,2]]))
        }
        return(shrd)
    }

    benchmark(
        shrcnt(Df),
        sharedcount(Df),
        {prs <- t(x) %*% x; prs[lower.tri(prs)]},
        cor(Df, use='pairwise'),
        columns=c('test','elapsed','relative'),
        replications=100
    )
    #--------------
                                   test elapsed relative
    3                                 {   0.008      1.0
    4         cor(Df, use = "pairwise")   0.020      2.5
    2                   sharedcount(Df)   0.092     11.5
    1                        shrcnt(Df)   0.036      4.5

Based on Vincent's wonderful trick and the additional lower.tri() selection, I came up with the following function, which gives me the same result (i.e. a data frame) as my original one and runs a lot faster:

    sharedcount2 <- function(x, stringsAsFactors=FALSE, ...){
        counts <- crossprod(!is.na(x))
        id <- lower.tri(counts)
        count <- counts[id]
        X1 <- colnames(counts)[col(counts)[id]]
        X2 <- rownames(counts)[row(counts)[id]]
        data.frame(X1, X2, count)
    }

Note the use of crossprod(), as this gives a slight improvement over %*%, but it does exactly the same thing.
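As a quick sanity check (a sketch, not part of the original answer), the equivalence is easy to verify directly:

```r
x <- !is.na(Df)
all(crossprod(x) == t(x) %*% x)  # TRUE: crossprod(x) computes t(x) %*% x in one call,
                                 # without materializing the transpose
```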

Timings:

    > benchmark(sharedcount(TestDf), sharedcount2(TestDf),
    +     replications=5,
    +     columns=c('test','replications','elapsed','relative'))
                      test replications elapsed relative
    1  sharedcount(TestDf)            5   10.00 90.90909
    2 sharedcount2(TestDf)            5    0.11  1.00000
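For completeness, here is a check (worked out by hand on the question's Df, not shown in the original answer) that sharedcount2() reproduces the original output; only the column name count differs from sharedcount()'s out:

```r
> sharedcount2(Df)
  X1 X2 count
1 V1 V2     3
2 V1 V3     1
3 V2 V3     2
```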

Note: I added TestDf to the question, as I noticed that the timings differ depending on the size of the data frame. As shown here, the increase in time is much more dramatic than in the comparison on the small data frame.


Source: https://habr.com/ru/post/1398029/

