Say I have a data frame like this:
Df <- data.frame( V1 = c(1,2,3,NA,5), V2 = c(1,2,NA,4,5), V3 = c(NA,2,NA,4,NA) )
Now I want to calculate the number of valid observations for each combination of two variables. For this, I wrote a sharedcount function:
sharedcount <- function(x,...){
    nx <- names(x)
    alln <- combn(nx,2)
    out <- apply(alln,2,
      function(y)sum(complete.cases(x[y]))
    )
    data.frame(t(alln),out)
}
This gives the result:
> sharedcount(Df)
  X1 X2 out
1 V1 V2   3
2 V1 V3   1
3 V2 V3   2
This works, but the function takes quite a lot of time on large data frames (600 variables and about 10,000 observations). I have a feeling that I am overlooking a simpler approach, especially since cor(..., use = 'pairwise') still runs much faster, even though it has to do something similar:
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+     columns=c('test','elapsed','relative'),
+     replications=1
+ )
                           test elapsed relative
2 cor(TestDf, use = "pairwise")    0.25      1.0
1           sharedcount(TestDf)    1.90      7.6
Any advice is appreciated.
Note: Using Vincent's trick, I wrote a function that returns the same data frame. The code is in my answer below.
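For readers who want the general idea without leaving the question, here is a minimal sketch of one way to get the same counts without looping over pairs. It assumes the trick is a cross product of the "value is present" indicator matrix, which this post does not spell out, and the name sharedcount2 is hypothetical, not taken from any answer.

```r
# Sketch only: sharedcount2 is a hypothetical name, and the crossprod
# approach is an assumption about what the "trick" refers to.
sharedcount2 <- function(x) {
  nx <- names(x)

  # Indicator matrix: 1 where a value is present, 0 where it is NA.
  present <- !is.na(x)
  storage.mode(present) <- "numeric"

  # counts[i, j] = number of rows where columns i and j are both non-missing.
  counts <- crossprod(present)

  # Same pair order as combn(nx, 2) in the original function.
  idx <- combn(length(nx), 2)
  data.frame(
    X1  = nx[idx[1, ]],
    X2  = nx[idx[2, ]],
    out = counts[t(idx)]   # matrix indexing by (row, column) pairs
  )
}

Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)
res <- sharedcount2(Df)
print(res)
```

On the small Df above, this should give the same counts as sharedcount(Df) (3, 1 and 2). Whether it is actually faster on the large data frame would need to be checked with the same benchmark.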