I want to take a bunch of descriptive statistics grouped by several hundred grouping vars. I know from How to group data.table by multiple columns? so that I can use list () in the grouping parameter if I need a stat for the combination of the vars grouping. In my case, I want the average for each level Y than the average for each level Z
set.seed(007)
DF <- data.frame(X=1:50000, Y=sample(c(0,1), 50000, TRUE), Z=sample(0:5, 50000, TRUE))
library(data.table)
DT <- data.table(DF)
DT[, mean(X), by=list(Y, Z)]
DT[, mean(X), by=c("Y", "Z")]
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
out <- do.call( rbind, out )
I have 100 million records and 400+ grouping vars, so something is needed - somewhat efficiently. The lapply option adds up to several days of additional processing time.
options( digits=15 )
start.time <- Sys.time()
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
end.time <- Sys.time()
time.taken <- end.time - start.time
start.time <- Sys.time()
DT[, mean(X), by=c("Y")]
DT[, mean(X), by=c("Z")]
end.time <- Sys.time()
time.taken2 <- end.time - start.time
time.taken - time.taken2
source
share