How to call a function that returns multiple rows and columns in data.table

I want to call a function inside data.table that calculates a set of summary statistics as follows:

    summ.stats <- function(vec) {
      list(
        Min    = min(vec),
        Mean   = mean(vec),
        SD     = sd(vec),
        Median = median(vec),
        Max    = max(vec))
    }

and I want to call it in the j argument of data.table:

    library(data.table)

    DT <- data.table(a = c(1,2,3,1,2,3), b = c(1,4,3,2,1,4), c = c(2,3,4,5,2,1))
    DT[, summ.stats(b), by = a]

This works as expected and I get:

       a Min Mean        SD Median Max
    1: 1   1  1.5 0.7071068    1.5   2
    2: 2   1  2.5 2.1213203    2.5   4
    3: 3   3  3.5 0.7071068    3.5   4

But I would like to pass several variables to summ.stats. For instance:

 DT[, summ.stats(b, c), by=a] 

I want to get something like:

       a Var Min Mean        SD Median Max
    1: 1   b   1  1.5 0.7071068    1.5   2
    2: 2   b   1  2.5 2.1213203    2.5   4
    3: 3   b   3  3.5 0.7071068    3.5   4
    4: 1   c   2  3.5 2.1213203    3.5   5
    5: 2   c   2  2.5 0.7071068    2.5   3
    6: 3   c   1  2.5 2.1213203    2.5   4

What is the best way to do this?

2 answers

Alternatively, you can change your function as follows:

    summ.stats <- function(vec) {
      list(
        Var    = names(vec),
        Min    = sapply(vec, min),
        Mean   = sapply(vec, mean),
        SD     = sapply(vec, sd),
        Median = sapply(vec, median),
        Max    = sapply(vec, max))
    }

    DT[, summ.stats(.SD), by = a]  # no need for as.list(.SD) as Roger mentions
    #    a Var Min Mean        SD Median Max
    # 1: 1   b   1  1.5 0.7071068    1.5   2
    # 2: 1   c   2  3.5 2.1213203    3.5   5
    # 3: 2   b   1  2.5 2.1213203    2.5   4
    # 4: 2   c   2  2.5 0.7071068    2.5   3
    # 5: 3   b   3  3.5 0.7071068    3.5   4
    # 6: 3   c   1  2.5 2.1213203    2.5   4
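If the table holds columns that should not be summarised, the same call can be limited with data.table's `.SDcols` argument. The following is only a sketch of that idea: it reuses the `DT` and the list-returning `summ.stats` from this thread, and the choice of `c("b", "c")` is simply the two columns named in the question.

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 1, 2, 3),
                 b = c(1, 4, 3, 2, 1, 4),
                 c = c(2, 3, 4, 5, 2, 1))

# List-returning version from the answer: one element per output column,
# each of length ncol(.SD), so every group yields one row per variable.
summ.stats <- function(vec) {
  list(Var    = names(vec),
       Min    = sapply(vec, min),
       Mean   = sapply(vec, mean),
       SD     = sapply(vec, sd),
       Median = sapply(vec, median),
       Max    = sapply(vec, max))
}

# .SDcols restricts .SD to the named columns; here that is b and c,
# which happens to be every non-grouping column of this small table.
res <- DT[, summ.stats(.SD), by = a, .SDcols = c("b", "c")]
print(res)
```

With this table the result has the same six rows as above; the restriction only starts to matter once `DT` gains further columns.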

Without explicitly reshaping the data to long format, you can do something like:

    rbindlist(lapply(c('b','c'), function(x)
      data.table(var = x, DT[, summ.stats(get(x)), by = a])))
    #    var a Min Mean        SD Median Max
    # 1:   b 1   1  1.5 0.7071068    1.5   2
    # 2:   b 2   1  2.5 2.1213203    2.5   4
    # 3:   b 3   3  3.5 0.7071068    3.5   4
    # 4:   c 1   2  3.5 2.1213203    3.5   5
    # 5:   c 2   2  2.5 0.7071068    2.5   3
    # 6:   c 3   1  2.5 2.1213203    2.5   4

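Reshaping the data to long format first will also work:

    # reshape() names the stacked value column after the first entry of
    # varying (b) and adds a time column holding 'b' or 'c'
    reshape(DT, direction = 'long',
            varying = list(value = c('b','c')),
            times = c('b','c'))[, summ.stats(b), by = list(a, Var = time)]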
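The long-format route can also be written with data.table's own `melt` instead of base `reshape`. This is a substitute technique, not what the answer used, and the names `long` and `value` are illustrative choices. A minimal sketch, assuming the original one-vector `summ.stats` from the question:

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 1, 2, 3),
                 b = c(1, 4, 3, 2, 1, 4),
                 c = c(2, 3, 4, 5, 2, 1))

# Original function from the question: takes a single vector.
summ.stats <- function(vec) {
  list(Min = min(vec), Mean = mean(vec), SD = sd(vec),
       Median = median(vec), Max = max(vec))
}

# Stack b and c into one value column; Var records which column
# each value came from (melt stores it as a factor).
long <- melt(DT, id.vars = "a", measure.vars = c("b", "c"),
             variable.name = "Var", value.name = "value")

# One row of statistics per (Var, a) combination.
res <- long[, summ.stats(value), by = .(Var, a)]
print(res)
```

Grouping by `Var` first gives all rows for b followed by all rows for c, which is the ordering asked for in the question.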


Less efficiently, you can use ldply from plyr after slightly redefining the function:

    summ.stats2 <- function(vec) {
      data.table(
        Min    = min(vec),
        Mean   = mean(vec),
        SD     = sd(vec),
        Median = median(vec),
        Max    = max(vec))
    }

    library(plyr)
    DT[, ldply(lapply(.SD, summ.stats2)), by = a]
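The same per-column idea can probably be kept inside data.table, without plyr, by binding the per-column results with `rbindlist` and its `idcol` argument. This is offered as a sketch rather than as part of the answer; it assumes a data.table version that supports `idcol` and reuses the original list-returning `summ.stats` from the question.

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 1, 2, 3),
                 b = c(1, 4, 3, 2, 1, 4),
                 c = c(2, 3, 4, 5, 2, 1))

# Original function from the question: takes a single vector.
summ.stats <- function(vec) {
  list(Min = min(vec), Mean = mean(vec), SD = sd(vec),
       Median = median(vec), Max = max(vec))
}

# Within each group, lapply() yields a named list (b, c) of statistic lists;
# rbindlist() stacks them and idcol stores the list names in a Var column.
res <- DT[, rbindlist(lapply(.SD, summ.stats), idcol = "Var"), by = a]
print(res)
```

The function needs no redefinition here because `rbindlist` accepts plain lists as well as data.tables.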

Source: https://habr.com/ru/post/1493976/

