Recursive assignment in data.table

Is it possible to recursively assign multiple columns in data.table? By recursive, I mean that a later assignment depends on the result of an earlier one:

    library(data.table)
    DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
    DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]
    # Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val), :
    #   cannot coerce type 'builtin' to vector of type 'double'

Of course, the assignments can be performed separately, but then I assume the overhead (such as grouping) is not shared between the two operations:

    DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
    DT[, c("cumsum"):=cumsum(val), by=id]
    DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
    DT
    #    id val cumsum cumsumofcumsum
    # 1:  A   1      1              1
    # 2:  A   2      3              4
    # 3:  B   3      3              3
    # 4:  B   4      7             10
    # 5:  C   5      5              5
    # 6:  C   6     11             16
    # 7:  D   7      7              7
    # 8:  D   8     15             22
1 answer

You can store the intermediate result in a temporary variable and reuse it for the other columns:

    DT[, c("cumsum", "cumsumofcumsum"):={
      x <- cumsum(val)
      list(x, cumsum(x))
    }, by=id]
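As a quick sanity check, here is a minimal sketch comparing the single-call version with the two separate assignments from the question. The names `DT1` and `DT2` are only illustrative and are not part of the original post. The original attempt most likely failed because the column `cumsum` does not exist yet while the same `:=` call is being evaluated, so `cumsum(cumsum)` receives the base function itself, which matches the "cannot coerce type 'builtin'" message.

```r
library(data.table)

# Same toy data as in the question; DT1 and DT2 are illustrative names
DT1 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT2 <- copy(DT1)  # independent copy, so := on DT1 does not touch DT2

# One grouped pass: compute the first column once and reuse it
DT1[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by = id]

# Two grouped passes, as in the question
DT2[, cumsum := cumsum(val), by = id]
DT2[, cumsumofcumsum := cumsum(cumsum), by = id]

# Compare the two results
all.equal(DT1, DT2)
```

The braces in `j` simply evaluate an ordinary R block per group, so any number of intermediate values can be chained this way as long as the block ends with a list of the columns to assign.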

Of course, you can also use dplyr with your data.table as the backend, but I'm not sure you will get the same performance as with the pure data.table method:

    library(dplyr)
    DT %>%
      group_by(id) %>%
      mutate(
        cum1 = cumsum(val),
        cum2 = cumsum(cum1)
      )

EDIT: adding some benchmarks:

The pure data.table solution is about 5 times faster than the dplyr one (median 3.26 vs. 15.91 in the timings below). I think the extra work dplyr does behind the scenes may explain this difference.

    f_dt <- function(){
      DT[, c("cumsum", "cumsumofcumsum"):={
        x <- as.numeric(cumsum(val))
        list(x, cumsum(x))
      }, by=id]
    }

    f_dplyr <- function(){
      DT %>%
        group_by(id) %>%
        mutate(
          cum1 = as.numeric(cumsum(val)),
          cum2 = cumsum(cum1)
        )
    }

    library(microbenchmark)
    microbenchmark(f_dt(), f_dplyr(), times = 100)
    #      expr       min       lq    median        uq       max neval
    #    f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
    # f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100

Source: https://habr.com/ru/post/1203997/

