data.tables and the sweep function

Using data.table, what is the fastest way to "sweep out" a statistic across a selection of columns?

Starting with (significantly larger versions of) DT:

    library(data.table)

    p <- 3
    DT <- data.table(id=c("A","B","C"), x1=c(10,20,30), x2=c(20,30,10))
    DT.totals <- DT[, list(id, total = x1+x2)]

I would like to arrive at the following data.table result, preferably by indexing the target columns (2:p) so that the key column is skipped:

         id   x1   x2
    [1,]  A 0.33 0.67
    [2,]  B 0.40 0.60
    [3,]  C 0.75 0.25
1 answer

I believe that something close to the following (which uses the relatively new set() function) will be the fastest:

    DT <- data.table(id = c("A","B","C"), x1 = c(10,20,30), x2 = c(20,30,10))
    total <- DT[ , x1 + x2]

    rr <- seq_len(nrow(DT))
    for(j in 2:3) set(DT, rr, j, DT[[j]]/total)
    DT
    #      id        x1        x2
    # [1,]  A 0.3333333 0.6666667
    # [2,]  B 0.4000000 0.6000000
    # [3,]  C 0.7500000 0.2500000
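Since the question asks for the target columns to be indexed as 2:p, the same idea can be sketched without naming x1 and x2 explicitly. This is only an illustration under a few assumptions: `cols` is a name introduced here, `p` is taken to be the number of columns, and the row totals are built with `Reduce()` over the selected columns (something like `rowSums()` would probably do as well).

```r
library(data.table)

DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30), x2 = c(20, 30, 10))

p    <- ncol(DT)   # assumption: every column after the first is a target
cols <- 2:p        # hypothetical helper holding the target column positions

# Row totals over the target columns only, skipping the id column
total <- Reduce(`+`, DT[, cols, with = FALSE])

# Overwrite each target column by reference; i is left at its default (all rows)
for (j in cols) set(DT, j = j, value = DT[[j]] / total)

DT
```

Leaving `i` out should be equivalent to passing `seq_len(nrow(DT))` as in the code above, since every row is being updated.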

FWIW, calls to set() take the following form:

    # set(x, i, j, value), where:
    #     x is a data.table
    #     i contains row indices
    #     j contains column indices
    #     value is the value to be assigned into the specified cells
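To make the argument roles concrete, here is a small sketch on a toy table (the data is made up purely for illustration). It assumes, as far as I know from the data.table documentation, that `j` may be given either as a column position or as a column name.

```r
library(data.table)

DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30))

# One cell: row 2 of the column named "x1"
set(DT, i = 2L, j = "x1", value = 99)

# Two cells: rows 1 and 3 of the second column; the single value is recycled
set(DT, i = c(1L, 3L), j = 2L, value = 0)

DT$x1
```

Both calls modify DT in place, so nothing needs to be assigned back.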

My suspicion about the relative speed of this, compared to other solutions, is based on this excerpt from data.table's NEWS file, in the section on changes in version 1.8.0:

    o   New function set(DT,i,j,value) allows fast assignment to elements
        of DT. Similar to := but avoids the overhead of [.data.table, so is
        much faster inside a loop. Less flexible than :=, but as flexible
        as matrix subassignment. Similar in spirit to setnames(), setcolorder(),
        setkey() and setattr(); ie, assigns by reference with no copy at all.

            M = matrix(1,nrow=100000,ncol=100)
            DF = as.data.frame(M)
            DT = as.data.table(M)
            system.time(for (i in 1:1000) DF[i,1L] <- i)   # 591.000s
            system.time(for (i in 1:1000) DT[i,V1:=i])     #   1.158s
            system.time(for (i in 1:1000) M[i,1L] <- i)    #   0.016s
            system.time(for (i in 1:1000) set(DT,i,1L,i))  #   0.027s

Source: https://habr.com/ru/post/912961/

