Subset optimization with data.table in a loop

I have a basic question on how to optimize the following code. This is a heavily shortened version of my code. Basically, I have a large data table (> 50M rows), and I very frequently need to subset the data (say 10,000 times) and run some function on the subset (obviously more complex than the one shown in the example below, i.e. I need all the columns of the subset, and the function returns a new data table). I just chose the mean to keep the example simple.

library(data.table)

dt <- data.table(a=sample(letters, 1000000, replace=TRUE), b=sample(1:100000))

mm <- list()

foo <- function(x) mean(x$b)

for(i in 1:1000)
{
  mm[[i]] <-  foo(dt[a %in% sample(letters,5)])
}

Obviously, this is not the fastest way to write even this minimal example (one could set keys, etc.).

I am wondering, however, how to optimize the for loop. I had in mind to create indexes for the subsets and then use data.table dt[, foo(.SD), by=subset_ID], but I'm not sure how to do this, as I am sampling with replacement (so a row can belong to multiple groups). Any ideas based on data.table would be appreciated (e.g. how to remove the loop?).

1 answer

"I had in mind to create indexes for the subsets and then use data.table dt[, foo(.SD), by=subset_ID], but I'm not sure how to do this, as I am sampling with replacement (multiple group identifiers)."

With a join, you can have overlapping groups:

# convert to numeric
dt[, b := as.numeric(b)]

# make samples (melt() on a matrix is the reshape2 method, so call it explicitly)
set.seed(1)
mDT = setDT(reshape2::melt(replicate(1000, sample(letters,5))))
setnames(mDT, c("seqi", "g", "a"))

# compute function on each sample
dt[mDT, on=.(a), allow.cartesian=TRUE, .(g, b)][, .(res = mean(b)), by=g]

which gives

         g      res
   1:    1 50017.85
   2:    2 49980.03
   3:    3 50093.80
   4:    4 50087.67
   5:    5 49990.83
  ---              
 996:  996 50013.11
 997:  997 50095.43
 998:  998 49913.61
 999:  999 50058.44
1000: 1000 49909.36

To confirm this, you can check, for example,

dt[a %in% mDT[g == 1, a], mean(b)]
# [1] 50017.85
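
The same check can be run for every sample at once on toy data. This is only a minimal sketch: the tiny tables and the names toy, samples, by_loop and by_join are made up for illustration, and only the join pattern is taken from the code above.

```r
library(data.table)

# Toy data: three letters with a few rows each (illustrative values only)
toy <- data.table(a = c("x", "x", "y", "y", "z"), b = c(1, 3, 10, 20, 100))

# Two overlapping samples: group 1 = {x, y}, group 2 = {y, z}
samples <- data.table(g = c(1L, 1L, 2L, 2L), a = c("x", "y", "y", "z"))

# Loop version: one subset per sample, as in the question
by_loop <- sapply(1:2, function(i) toy[a %in% samples[g == i, a], mean(b)])

# Join version: rows with a == "y" are repeated once per group containing "y"
by_join <- toy[samples, on = .(a), allow.cartesian = TRUE, .(g, b)][
  , .(res = mean(b)), by = g]

# Both approaches should agree for every group
stopifnot(isTRUE(all.equal(by_loop, by_join$res)))
```

The join repeats the rows of a letter for each sample that contains it, which is what lets a single by=g pass handle overlapping groups.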

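The downside is that the join builds a very large intermediate table (one row for every matching row of dt in every sample), which can be heavy on RAM.

The grouped mean is optimized internally; see ?GForce. Note that b was converted to numeric beforehand.

If the real function can be computed from per-letter summaries (as mean can, from counts and sums), the work can be done on pre-aggregated data instead: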

dtagg = dt[, .(.N, sumb = sum(b)), by=a]

dtagg[mDT, on=.(a), .(g, sumb, N)][, lapply(.SD, sum), by=g][, .(g, res = sumb/N)]

         g      res
   1:    1 50017.85
   2:    2 49980.03
   3:    3 50093.80
   4:    4 50087.67
   5:    5 49990.83
  ---              
 996:  996 50013.11
 997:  997 50095.43
 998:  998 49913.61
 999:  999 50058.44
1000: 1000 49909.36
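
This works because a mean over several letters can be rebuilt from per-letter counts and sums. Below is a small sketch on toy data that compares the aggregated route with the raw join. The names toy, samples, toyagg, res_agg and res_raw are made up for illustration.

```r
library(data.table)

# Toy data and two overlapping samples (illustrative values only)
toy <- data.table(a = c("x", "x", "y", "y", "z"), b = c(1, 3, 10, 20, 100))
samples <- data.table(g = c(1L, 1L, 2L, 2L), a = c("x", "y", "y", "z"))

# Aggregate once per letter: row count and sum of b
toyagg <- toy[, .(N = .N, sumb = sum(b)), by = a]

# Join the two small tables, then combine the per-letter pieces within each sample
res_agg <- toyagg[samples, on = .(a), .(g, sumb, N)][
  , .(res = sum(sumb) / sum(N)), by = g]

# Direct computation on the raw rows, for comparison
res_raw <- toy[samples, on = .(a), allow.cartesian = TRUE, .(g, b)][
  , .(res = mean(b)), by = g]

stopifnot(isTRUE(all.equal(res_agg$res, res_raw$res)))
```

This shortcut only applies to functions that decompose this way. A median, for instance, cannot be rebuilt from per-letter sums and counts.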

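allow.cartesian is not needed here, because the join of mDT and dtagg stays small (each row of mDT matches exactly one row of dtagg). Timings for the three approaches:

  • 13.7 s for the OP's loop
  • 11.4 s for the join
  • 0.02 s for the join on the aggregated table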

Source: https://habr.com/ru/post/1689275/

