Select one row from each group in a large data table based on a condition

Question

I have a table where the key is repeated several times, and I want to select only one row for each key: the row with the largest value in the other column.

This example shows the solution that I have at the moment:

    N = 10
    k = 2
    DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
         X           Y
     1:  1 -1.37925206
     2:  1 -0.53837461
     3:  2  0.26516340
     4:  2 -0.04643483
     5:  3  0.40331424
     6:  3  0.28667275
     7:  4 -0.30342327
     8:  4 -2.13143267
     9:  5  2.11178673
    10:  5 -0.98047230
    11:  6 -0.27230783
    12:  6 -0.79540934
    13:  7  1.54264549
    14:  7  0.40079650
    15:  8 -0.98474297
    16:  8  0.73179201
    17:  9 -0.34590491
    18:  9 -0.55897393
    19: 10  0.97523187
    20: 10  1.16924293
    > DT[, .SD[Y == max(Y)], by = X]
         X          Y
     1:  1 -0.5383746
     2:  2  0.2651634
     3:  3  0.4033142
     4:  4 -0.3034233
     5:  5  2.1117867
     6:  6 -0.2723078
     7:  7  1.5426455
     8:  8  0.7317920
     9:  9 -0.3459049
    10: 10  1.1692429

The problem is that for large data.tables this takes a very long time:

    N = 10000
    k = 25
    DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
    system.time(DT[, .SD[Y == max(Y)], by = X])
       user  system elapsed
       9.69    0.00    9.69

My actual table is about 100 million rows ...

Can anyone suggest a more efficient solution?

Edit - importance of setting the key

The proposed solution works well, but you must call setkey on DT (or otherwise sort it by X) for it to work.

Here is an example without "each" in rep, so X is not sorted:

    N = 10
    k = 2
    DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
    DT[DT[, Y == max(Y), by = X]$V1,]
         X           Y
     1:  1  1.26925708
     2:  4 -0.66625732
     3:  5  0.41498548
     4:  8  0.03531185
     5:  9  0.30608380
     6:  1  0.50308578
     7:  4  0.19848227
     8:  6  0.86458423
     9:  8  0.69825500
    10: 10 -0.38160503

r data.table

Corone Jan 9 '15 at 14:37

1 answer

akrun · Accepted Answer · 2015-01-09T14:46:16+0000

This will be faster than the .SD approach:

    system.time({
      setkey(DT, X)
      DT[DT[, Y == max(Y), by = X]$V1, ]
    })
    #   user  system elapsed
    #  0.016   0.000   0.016
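To see why this variant needs setkey: the logical vector returned by the grouped call is laid out group by group, not in DT's original row order, so on unsorted data it lines up with the wrong rows. A minimal sketch on made-up toy data (the values and the names flag, wrong, sorted and right are only for illustration, not from the question):

```r
library(data.table)

# Toy data, deliberately NOT sorted by X
DT <- data.table(X = c(1L, 2L, 1L, 2L), Y = c(10, 20, 5, 1))

# Grouped result is ordered by group: both rows of X == 1, then both rows of X == 2
flag <- DT[, Y == max(Y), by = X]$V1

# Applying that vector to the unsorted table picks rows 1 and 3,
# i.e. two rows from X == 1 and none from X == 2
wrong <- DT[flag]

# After sorting by X, the group order and the row order agree
sorted <- copy(DT)
setkey(sorted, X)
right <- sorted[sorted[, Y == max(Y), by = X]$V1]

print(wrong)
print(right)
```

On the toy data, wrong contains only X == 1 rows, while right has one row per key with the largest Y.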

or

    system.time(DT[DT[, .I[Y == max(Y)], by = X]$V1])
    #   user  system elapsed
    #  0.023   0.000   0.023
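The .I variant does not depend on the table being sorted, because .I holds the row numbers of the current group within the original DT. A small sketch on made-up toy data (the values and the names idx and res are illustrative assumptions):

```r
library(data.table)

# Toy data, deliberately NOT sorted by X
DT <- data.table(X = c(1L, 2L, 1L, 2L), Y = c(5, 1, 10, 20))

# Inside each group, .I is the vector of row numbers in DT for that group;
# subsetting it by Y == max(Y) keeps the row number(s) of the group maximum
idx <- DT[, .I[Y == max(Y)], by = X]$V1

# Index DT by those row numbers; no setkey is needed
res <- DT[idx]

print(idx)
print(res)
```

Here idx points at rows 3 and 4, which are the maxima for X == 1 and X == 2 respectively.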

If there are only two columns,

    system.time(DT[, list(Y = max(Y)), by = X])
    #   user  system elapsed
    #  0.006   0.000   0.007

Compared with

    system.time(DT[, .SD[Y == max(Y)], by = X])
    #   user  system elapsed
    #  2.946   0.006   2.962

Based on comments from @Khashaa and @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method compared to the devel version (1.9.5), which I used. You can get the same result with the CRAN version (from @Arun's comments) by setting this option:

  options(datatable.auto.index=FALSE)

NOTE: In the case of ties, the solutions described here will return multiple rows for each group (as pointed out by @docendo discimus). My solutions are based on the code posted by the OP.

If there are ties, you can use unique with the by argument (in case the number of columns is > 2):

    setkey(DT, X)
    unique(DT[DT[, Y == max(Y), by = X]$V1, ], by = c("X", "Y"))
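As an alternative that is not part of the timings in this answer, base R's which.max returns the position of the first maximum only, so combining it with .I should give exactly one row per group even when there are ties. A sketch on made-up toy data (the column Z and the names all_max and one_max are hypothetical):

```r
library(data.table)

# Toy data with a tie in group X == 1 (both rows have Y == 3)
DT <- data.table(X = c(1L, 1L, 2L, 2L),
                 Y = c(3, 3, 1, 2),
                 Z = c("a", "b", "c", "d"))

# Y == max(Y) keeps every tied row
all_max <- DT[DT[, .I[Y == max(Y)], by = X]$V1]

# which.max(Y) keeps only the first maximum in each group
one_max <- DT[DT[, .I[which.max(Y)], by = X]$V1]

print(all_max)
print(one_max)
```

Which tied row is kept depends on the row order within the group, so sort first if a specific one is wanted.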

microbenchmarks

    library(microbenchmark)
    f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
    f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
    f3 <- function(){DT[, list(Y=max(Y)), by=X]}
    f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
    microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
    #Unit: relative
    # expr        min         lq       mean     median         uq        max neval cld
    # f1()   2.794435   2.733706   3.024097   2.756398   2.832654   6.697893    20  a
    # f2()   4.302534   4.291715   4.535051   4.271834   4.342437   8.114811    20  a
    # f3()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20  a
    # f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691    20   b

data

    N = 10000
    k = 25
    set.seed(25)
    DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
