Select one row from each group in a large data table based on a condition

I have a table in which the key is repeated several times, and I want to select only one row for each key: the row with the largest value in the other column.

This example shows the solution that I have at the moment:

    > library(data.table)
    > N = 10
    > k = 2
    > DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
    > DT
         X           Y
     1:  1 -1.37925206
     2:  1 -0.53837461
     3:  2  0.26516340
     4:  2 -0.04643483
     5:  3  0.40331424
     6:  3  0.28667275
     7:  4 -0.30342327
     8:  4 -2.13143267
     9:  5  2.11178673
    10:  5 -0.98047230
    11:  6 -0.27230783
    12:  6 -0.79540934
    13:  7  1.54264549
    14:  7  0.40079650
    15:  8 -0.98474297
    16:  8  0.73179201
    17:  9 -0.34590491
    18:  9 -0.55897393
    19: 10  0.97523187
    20: 10  1.16924293
    > DT[, .SD[Y == max(Y)], by = X]
         X          Y
     1:  1 -0.5383746
     2:  2  0.2651634
     3:  3  0.4033142
     4:  4 -0.3034233
     5:  5  2.1117867
     6:  6 -0.2723078
     7:  7  1.5426455
     8:  8  0.7317920
     9:  9 -0.3459049
    10: 10  1.1692429

The problem is that for large data.tables this takes a very long time:

    N = 10000
    k = 25
    DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
    system.time(DT[, .SD[Y == max(Y)], by = X])
       user  system elapsed
       9.69    0.00    9.69

My actual table is about 100 million rows ...

Can anyone suggest a more efficient solution?


Edit: importance of the key

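The proposed solution works well, but the logical-index version only works if DT is keyed with setkey or otherwise sorted by X.

Here is an example without each in rep, so the rows of a group are not contiguous. The result is wrong: some keys appear twice and others not at all.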

    N = 10
    k = 2
    DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
    DT[DT[, Y == max(Y), by = X]$V1,]
         X           Y
     1:  1  1.26925708
     2:  4 -0.66625732
     3:  5  0.41498548
     4:  8  0.03531185
     5:  9  0.30608380
     6:  1  0.50308578
     7:  4  0.19848227
     8:  6  0.86458423
     9:  8  0.69825500
    10: 10 -0.38160503
1 answer

This is faster than the .SD approach:

    system.time({
      setkey(DT, X)
      DT[DT[, Y == max(Y), by = X]$V1, ]
    })
    #   user  system elapsed
    #  0.016   0.000   0.016

or

    system.time(DT[DT[, .I[Y == max(Y)], by = X]$V1])
    #   user  system elapsed
    #  0.023   0.000   0.023
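To make the difference between the two indexing styles above concrete, here is a minimal sketch on a small, deliberately unsorted table (the values are made up for illustration). The inner query returns its results group by group, so a logical vector built that way no longer lines up with the original row order, whereas .I carries the original row numbers and does not need a key.

```r
library(data.table)

# Hypothetical data: groups are interleaved, as with rep(1:N, k)
DT <- data.table(X = c(1, 2, 3, 1, 2, 3),
                 Y = c(0.1, 0.9, 0.4, 0.7, 0.2, 0.5))

# .I holds the row numbers of DT within each group, so this yields
# the original positions of the per-group maxima
idx <- DT[, .I[Y == max(Y)], by = X]$V1
res <- DT[idx]

# The logical vector is ordered by group, not by row, so on an
# unkeyed table it selects the wrong rows
wrong <- DT[DT[, Y == max(Y), by = X]$V1, ]

print(res)
print(wrong)
```

Here res has one row per key, while wrong repeats one key and drops another, which is the same symptom as in the question's edit.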

If there are only two columns,

    system.time(DT[, list(Y = max(Y)), by = X])
    #   user  system elapsed
    #  0.006   0.000   0.007

Compared with

    system.time(DT[, .SD[Y == max(Y)], by = X])
    #   user  system elapsed
    #  2.946   0.006   2.962

Based on comments by @Khashaa and @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method than the devel version (1.9.5), which is the one I used. According to @Arun's comments, you can get the same result with the CRAN version by setting this option:

    options(datatable.auto.index=FALSE)

NOTE: In the case of ties, the solutions described here return multiple rows for each group (as pointed out by @docendo discimus). My solutions are based on the code posted by the OP.

If there are ties, you can use unique with the by option (in case the number of columns is greater than 2):

    setkey(DT, X)
    unique(DT[DT[, Y == max(Y), by = X]$V1, ], by = c("X", "Y"))
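As an alternative to de-duplicating afterwards, one possible sketch (not part of the solutions above) combines .I with base R's which.max, which returns only the position of the first maximum in each group. The table and its extra column Z are hypothetical and exist only to show a tie.

```r
library(data.table)

# Hypothetical data: group 1 has two rows sharing the maximum Y
DT <- data.table(X = c(1, 1, 1, 2, 2),
                 Y = c(5, 5, 3, 2, 4),
                 Z = c("a", "b", "c", "d", "e"))

# Y == max(Y) keeps every tied row
with_ties <- DT[DT[, .I[Y == max(Y)], by = X]$V1]

# which.max() keeps only the first maximum, so each group
# contributes exactly one row
one_per_group <- DT[DT[, .I[which.max(Y)], by = X]$V1]

print(with_ties)
print(one_per_group)
```

Which tied row survives depends on row order, so this only suits cases where any one of the tied rows is acceptable.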

microbenchmarks

    library(microbenchmark)
    f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
    f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
    f3 <- function(){DT[, list(Y=max(Y)), by=X]}
    f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
    microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
    #Unit: relative
    # expr        min         lq       mean     median         uq        max neval cld
    # f1()   2.794435   2.733706   3.024097   2.756398   2.832654   6.697893    20  a
    # f2()   4.302534   4.291715   4.535051   4.271834   4.342437   8.114811    20  a
    # f3()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20  a
    # f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691    20   b

data

    library(data.table)
    N = 10000
    k = 25
    set.seed(25)
    DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))

Source: https://habr.com/ru/post/1210736/

