unique.data.table: select the last row instead of the first

Calling unique on a keyed data.table returns one row per group. When a group has duplicate rows, the first one is kept. When I need the last one instead (typically the most recent transaction in time), I use .SD[.N]

    library(data.table)
    library(microbenchmark)
    dt <- data.table(id=sample(letters, 10000, TRUE), var=rnorm(10000), key="id")
    microbenchmark(unique(dt), dt[, .SD[.N], by=id])

    Unit: microseconds
                       expr      min        lq    median       uq        max neval
                 unique(dt)  570.882  586.1155  595.8975  608.406   3209.122   100
     dt[, .SD[.N], by = id] 6532.739 6637.7745 6694.3820 6776.968 208264.433   100

Do you know a faster way to do the same?

3 answers

Create a data.table that contains unique combinations of key variables, then join using mult = 'last'

Using .SD is convenient but slow. You can use .I instead if you prefer.

    dtu <- unique(dt)[, key(dt), with = FALSE]
    dt[dtu, mult = 'last']

or

    dt[dt[, .I[.N], by = key(dt)]$V1]
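To check that both variants agree with the .SD[.N] version from the question, here is a minimal sketch on a made-up six-row table (the data and the names ref, via_join, idx and via_index are illustrative, not from the original post). One deliberate deviation: it calls unique(dt, by = key(dt)) rather than unique(dt), because as far as I know data.table v1.9.8 and later deduplicate on all columns by default instead of on the key.

```r
library(data.table)

# Small keyed table (illustrative data); setting the key sorts by id and
# keeps the original row order within each id.
dt <- data.table(id  = c("a", "a", "b", "b", "b", "c"),
                 var = c(1, 2, 3, 4, 5, 6),
                 key = "id")

# Reference result from the question: last row of each group via .SD
ref <- dt[, .SD[.N], by = id]

# Variant 1: join on the unique key values, keeping only the last match.
# by = key(dt) is passed explicitly (assumption: newer data.table versions
# no longer default to the key).
dtu <- unique(dt, by = key(dt))[, key(dt), with = FALSE]
via_join <- dt[dtu, mult = "last"]

# Variant 2: .I[.N] is the row number of the last row in each group;
# the collected row numbers then subset dt directly.
idx <- dt[, .I[.N], by = key(dt)]$V1
via_index <- dt[idx]

print(ref$var)        # 2 5 6
print(via_join$var)   # 2 5 6
print(via_index$var)  # 2 5 6
```

The .I variant avoids materialising .SD for every group, which is presumably where most of the time in the original benchmark goes.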

As of data.table v1.9.4, you can use fromLast = TRUE.

    microbenchmark(unique(dt, by = "id"),
                   dt[, .SD[.N], by=id],
                   unique(dt, by = "id", fromLast = TRUE))

    Unit: microseconds
                                       expr     min       lq     mean   median       uq      max neval cld
                      unique(dt, by = "id") 333.978 355.1900 406.1585 371.1360 393.4015 3203.769   100   a
                     dt[, .SD[.N], by = id] 519.320 541.4345 580.2176 553.6200 563.5490 2690.167   100   b
     unique(dt, by = "id", fromLast = TRUE) 338.190 366.4725 430.1296 380.9145 400.7730 4774.663   100   a
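The timings say nothing about which rows come back, so here is a small sketch of the difference between the default and fromLast = TRUE. The six-row table and the names first_rows and last_rows are made up for illustration.

```r
library(data.table)

# Illustrative data: var records the original row position within the table.
dt <- data.table(id  = c("a", "a", "b", "b", "b", "c"),
                 var = 1:6,
                 key = "id")

# Default: the first row of each id is kept.
first_rows <- unique(dt, by = "id")

# fromLast = TRUE: the last row of each id is kept instead.
last_rows <- unique(dt, by = "id", fromLast = TRUE)

print(first_rows$var)  # 1 3 6
print(last_rows$var)   # 2 5 6
```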

Here is another option, although it appears to be a bit slower than @mnel's answers, at least on the example data.

    dt[, list(var, RN = .N:1), by = id][RN == 1L]
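The one-liner is easier to follow when split into its two steps. A rough sketch on made-up data (the names numbered and last_rows are illustrative): RN counts down from the group size to 1 within each id, so the last row of every group ends up with RN equal to 1.

```r
library(data.table)

# Illustrative data, not from the original post.
dt <- data.table(id  = c("a", "a", "b", "b", "b", "c"),
                 var = 1:6,
                 key = "id")

# Step 1: add a reverse row counter within each id (.N is the group size).
numbered <- dt[, list(var, RN = .N:1), by = id]
print(numbered$RN)     # 2 1 3 2 1 1

# Step 2: keep the rows where the counter has reached 1, i.e. the last
# row of each group.
last_rows <- numbered[RN == 1L]
print(last_rows$var)   # 2 5 6
```

Note that the result carries the extra RN column, and that the intermediate table has as many rows as dt, which may explain why this variant is slower.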

Source: https://habr.com/ru/post/1490801/

