Creating a sequence number for blocks of records in an R data frame

I have a fairly large data set (by my standards), and I want to create an ordinal sequence number for blocks of records. I can use the plyr package, but the runtime is very slow. The code below builds a data frame of comparable dimensions.

    ## simulate an example of the size of a normal data frame
    N <- 30000
    id <- sample(1:17000, N, replace=T)
    term <- as.character(sample(c(9:12), N, replace=T))
    date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by="day"), N, replace=T)
    char <- data.frame(matrix(sample(LETTERS, N*50, replace=T), N, 50))
    val <- data.frame(matrix(rnorm(N*50), N, 50))
    df <- data.frame(id, term, date, char, val, stringsAsFactors=F)
    dim(df)

In fact, this is a little smaller than what I'm actually working with, since the values are usually larger... but it's close enough.

Here is the runtime on my machine:

    > system.time(test.plyr <- ddply(df,
    +                                .(id, term),
    +                                summarise,
    +                                seqnum = 1:length(id),
    +                                .progress="text"))
      |===============================================================| 100%
       user  system elapsed 
      63.52    0.03   63.85 

Is there a "better" way to do this? Unfortunately, I'm on a Windows machine.

Thanks in advance.

EDIT: data.table is extremely fast, but I can't get it to compute the sequence numbers correctly. This is what my ddply version produced. Most groups have only one entry, but some have 2 rows, 3 rows, etc.

    > with(test.plyr, table(seqnum))
    seqnum
        1     2     3     4     5 
    24272  4950   681    88     9 

And using data.table as shown below, the same approach gives:

    > with(test.dt, table(V1))
    V1
        1 
    24272 
1 answer

Use data.table

    library(data.table)
    dt = data.table(df)
    test.dt = dt[,.N,"id,term"]
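Note that the grouped .N call above returns one row per (id, term) group with the group size, rather than a running number for each record. If the goal is a per-row sequence number matching the plyr output, a minimal sketch along the same lines (not benchmarked here) would be:

    library(data.table)
    dt = data.table(df)
    # := adds a seqnum column by reference; seq_len(.N) numbers the rows within each group
    dt[, seqnum := seq_len(.N), by = "id,term"]
    head(dt[, list(id, term, seqnum)])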

Here is a timing comparison. I used N = 3000 and replaced 17000 with 1700 when creating the data set.

    library(plyr)
    library(data.table)
    library(rbenchmark)

    f_plyr <- function(){
      test.plyr <- ddply(df, .(id, term), summarise,
                         seqnum = 1:length(id), .progress="text")
    }
    f_dt <- function(){
      dt = data.table(df)
      test.dt = dt[,.N,"id,term"]
    }
    benchmark(f_plyr(), f_dt(), replications = 10,
              columns = c("test", "replications", "elapsed", "relative"))

data.table speeds things up by a factor of about 170:

          test replications elapsed relative
    2   f_dt()           10   0.779    1.000
    1 f_plyr()           10 132.572  170.182

Also check out Hadley's latest work on dplyr. I would not be surprised if dplyr provides additional speedups, given that most of its code is implemented in C.
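As a rough sketch of what that could look like (assuming a recent version of dplyr; this was not part of the benchmark above, and test.dplyr is just an illustrative name):

    library(dplyr)
    test.dplyr <- df %>%
      group_by(id, term) %>%              # group by the same keys as the plyr/data.table versions
      mutate(seqnum = row_number()) %>%   # row_number() gives 1, 2, ... within each group
      ungroup()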

UPDATE: Edited the code, changing length(id) to .N per Matt's comments.

