Efficiently selecting the top number of rows for each unique value of a column in a data.frame

I am trying to take a subset of a data frame based on the value of one of its columns. This is best explained by the example below. The question is closely related to: Selecting the top N rows for each unique column value in a data frame in R. However, I want to vary the number of elements selected by the head() command.

    # Sample data
    input <- matrix( c( 1000001 , 1000001 , 1000001 , 1000001 , 1000001 , 1000001 ,
                        1000002 , 1000002 , 1000002 , 1000003 , 1000003 , 1000003 ,
                        100001 , 100002 , 100003 , 100004 , 100005 , 100006 ,
                        100002 , 100003 , 100007 , 100002 , 100003 , 100008 ,
                        "2011-01-01" , "2011-01-02" , "2011-01-01" , "2011-01-04" ,
                        "2011-01-01" , "2011-01-02" , "2011-01-01" , "2011-01-04" ,
                        "2011-01-01" , "2011-01-02" , "2011-01-01" , "2011-01-04" ) ,
                     ncol = 3 )
    colnames(input) <- c( "Product" , "Something" , "Date" )
    input <- as.data.frame(input)
    input$Date <- as.Date( input[ , "Date" ] , "%Y-%m-%d" )

    # Sort by date; I want to leave out the entries with the oldest dates
    input <- input[ with( input , order(Date) ) , ]

    # Number of rows I want to select per product
    table_input <- as.data.frame( table(input$Product) )
    table_input$twentyfive <- ceiling( table_input$Freq * 0.25 )

    # This next part is very time consuming (I have 2 million rows, 90k different products)
    first <- TRUE
    for( i in table_input$Var1 ) {
      data_selected <- input[ input$Product == i , ]
      number <- table_input[ table_input$Var1 == i , ]$twentyfive
      head <- head( data_selected , number )
      if( first == FALSE ) {
        output <- rbind( output , head )
      } else {
        output <- head
      }
      first <- FALSE
    }

Hope someone knows a better, more efficient way. I tried to use the split function from the answer here: Selecting the top N rows for each unique column value in a data frame in R, to split by product and then iterate over the pieces and select with head(). However, the split function always runs out of memory (cannot allocate ..)

    input_split <- split( input , input$Product ) # Works here, but not on my real problem.

So, in the end, my problem is that I want to select a different number of rows for each unique product: here, 2 rows for product 1000001 (ceiling(6 * 0.25)) and 1 row each for products 1000002 and 1000003 (ceiling(3 * 0.25)).

2 answers

Two solutions: plyr::ddply is designed for your needs, but using data.table will be waaaaay faster.

You want to split the data.frame into pieces, remove the first 25% of the rows of each piece (the oldest dates, since each piece is sorted by date), and recombine them into a data.frame. That can be accomplished in one simple line...

    require( plyr )
    ddply( input , .(Product) , function(x) x[ -c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #  Product Something       Date
    #1 1000001    100005 2011-01-01
    #2 1000001    100002 2011-01-02
    #3 1000001    100006 2011-01-02
    #4 1000001    100004 2011-01-04
    #5 1000002    100007 2011-01-01
    #6 1000002    100003 2011-01-04
    #7 1000003    100002 2011-01-02
    #8 1000003    100008 2011-01-04
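If plyr is still too slow on 2 million rows, a fully vectorised base R route is also possible; it never split()s the whole data.frame into 90k pieces (only an integer vector is grouped internally), which is what exhausts memory. A minimal sketch, assuming input is already sorted by Date as in the question:

    # Within-group position and group size, computed in the current row order
    pos  <- ave( seq_along( input$Product ) , input$Product , FUN = seq_along )
    size <- ave( seq_along( input$Product ) , input$Product , FUN = length )
    # Drop the first 25% of rows (the oldest dates) of each product
    output <- input[ pos > ceiling( size * 0.25 ) , ]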

data.table solution

For the data.table solution you will need the latest development version from r-forge (the CRAN version of data.table does not yet support the negative indexing used here). Make sure you get it with the install.packages call below...

    install.packages( "data.table" , repos = "http://r-forge.r-project.org" )
    require( data.table )
    DT <- data.table( input )

    # Sort by Product then Date, very quickly
    setkeyv( DT , c( "Product" , "Date" ) )

    # Return the bottom 75% of rows (i.e. not the earliest dates)
    DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
    #    Product Something       Date
    #1:  1000001    100005 2011-01-01
    #2:  1000001    100002 2011-01-02
    #3:  1000001    100006 2011-01-02
    #4:  1000001    100004 2011-01-04
    #5:  1000002    100007 2011-01-01
    #6:  1000002    100003 2011-01-04
    #7:  1000003    100002 2011-01-02
    #8:  1000003    100008 2011-01-04

A better way to use data.table

You could easily do this instead (so you don't need the development version of data.table)...

 DT[ , .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ] 
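That drops the first 25% of rows per product, matching the ddply result above. If you instead want to keep the first 25% (what the head() loop in the question returns), the same idiom should work with a positive index; a sketch:

    DT[ , .SD[ 1:ceiling( .25 * .N ) ] , by = Product ]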

You can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with 90,000 products (groups)...

    set.seed(1)
    Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
    dates   <- sample( 1:20 , 2e6 , repl = TRUE )
    Date    <- as.Date( Sys.Date() + dates )
    DT <- data.table( Product = Product , Date = Date )

    system.time( {
      setkeyv( DT , c( "Product" , "Date" ) )
      DT[ , lapply( .SD , `[` , -c( 1:ceiling( .25 * .N ) ) ) , by = Product ]
    } )
    #   user  system elapsed
    #  14.65    0.03   14.74

Update: the best way to use data.table!

So, thanks to @Arun (now an author of the data.table package), we have the best way to use data.table: take .I, the integer vector of row indices for each group; subset it to remove the first 25% of records with -(1:ceiling(.N*.25)); and then subset the whole table with the resulting row indices to get the final table. This is about 4-5 times faster than my .SD method above. Amazing stuff!

    system.time( DT[ DT[ , .I[ -(1:ceiling(.N*.25)) ] , by = Product ]$V1 ] )
    #   user  system elapsed
    #   3.02    0.00    3.03
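Both data.table routes should return the same rows. A minimal sanity check, reusing the keyed DT from the timing runs above:

    res_SD <- DT[ , .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ]
    res_I  <- DT[ DT[ , .I[ -(1:ceiling(.N*.25)) ] , by = Product ]$V1 ]
    # Compare as plain data.frames to ignore key / internal attributes
    identical( as.data.frame( res_SD ) , as.data.frame( res_I ) )  # should be TRUE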

Here is an example using mapply with your input and table_input:

    # (input and table_input created exactly as in the question above)

    # function to "mapply" over "table_input"
    fun = function(p, d) { grep( p , input$Product )[ 1:d ] }

    # subset "input"
    input[ unlist( mapply( fun , table_input$Var1 , table_input$twentyfive ) ) , ]
    #    Product Something       Date
    #1  1000001    100001 2011-01-01
    #3  1000001    100003 2011-01-01
    #7  1000002    100002 2011-01-01
    #11 1000003    100003 2011-01-01
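One caveat: grep() treats p as a regular expression, so a product ID that is a prefix of another (say 100000 and 1000001) would match rows from both. An exact-match variant of the helper, as a sketch:

    fun2 = function(p, d) { which( as.character( input$Product ) == as.character( p ) )[ 1:d ] }
    input[ unlist( mapply( fun2 , table_input$Var1 , table_input$twentyfive ) ) , ]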

I also used system.time and replicate to compare the speed of mapply against the alternatives from SimonO101's answer:

    # (SimonO101's code from the answer above was run first: require(plyr),
    #  the ddply call, and the keyed data.table DT)

    system.time( replicate( 10000 ,
      input[ unlist( mapply( fun , table_input$Var1 , table_input$twentyfive ) ) , ] ) )
    #   user  system elapsed
    #   5.29    0.00    5.29

    system.time( replicate( 10000 ,
      ddply( input , .(Product) , function(x) x[ -c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] ) ) )
    #   user  system elapsed
    #  43.48    0.03   44.04

    system.time( replicate( 10000 ,
      DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ] ) )
    #   user  system elapsed
    #  34.30    0.01   34.50

BUT: SimonO101's alternatives do not output the same thing as mapply, since I used mapply with the table_input you posted; I don't know whether that plays a role in the comparison. Also, the comparison may be flawed on my side; I only ran it because you mentioned speed. I'd really like @SimonO101 to look at this, in case I'm talking nonsense.
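Concretely, the outputs differ because fun keeps the first d rows of each product (like the head() loop in the question), while the plyr and data.table answers drop exactly those rows. To compare like with like, the helper could drop them instead; a sketch:

    fun3 = function(p, d) { which( as.character( input$Product ) == as.character( p ) )[ -(1:d) ] }
    input[ unlist( mapply( fun3 , table_input$Var1 , table_input$twentyfive ) ) , ]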



