Here is an example that uses mapply with your input and table_input:
    #your code
    #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,
    #                   100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,
    #                   "2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02",
    #                   "2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
    #colnames(input) <- c( "Product" , "Something" ,"Date")
    #input <- as.data.frame(input)
    #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")

    #Sort based on date, I want to leave out the entries with the oldest dates.
    #input <- input[ with( input, order(Date)), ]

    #Create number of items I want to select
    #table_input <- as.data.frame(table(input$Product))
    #table_input$twentyfive <- ceiling( table_input$Freq*0.25 )

    #function to "mapply" on "table_input"
    fun = function(p, d) { grep(p, input$Product)[1:d] }

    #subset "input"
    input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]

       Product Something       Date
    1  1000001    100001 2011-01-01
    3  1000001    100003 2011-01-01
    7  1000002    100002 2011-01-01
    11 1000003    100003 2011-01-01
I also used system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:
    #SimonO101 code
    #require( plyr )
    #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
    #require( data.table )
    #DT <- data.table( input )
    #setkeyv( DT , c( "Product" , "Date" ) )
    #DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]

    > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
       user  system elapsed
       5.29    0.00    5.29
    > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
       user  system elapsed
      43.48    0.03   44.04
    > system.time(replicate(10000, DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ] ))
       user  system elapsed
      34.30    0.01   34.50
BUT: SimonO101's alternatives do not return the same thing as mapply, because I used mapply with the table_input you posted: the mapply call keeps the first twentyfive rows of each Product, while the ddply and data.table calls drop those rows and return the rest. I do not know whether this affects the comparison. The benchmark itself may also be flawed; I only ran it because you mentioned speed. I would really like @SimonO101 to take a look at this in case I am talking nonsense.
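To make the two kinds of output easier to compare, here is a minimal, self-contained sketch in base R. It is only an illustration under a few assumptions of mine: the data frame is rebuilt with stringsAsFactors = FALSE, the helper name first_rows is hypothetical, and which() with == replaces grep() so that product codes are matched exactly rather than as regular expressions. It computes the positions of the oldest rows once and then either keeps them (as the mapply call does) or drops them (as the ddply and data.table calls do).

```r
# Rebuild the example data from the question.
# Assumption: stringsAsFactors = FALSE, so Product stays a character column.
input <- data.frame(
  Product   = c(rep("1000001", 6), rep("1000002", 3), rep("1000003", 3)),
  Something = c("100001", "100002", "100003", "100004", "100005", "100006",
                "100002", "100003", "100007", "100002", "100003", "100008"),
  Date      = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04")),
  stringsAsFactors = FALSE
)

# Sort by date (oldest first), as in the question.
input <- input[order(input$Date), ]

# Number of rows per product that make up the oldest 25%.
table_input <- as.data.frame(table(input$Product), stringsAsFactors = FALSE)
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)

# Hypothetical helper: positions of the first d rows of product p.
# which(==) gives exact matches; grep() would also match partial codes.
first_rows <- function(p, d) which(input$Product == p)[seq_len(d)]

oldest <- unlist(
  mapply(first_rows, table_input$Var1, table_input$twentyfive, SIMPLIFY = FALSE),
  use.names = FALSE
)

kept    <- input[oldest, ]   # the oldest 25% per product (rows 1, 3, 7, 11)
dropped <- input[-oldest, ]  # everything else, i.e. the oldest 25% removed
```

With this layout, kept corresponds to what the mapply call returns, and dropped should contain the same rows as the ddply and data.table results, possibly in a different order. Timing input[-oldest, ] against those two calls would arguably be a fairer benchmark than timing input[oldest, ].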