Retrieve the most recent entry under a certain condition

I am trying to do the following:

Dataset example:

    belongID uniqID Time Rating
           1    101    5      0
           1    102    4      0
           2    103    4      0
           2    104    3      0
           2    105    2      5
           3    106    4      2
           3    107    5      0
           3    108    5      1

The problem is this: for each belongID I would like to retrieve the most recent record (the one with the highest value of Time), unless its Rating is 0. If the Rating of the most recent record is 0, I want the most recent record with a non-zero Rating instead (not the highest Rating, just the first one going back in time whose Rating is not 0). If all records for a belongID have a Rating of 0, the most recent one should be selected.

The end result should be:

    belongID uniqID Time Rating
           1    101    5      0
           2    105    2      5
           3    108    5      1

The dataset is quite large and is ordered by belongID, but not by Time, so within a belongID more recent entries can appear before or after older ones.

Without the "0 Rating" restriction, I used the following function to get the most recent entry:

    uniqueMax <- function(m, belongID = 1, time = 3) {
      t(vapply(
        split(1:nrow(m), m[, belongID]),
        function(i, x, time) x[i, , drop = FALSE][which.max(x[i, time]), ],
        m[1, ], x = m, time = time
      ))
    }

I don't know how to incorporate the "0 Rating" restriction.

EDIT: follow-up question:

Does anyone know how the getRating function should be changed if not just a Rating of 0, but several Ratings (for example 0, 1, 4 and 5) have to be taken into account? That is, entries with a Rating of 0, 1, 4 or 5 should only be selected last: if the most recent entry has one of those Ratings, the most recent entry with a different Rating is selected instead, and if all entries have a Rating of 0, 1, 4 or 5, the most recent one is selected. I tried the following, but it did not work:

    getRating <- function(x){
      iszero <- x$Rating == 0 | x$Rating == 1 | x$Rating == 4 | x$Rating == 5
      if(all(iszero)){
        id <- which.max(x$Time)
      } else {
        id <- which.max((!iszero) * x$Time)  # This trick guarantees taking the excluded ratings into account
      }
      x[id, ]
    }

    # Do this over the complete data frame
    do.call(rbind, lapply(split(Data, Data$belongID), getRating))  # edited per Tyler's suggestion
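
For reference, one way to generalize the exclusion is to pass the deprioritized ratings as an argument and test membership with %in%. This is only a minimal sketch, not from the original thread; the name getRatingExcl and the exclude argument are my own, and like the original trick it assumes Time values are strictly positive:

    # Generalized sketch: deprioritize every rating listed in `exclude`
    getRatingExcl <- function(x, exclude = c(0, 1, 4, 5)) {
      excluded <- x$Rating %in% exclude
      if (all(excluded)) {
        id <- which.max(x$Time)                 # all ratings excluded: take the most recent row
      } else {
        id <- which.max((!excluded) * x$Time)   # most recent row with a non-excluded rating
      }
      x[id, ]
    }

    # Applied per belongID, exactly as before:
    # do.call(rbind, lapply(split(Data, Data$belongID), getRatingExcl))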
4 answers

Here is a solution that uses data.table to simplify the filtering and runs my getRecentRow function separately for each belongID.

    library(data.table)

    # Load the data from the example.
    dat = structure(list(belongID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                         uniqID = 101:108,
                         Time = c(5L, 4L, 4L, 3L, 2L, 4L, 5L, 5L),
                         Rating = c(0L, 0L, 0L, 0L, 5L, 2L, 0L, 1L)),
                    .Names = c("belongID", "uniqID", "Time", "Rating"),
                    row.names = c(NA, -8L),
                    class = c("data.table", "data.frame"))
    dat = data.table(dat)  # Convert to data table.

    # Function to get the row for a given belongID
    getRecentRow <- function(data) {
        # Filter by Rating, then order by Time, then select the first row.
        row = data[Rating != 0][order(-Time)][1]
        if (!is.na(row$uniqID)) {
            # A row was found with Rating != 0, return it.
            return(row)
        } else {
            # The result was empty, so filter again without restricting Rating.
            return(data[order(-Time)][1])
        }
    }

    # Run getRecentRow on each chunk of dat with a given belongID
    result = dat[, getRecentRow(.SD), by = belongID]

         belongID uniqID Time Rating
    [1,]        1    101    5      0
    [2,]        2    105    2      5
    [3,]        3    108    5      1
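
As a side note (not part of the original answer), the same logic can also be written inline in data.table without a separate helper: order once by decreasing Time, then per belongID take the first row with a non-zero Rating, falling back to the first row of the group. A sketch under that assumption:

    # Alternative sketch: one ordering pass, then a per-group pick inside .SD
    result2 <- dat[order(-Time)][
        , .SD[if (any(Rating != 0)) which(Rating != 0)[1] else 1L],
        by = belongID]
    result2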

Here's my crack at this (interesting problem):

Reading in your data:

    m <- read.table(text="belongID uniqID Time Rating
     1        101    5    0
     1        102    4    0
     2        103    4    0
     2        104    3    0
     2        105    2    5
     3        106    4    2
     3        107    5    0
     3        108    5    1", header=TRUE)

Retrieving the rows you requested:

    m2 <- m[order(m$belongID, -m$Time), ]                     # Order to get max Time first
    LIST <- split(m2, m2$belongID)                            # split by belongID
    FUN <- function(x) which(cumsum(x[, 'Rating']) != 0)[1]   # find first non-zero Rating
    LIST2 <- lapply(LIST, function(x) {                       # apply FUN; if NA take the 1st row
        if (is.na(FUN(x))) {
            x[1, ]
        } else {
            x[FUN(x), ]
        }
    })
    do.call('rbind', LIST2)                                   # put it all back together

Because Ratings are non-negative and each group is sorted by decreasing Time, cumsum(Rating) first becomes non-zero at the most recent row with a non-zero Rating; if there is no such row, FUN returns NA and the first (most recent) row is kept. This gives:

      belongID uniqID Time Rating
    1        1    101    5      0
    2        2    105    2      5
    3        3    108    5      1

EDIT: With so many people answering this problem (fun to solve, IMHO), it begged for a microbenchmark comparison (on a Windows 7 machine):

    Unit: milliseconds
        expr       min        lq    median        uq      max
    1   JIGR  6.356293  6.656752  7.024161  8.697213 179.0884
    2 JORRIS  2.932741  3.031416  3.153420  3.552554 246.9604
    3  PETER 10.851046 11.459896 12.358939 17.164881 216.7284
    4  TYLER  2.864625  2.961667  3.066174  3.413289 221.1569
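
The benchmark call itself was not posted. It was presumably along these lines, with each answer wrapped in a function named after its author (the wrapper names are inferred from the expr column above; this is a reconstruction, not the original code):

    library(microbenchmark)

    # Presumed set-up: each solution wrapped in a function that takes the data frame `m`
    microbenchmark(
        JIGR   = JIGR(m),
        JORRIS = JORRIS(m),
        PETER  = PETER(m),
        TYLER  = TYLER(m),
        times  = 100
    )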

And the plot:

[plot of the microbenchmark timings]


One of the suggestions:

    library(plyr)

    maxV <- function(b) {
        if (b[which.max(b$Time), "Rating"] != 0) {
            return(b[which.max(b$Time), ])
        } else if (!all(b$Rating == 0)) {
            bb <- b[order(b$Rating), ]
            return(bb[bb$Rating != 0, ][1, ])
        } else {
            return(b[which.max(b$Time), ])
        }
    }

    a <- read.table(textConnection("
    belongID uniqID Time Rating
     1        101    5    0
     1        102    4    0
     2        103    4    0
     2        104    3    0
     2        105    2    5
     3        106    4    2
     3        107    5    0
     3        108    5    1
    "), header=TRUE)

    ddply(a, .(belongID), maxV)

      belongID uniqID Time Rating
    1        1    101    5      0
    2        2    105    2      5
    3        3    108    5      1

EDIT:

Since your main concern is speed, I edited the trick into my original uniqueMax solution, which gives something like this:

    uniqueMax <- function(m, belongID = 1, time = 3) {
      t(vapply(
        split(1:nrow(m), m[, belongID]),
        function(i, x, time) {
          is.zero <- x[i, 'Rating'] == 0
          if (all(is.zero)) is.zero <- FALSE
          x[i, , drop = FALSE][which.max(x[i, time] * (!is.zero)), ]
        },
        m[1, ], x = m, time = time
      ))
    }
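
A minimal usage sketch (my own, not from the original post), assuming the example data frame Data read in further below is first converted to a numeric matrix; uniqueMax indexes belongID and Time by column position (1 and 3 by default) and looks the Rating column up by name:

    # Assumes `Data` is the example data frame read in below; all columns are integer,
    # so as.matrix() yields a numeric matrix with the column names preserved.
    M <- as.matrix(Data)
    uniqueMax(M)   # returns one row per belongID, as a matrix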

My original solution, which is a bit more readable than the one above:

    # Get the rating per belongID
    getRating <- function(x){
      iszero <- x$Rating == 0
      if(all(iszero)){
        id <- which.max(x$Time)
      } else {
        id <- which.max((!iszero) * x$Time)  # This trick guarantees taking 0 into account
      }
      x[id, ]
    }

    # Do this over the complete data frame
    do.call(rbind, lapply(split(Data, Data$belongID), getRating))  # edited per Tyler's suggestion

Applied to the example data:

    tc <- textConnection('
    belongID uniqID Time Rating
     1        101    5    0
     1        102    4    0
     2        103    4    0
     2        104    3    0
     2        105    2    5
     3        106    4    2
     3        107    5    0
     3        108    5    1
    ')
    Data <- read.table(tc, header=TRUE)

    do.call(rbind, lapply(split(Data, Data$belongID), getRating))

this gives:

      belongID uniqID Time Rating
    1        1    101    5      0
    2        2    105    2      5
    3        3    108    5      1

EDIT: Just for fun, I also ran a benchmark (using rbenchmark) on the small dataset with 1000 replications and on a larger one with 10 replications:

Result:

    > benchmark(Joris(Data), Tyler(Data), uniqueMax(Data),
    +           columns=c("test", "elapsed", "relative"),
    +           replications=1000)
                 test elapsed relative
    1     Joris(Data)    1.20 1.025641
    2     Tyler(Data)    1.42 1.213675
    3 uniqueMax(Data)    1.17 1.000000

    > benchmark(Joris(Data2), Tyler(Data2), uniqueMax(Data2),
    +           columns=c("test", "elapsed", "relative"),
    +           replications=10)
                  test elapsed relative
    1     Joris(Data2)    3.63 1.174757
    2     Tyler(Data2)    4.02 1.300971
    3 uniqueMax(Data2)    3.09 1.000000

Here I simply wrapped our respective solutions in the functions Joris() and Tyler(), and constructed Data2 as follows:

    Data2 <- data.frame(
      belongID = rep(1:1000, each=10),
      uniqID   = 1:10000,
      Time     = sample(1:5, 10000, TRUE),
      Rating   = sample(0:5, 10000, TRUE)
    )
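
The Joris() and Tyler() wrappers themselves were not shown; they were presumably just thin function versions of the two solutions above, roughly like the following reconstruction (bodies reassembled from the posted answers, so treat them as an approximation):

    # Presumed wrappers used in the rbenchmark comparison (reconstructed)
    Joris <- function(Data) {
      do.call(rbind, lapply(split(Data, Data$belongID), getRating))
    }

    Tyler <- function(Data) {
      m2 <- Data[order(Data$belongID, -Data$Time), ]            # most recent first within belongID
      FUN <- function(x) which(cumsum(x[, 'Rating']) != 0)[1]   # first non-zero Rating
      LIST2 <- lapply(split(m2, m2$belongID),
                      function(x) if (is.na(FUN(x))) x[1, ] else x[FUN(x), ])
      do.call('rbind', LIST2)
    }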
