How to create a rank variable under certain conditions?

Question

My data contains a time variable and the chosen brand, as shown below. shoptime is the time of purchase, and chosenbrand is the brand bought at that time.

Using this data, I would like to create rank variables as shown in the third column, fourth column, etc.

The brand rank (e.g. brand1 to brand3) should be based on the previous 36 hours. Thus, to calculate the ranks for the second row, which has the shoptime "2013-09-01 08:54:00 UTC", the ranks should be based on all chosenbrand values within the 36 hours before that time. (The brand1 purchase in the second row itself should not be counted.)

Therefore rank_brand1, rank_brand2, rank_brand3, rank_brand4, ... are my desired variables.

What if I also want to create rank_brand5, rank_brand6, and so on?

Is there an easy way?

In addition, if I want to do this per customer (each customer having their own purchase history), how can I do this?

The data is given below.

          shoptime          chosenbrand  rank_brand1 rank_brand2 rank_brand3, ...
  2013-09-01 08:35:00 UTC      brand1          NA         NA          NA
  2013-09-01 08:54:00 UTC      brand1          1          NA          NA
  2013-09-01 09:07:00 UTC      brand2          1          2          NA
  2013-09-01 09:08:00 UTC      brand3          1          2          3
  2013-09-01 09:11:00 UTC      brand5          1          2          3
  2013-09-01 09:14:00 UTC      brand2          1          2          3
  2013-09-01 09:26:00 UTC      brand6          1          1          3
  2013-09-01 09:26:00 UTC      brand2          1          1          3
  2013-09-01 09:29:00 UTC      brand2          2          1          3
  2013-09-01 09:32:00 UTC      brand4          2          1          3

Here is the code for the data

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                           "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                  rank_brand3 = NA,
                  stringsAsFactors = FALSE)

r dataframe data.table dplyr plyr

John legend2 Dec 13 '17 at 17:31

2 answers

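One approach is to loop over the rows (a for loop) and count how often each brand occurs within the window. Based on the data provided by the OP: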

library(dplyr)

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                  rank_brand3 = NA,
                  stringsAsFactors = FALSE)

# Write a function that takes a data.frame and calculates the ranks
Calculate.Rank <- function(x){
  #loop through each row and calculate count for each brand 
  for(i in 1:nrow(x)){
    #DateTime of the current row. 
    currentrow.time <- as.POSIXlt(x$shoptime[i])
    #calculate number of times brand1 appears
    x$rank_brand1[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand1" ))
    #calculate number of times brand2 appears
    x$rank_brand2[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand2" ))    
    #calculate number of times brand3 appears
    x$rank_brand3[i] <- nrow(filter(x, as.POSIXlt(shoptime) <= currentrow.time & as.POSIXlt(shoptime) >= (currentrow.time-36*60*60) & chosenbrand == "brand3" ))

    # Replace the 0 counts with NA. I don't think this is the right approach, as those counts could simply be treated as 0.

    if(x$rank_brand1[i] == 0 ){
      x$rank_brand1[i] = NA
    }

    if(x$rank_brand2[i] == 0 ){
      x$rank_brand2[i] = NA
    }
    if(x$rank_brand3[i] == 0 ){
      x$rank_brand3[i] = NA
    }    

  }

  # The counts of brand1, brand2 and brand3 are now available. Let's calculate the ranks.
  new.x <- data.frame(x[,1:2], t(apply(-x[,3:5], 1, rank, ties.method='min', na.last = "keep")))

  print(new.x)
}

Calculate.Rank(dat)

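The resulting data.frame new.x is shown below. Because the filter uses <=, the current purchase is counted in its own window, so row 1 already has rank 1 for brand1.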

                shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3
1  2013-09-01 08:35:00 UTC      brand1           1          NA          NA
2  2013-09-01 08:54:00 UTC      brand1           1          NA          NA
3  2013-09-01 09:07:00 UTC      brand2           1           2          NA
4  2013-09-01 09:08:00 UTC      brand3           1           2           2
5  2013-09-01 09:11:00 UTC      brand5           1           2           2
6  2013-09-01 09:14:00 UTC      brand2           1           1           3
7  2013-09-01 09:26:00 UTC      brand6           2           1           3
8  2013-09-01 09:26:00 UTC      brand2           2           1           3
9  2013-09-01 09:29:00 UTC      brand2           2           1           3
10 2013-09-01 09:32:00 UTC      brand4           2           1           3
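
The loop above hard-codes three brands and includes the current purchase in its own window. As a minimal sketch only, the same idea can be written in base R for any number of brands, with the window taken as the 36 hours strictly before each purchase. The helper name rank_window() and its arguments are hypothetical, and the " UTC" suffix is dropped from the sample strings because the time zone is set explicitly when parsing.

```r
# Sample data from the question (input columns only, " UTC" suffix dropped)
dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00", "2013-09-01 08:54:00", "2013-09-01 09:07:00",
               "2013-09-01 09:08:00", "2013-09-01 09:11:00", "2013-09-01 09:14:00",
               "2013-09-01 09:26:00", "2013-09-01 09:26:00", "2013-09-01 09:29:00",
               "2013-09-01 09:32:00"),
  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2",
                  "brand6", "brand2", "brand2", "brand4"),
  stringsAsFactors = FALSE)

# rank_window() is a hypothetical helper, not part of the answer above
rank_window <- function(time, brand, hours = 36) {
  time   <- as.POSIXct(time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  brands <- sort(unique(brand))
  out <- matrix(NA_integer_, nrow = length(time), ncol = length(brands),
                dimnames = list(NULL, paste0("rank_", brands)))
  for (i in seq_along(time)) {
    # Purchases in [time[i] - hours, time[i]); the current row is excluded
    in_window <- time < time[i] & time >= time[i] - hours * 3600
    # Number of purchases per brand inside the window
    counts <- tabulate(match(brand[in_window], brands), nbins = length(brands))
    seen <- counts > 0
    # Rank 1 = most frequent; brands with no purchase in the window stay NA
    out[i, seen] <- rank(-counts[seen], ties.method = "min")
  }
  data.frame(shoptime = time, chosenbrand = brand, out)
}

res <- rank_window(dat$shoptime, dat$chosenbrand)
```

Here res gets one rank_ column per brand found in the data (rank_brand1 to rank_brand6), and the first row is all NA because nothing was bought before it. To do this per customer, the same function could presumably be applied to each customer's rows separately, for example with split() and lapply(), assuming the data had a customer column.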

MKR Dec 13 '17 at 22:36

Uwe · Accepted Answer · 2017-12-14T08:42:08+0000

It's complicated. The solution below uses a non-equi join to aggregate over the 36-hour windows, dcast() to reshape from long to wide format, and a second join with the original dat. It works for an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]

               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4
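
The non-equi join used in this solution may be easier to follow on a toy table. The sketch below uses made-up numeric times and hypothetical table names (purchases, windows). It only shows how by = .EACHI counts the rows of the left table that fall into each window [lb, ub).

```r
library(data.table)

# Toy data: purchase times as plain numbers (hypothetical values)
purchases <- data.table(t = c(1, 5, 8), brand = c("a", "b", "a"))

# One window [lb, ub) per row
windows <- data.table(lb = c(0, 4), ub = c(2, 9))

# For each window, count the purchases with lb <= t < ub
res <- purchases[windows, on = .(t >= lb, t < ub), .N, by = .EACHI]
res$N  # 1 2
```

In the result, the two join columns are both named t and hold the window bounds lb and ub, not the original purchase times.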

Explanation

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

returns the aggregated results for each 36-hour window (the two shoptime columns hold the lower and the upper bound of the window):

               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank
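
Note that the two answers treat ties differently: the loop-based answer calls rank() with ties.method = 'min', while this one calls frank() with ties.method = "dense". As a small illustration, the counts below are copied from the window ending at 09:26 in the table above.

```r
library(data.table)

# Purchase counts in the window ending at 09:26
n <- c(brand1 = 2, brand2 = 2, brand3 = 1, brand5 = 1)

r_min   <- rank(-n, ties.method = "min")     # 1 1 3 3: a tie leaves a gap
r_dense <- frank(-n, ties.method = "dense")  # 1 1 2 2: no gaps after ties
```

Base rank() has no "dense" option, which is one reason to use frank() here if gap-free ranks are wanted.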

Then this intermediate result is converted from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]

              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3

Data: dat as used here, without the rank columns and with the brand column named brand:

dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
  brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
  stringsAsFactors = FALSE)
