Marking Unique Values in R

My data looks like this:

data <- matrix(c("1","install","2015-10-23 14:07:20.000000",
                 "2","install","2015-10-23 14:08:20.000000",
                 "3","install","2015-10-23 14:07:25.000000",
                 "3","sale","2015-10-23 14:08:20.000000",
                 "4","install","2015-10-23 14:07:20.000000",
                 "4","sale","2015-10-23 14:09:20.000000",
                 "4","sale","2015-10-23 14:11:20.000000"),
               ncol=3, byrow=TRUE)
colnames(data) <- c("id","event","time")

I would like to add a fourth column called label, whose value for each row depends on that row's identifier. In this case:

  • label "0" if the identifier is unique
  • label "1" if the identifier is not unique and is associated with 1 sale
  • label "2" if the identifier is not unique and is associated with two sales

etc. up to n sales.

It should look like this:

data1 <- matrix(c("1","install","2015-10-23 14:07:20.000000","0",
                  "2","install","2015-10-23 14:08:20.000000","0",
                  "3","install","2015-10-23 14:07:25.000000","1",
                  "3","sale","2015-10-23 14:08:20.000000","1",
                  "4","install","2015-10-23 14:07:20.000000","2",
                  "4","sale","2015-10-23 14:09:20.000000","2",
                  "4","sale","2015-10-23 14:11:20.000000","2"),
                 ncol=4, byrow=TRUE)

I don't understand what the best approach in R is to create "tags" based on conditions ... maybe dplyr::mutate?

3 answers

Updated to reflect the "etc. up to n sales" requirement.

dplyr:

library(dplyr)
data <- as.data.frame(data)
data %>% 
  group_by(id) %>% 
  mutate(label = if(n() == 1) 0 else as.numeric(sum(event == "sale")))

#Source: local data frame [7 x 4]
#Groups: id [4]
#
#      id   event                       time label
#  (fctr)  (fctr)                     (fctr) (dbl)
#1      1 install 2015-10-23 14:07:20.000000     0
#2      2 install 2015-10-23 14:08:20.000000     0
#3      3 install 2015-10-23 14:07:25.000000     1
#4      3    sale 2015-10-23 14:08:20.000000     1
#5      4 install 2015-10-23 14:07:20.000000     2
#6      4    sale 2015-10-23 14:09:20.000000     2
#7      4    sale 2015-10-23 14:11:20.000000     2

data.table:

library(data.table)
data <- as.data.table(data)  # or setDT(data) if it is already a data.frame
data[, label := if(.N == 1) 0 else as.numeric(sum(event == "sale")), by=id]

base R:

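We can sum the "sale" events for each id with ave. The ids that occur only once are then flagged in uniq and their value is set to "0". Finally, cbind.data.frame attaches the result to the data as a new column.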

indx <- ave(data[,2], data[,1], FUN=function(x) sum(x == "sale"))  # sales per id, repeated on each row
uniq <- table(data[,1]) == 1                                       # TRUE for ids that occur exactly once
indx[data[,1] %in% names(uniq)[uniq]] <- "0"                       # match unique ids by name, not by position
cbind.data.frame(data, label=indx)
#   id   event                       time label
# 1  1 install 2015-10-23 14:07:20.000000     0
# 2  2 install 2015-10-23 14:08:20.000000     0
# 3  3 install 2015-10-23 14:07:25.000000     1
# 4  3    sale 2015-10-23 14:08:20.000000     1
# 5  4 install 2015-10-23 14:07:20.000000     2
# 6  4    sale 2015-10-23 14:09:20.000000     2
# 7  4    sale 2015-10-23 14:11:20.000000     2
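One caveat with flagging unique ids from a table(): it is safer to compare against the names of the table than against positions, because ids are not always 1, 2, 3, ... The following is only a sketch with made-up ids (a7, b2, c9 are hypothetical and not from the question), using base R alone:

```r
# Hypothetical ids that are not consecutive integers
id    <- c("a7", "b2", "b2", "c9")
event <- c("install", "install", "sale", "sale")

# Number of "sale" events per id, repeated on every row of that id
sales <- ave(as.integer(event == "sale"), id, FUN = sum)

# Ids that occur exactly once, looked up by name rather than by position
counts     <- table(id)
unique_ids <- names(counts)[counts == 1]

# Unique ids get 0 even if their single event is a sale (see "c9")
label <- ifelse(id %in% unique_ids, 0L, sales)
label
```

Here "c9" has one sale but appears only once, so it is labelled 0, which matches the rule stated in the question.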

Another dplyr option, assuming data is a data.frame:

library(dplyr)
# data must be a data.frame here, e.g. data <- as.data.frame(data)
data <- left_join(data,
                  data %>%
                    group_by(id) %>%
                    summarise(count = n(), sales = sum(event == "sale")),
                  by = "id") %>%
  mutate(label = ifelse(count == 1, 0, sales)) %>%
  select(-count, -sales)

> data
  id   event                       time label
1  1 install 2015-10-23 14:07:20.000000     0
2  2 install 2015-10-23 14:08:20.000000     0
3  3 install 2015-10-23 14:07:25.000000     1
4  3    sale 2015-10-23 14:08:20.000000     1
5  4 install 2015-10-23 14:07:20.000000     2
6  4    sale 2015-10-23 14:09:20.000000     2
7  4    sale 2015-10-23 14:11:20.000000     2
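
Whichever approach is used, the labels can be compared with the ones expected in the question. This is a rough, self-contained check in base R. The names result and expected are only illustrative, and result is rebuilt here as a stand-in so the snippet runs on its own:

```r
# Labels expected in the question, one per row
expected <- c(0, 0, 1, 1, 2, 2, 2)

# Stand-in for the output of any of the approaches (id and event columns only)
result <- data.frame(
  id    = c("1", "2", "3", "3", "4", "4", "4"),
  event = c("install", "install", "install", "sale", "install", "sale", "sale")
)

# Same rule as in the question: 0 for ids seen once, otherwise the number of sales
rows_per_id  <- ave(seq_along(result$id), result$id, FUN = length)
sales_per_id <- ave(as.numeric(result$event == "sale"), result$id, FUN = sum)
result$label <- ifelse(rows_per_id == 1, 0, sales_per_id)

# TRUE when every row carries the expected label
all(result$label == expected)
```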

Source: https://habr.com/ru/post/1614747/

