Counting new values that have not appeared before, and values that did not occur in the previous group

I am trying to count the number of unique "new" users per month, where a new user is one who has not appeared before (counting from the very beginning of the data). I am also trying to count the number of unique users per month who did not appear in the previous month.

The source data looks like

    library(dplyr)

    date <- c("2010-01-10","2010-02-13","2010-03-22","2010-01-11","2010-02-14","2010-03-23","2010-01-12","2010-02-14","2010-03-24")
    mth <- rep(c("2010-01","2010-02","2010-03"),3)
    user <- c("123","129","145","123","129","180","180","184","145")
    dt <- data.frame(date,mth,user)
    dt <- dt %>% arrange(date)
    dt
            date     mth user
    1 2010-01-10 2010-01  123
    2 2010-01-11 2010-01  123
    3 2010-01-12 2010-01  180
    4 2010-02-13 2010-02  129
    5 2010-02-14 2010-02  129
    6 2010-02-14 2010-02  184
    7 2010-03-22 2010-03  145
    8 2010-03-23 2010-03  180
    9 2010-03-24 2010-03  145

The answer should look like

    new <- c(2,2,2,2,2,2,1,1,1)
    totNew <- c(2,2,2,4,4,4,5,5,5)
    notLastMonth <- c(2,2,2,2,2,2,2,2,2)
    tmp <- cbind(dt,new,totNew,notLastMonth)
    tmp
            date     mth user new totNew notLastMonth
    1 2010-01-10 2010-01  123   2      2            2
    2 2010-01-11 2010-01  123   2      2            2
    3 2010-01-12 2010-01  180   2      2            2
    4 2010-02-13 2010-02  129   2      4            2
    5 2010-02-14 2010-02  129   2      4            2
    6 2010-02-14 2010-02  184   2      4            2
    7 2010-03-22 2010-03  145   1      5            2
    8 2010-03-23 2010-03  180   1      5            2
    9 2010-03-24 2010-03  145   1      5            2
4 answers

Here's an attempt (explanations inside the code body)

    dt %>%
      group_by(user) %>%
      mutate(Count = row_number()) %>% # Count appearances per user
      group_by(mth) %>%
      mutate(new = sum(Count == 1)) %>% # Count first appearances per month
      summarise(new = first(new), # Summarise new users per month (for cumsum)
                users = list(unique(user))) %>% # Create a list of unique users per month (for notLastMonth)
      mutate(totNew = cumsum(new), # Calculate overall cumulative sum of unique users
             notLastMonth = lengths(Map(setdiff, users, lag(users)))) %>% # Compare new users to previous month
      select(-users) %>%
      right_join(dt) # Join back to the real data

    # A tibble: 9 × 6
    #       mth   new totNew notLastMonth       date   user
    #    <fctr> <int>  <int>        <int>     <fctr> <fctr>
    # 1 2010-01     2      2            2 2010-01-10    123
    # 2 2010-01     2      2            2 2010-01-11    123
    # 3 2010-01     2      2            2 2010-01-12    180
    # 4 2010-02     2      4            2 2010-02-13    129
    # 5 2010-02     2      4            2 2010-02-14    129
    # 6 2010-02     2      4            2 2010-02-14    184
    # 7 2010-03     1      5            2 2010-03-22    145
    # 8 2010-03     1      5            2 2010-03-23    180
    # 9 2010-03     1      5            2 2010-03-24    145
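
The notLastMonth step pairs each month's user list with the list from the month before. A minimal base R sketch of that pairing, assuming the per-month users are typed in by hand from the sample data (the names `users` and `prev` are illustrative and not part of the answer's code):

```r
# Unique users per month, copied by hand from the sample data (assumption)
users <- list(
  "2010-01" = c("123", "180"),
  "2010-02" = c("129", "184"),
  "2010-03" = c("145", "180")
)

# Shift the list down by one month; the first month has no predecessor
prev <- c(list(NULL), head(users, -1))

# For each month, keep the users absent from the previous month and count them
notLastMonth <- lengths(Map(setdiff, users, prev))
notLastMonth
```

In the pipeline, `lag(users)` plays the role that `prev` plays here.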

Here is another idea, starting with a cross-tabulation of "user" by "mth":

 table(dt[c("user", "mth")]) > 0L 

Since this approach is likely to run into memory problems on larger data, we could start with a sparse alternative instead:

    library(Matrix)
    tab = as(xtabs( ~ user + mth, dt, sparse = TRUE) > 0L, "TsparseMatrix")
    tab
    #5 x 3 sparse Matrix of class "lgTMatrix"
    #    2010-01 2010-02 2010-03
    #123       |       .       .
    #129       .       |       .
    #145       .       .       |
    #180       |       .       |
    #184       .       |       .

Then, finding the "mth" (as a column index) in which each "user" first appeared:

    tapply(tab@j, rownames(tab)[tab@i + 1L], min) + 1L
    #123 129 145 180 184
    #  1   2   3   1   2

we can find the number of new entries per "mth":

    new = setNames(tabulate(tapply(tab@j, rownames(tab)[tab@i + 1L], min) + 1L, ncol(tab)), colnames(tab))
    new
    #2010-01 2010-02 2010-03
    #      2       2       1

and the cumulative number of new entries:

    totNew = cumsum(new)
    totNew
    #2010-01 2010-02 2010-03
    #      2       4       5

And, subtracting the number of "users" that appear both in a given "mth" and in the previous one:

    setNames(colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab), colnames(tab))
    #2010-01 2010-02 2010-03
    #      0       0       0

from the number of users per month:

    colSums(tab)
    #2010-01 2010-02 2010-03
    #      2       2       2

we get:

    notLast = colSums(tab) - colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab)
    notLast
    #2010-01 2010-02 2010-03
    #      2       2       2

One way to assemble the desired result is:

    merge(dt, data.frame(mth = names(new), new, totNew, notLast), by = "mth")
    #      mth       date user new totNew notLast
    #1 2010-01 2010-01-10  123   2      2       2
    #2 2010-01 2010-01-11  123   2      2       2
    #3 2010-01 2010-01-12  180   2      2       2
    #4 2010-02 2010-02-13  129   2      4       2
    #5 2010-02 2010-02-14  129   2      4       2
    #6 2010-02 2010-02-14  184   2      4       2
    #7 2010-03 2010-03-22  145   1      5       2
    #8 2010-03 2010-03-23  180   1      5       2
    #9 2010-03 2010-03-24  145   1      5       2
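
On small data, the sparse-matrix steps can be cross-checked against a dense logical matrix in base R. This is only a sketch under that assumption; the `_dense` names and `first_mth` are illustrative and do not come from the code above, and the dense matrix gives up the memory savings that motivated the sparse version.

```r
# Sample data, already sorted by date (copied from the question)
user <- c("123", "123", "180", "129", "129", "184", "145", "180", "145")
mth  <- rep(c("2010-01", "2010-02", "2010-03"), each = 3)

# Dense logical user-by-month matrix: TRUE where the user was active
tab_dense <- unclass(table(user, mth)) > 0L

# Column index of the first month in which each user appears
first_mth <- apply(tab_dense, 1, function(r) which(r)[1])

# New users per month and their running total
new_dense    <- setNames(tabulate(first_mth, ncol(tab_dense)), colnames(tab_dense))
totNew_dense <- cumsum(new_dense)

# Shift the columns right by one month, then subtract the overlap
prev_dense    <- cbind(FALSE, tab_dense[, -ncol(tab_dense), drop = FALSE])
notLast_dense <- colSums(tab_dense) - colSums(prev_dense & tab_dense)
```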

Since no one has posted it yet, here is my preferred way:

    library(zoo)

    dt <- dt %>% mutate(ym = as.yearmon(mth))

    ct_dt = dt %>%
      distinct(user, ym) %>%
      arrange(user, ym) %>%
      group_by(user) %>%
      mutate(last_ym = dplyr::lag(ym)) %>%
      group_by(ym) %>%
      summarise(
        new = sum(is.na(last_ym)),
        not_last_ym = sum(is.na(last_ym) | 12*(ym - last_ym) > 1)
      )

    # # A tibble: 3 x 3
    #              ym   new not_last_ym
    #   <S3: yearmon> <int>       <int>
    # 1      Jan 2010     2           2
    # 2      Feb 2010     2           2
    # 3      Mar 2010     1           2

From here, you can take cumsum(new) if you really need the totNew column, and you can left_join dt with ct_dt if you really want to see these per-month counts (redundantly) repeated across several rows.
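
A minimal sketch of those two follow-up steps, assuming dplyr is installed. To stay self-contained it uses a hand-typed stand-in for the monthly summary (called `ct` here, a hypothetical name) keyed by the character `mth` column rather than by the yearmon `ym` column:

```r
library(dplyr)

# Sample data from the question, sorted by date
dt <- data.frame(
  date = c("2010-01-10", "2010-01-11", "2010-01-12",
           "2010-02-13", "2010-02-14", "2010-02-14",
           "2010-03-22", "2010-03-23", "2010-03-24"),
  mth  = rep(c("2010-01", "2010-02", "2010-03"), each = 3),
  user = c("123", "123", "180", "129", "129", "184", "145", "180", "145"),
  stringsAsFactors = FALSE
)

# Hand-typed stand-in for the monthly summary (assumption: keyed by mth)
ct <- data.frame(
  mth         = c("2010-01", "2010-02", "2010-03"),
  new         = c(2L, 2L, 1L),
  not_last_ym = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Running total of new users, then one summary row repeated per data row
res <- dt %>%
  left_join(ct %>% mutate(totNew = cumsum(new)), by = "mth")
res
```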


Or with data.table ...

    library(zoo)
    library(data.table)
    setDT(dt)

    dt[, ym := as.yearmon(mth)]

    ct_dt = setorder(unique(dt[, .(user, ym)]))[,
      last_ym := shift(ym)
    , by=user][, .(
      new = sum(is.na(last_ym)),
      not_last_ym = sum(is.na(last_ym) | 12*(ym - last_ym) > 1)
    ), by=ym]

Here's a clean base R solution. It works best when the variables are not factors, and it assumes that the data is sorted by month.

    # get list of active monthly users
    activeUsers <- lapply(unique(dt$mth), function(i) unique(dt[dt$mth==i, "user"]))

    # get accumulating list of all users
    allUsers <- Reduce(union, activeUsers, accumulate=TRUE)

Now each month's users are stored in activeUsers, and the growing set of all users seen up to and including each month is stored in allUsers. With this information, we can easily calculate the first two variables.
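
To make the accumulation concrete, here is a small sketch of what `Reduce(union, ..., accumulate=TRUE)` returns, assuming the monthly user sets are typed in by hand from the sample data:

```r
# Active users per month, copied by hand from the sample data (assumption)
activeUsers <- list(
  c("123", "180"),   # 2010-01
  c("129", "184"),   # 2010-02
  c("145", "180")    # 2010-03
)

# Element k is the union of the first k monthly sets
allUsers <- Reduce(union, activeUsers, accumulate = TRUE)

# Size of the running set after each month
lengths(allUsers)
```

User "180" appears in both the first and third months but is counted only once, which is why the running set grows by one in the last month.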

    # get the calculations
    totNew <- lengths(allUsers)
    new <- c(totNew[1], diff(totNew))
    notLastMonth <- c(totNew[1], lengths(lapply(seq_along(activeUsers)[-1], function(i) setdiff(activeUsers[[i]], activeUsers[[i-1]]))))

The lengths function efficiently calculates the length of each list element. The second line uses diff to calculate the number of new users per month. The second and third lines prepend the first month's value (2), taken from totNew[1]. The third line is a bit more involved: it uses lapply and setdiff to build, for each month, the set of active users who were not active in the previous month, and lengths is used again to count them.
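
As a sanity check on the diff step, the monthly counts can be compared with a direct count of first appearances. This sketch swaps in a different technique, `duplicated`, purely as a cross-check; it assumes the rows are sorted by date and uses hand-typed values from the sample data (the `new_from_*` names are illustrative):

```r
# Sample data sorted by date (copied from the question)
user <- c("123", "123", "180", "129", "129", "184", "145", "180", "145")
mth  <- rep(c("2010-01", "2010-02", "2010-03"), each = 3)

# Running total of distinct users per month, typed in from the result above
totNew <- c(2L, 4L, 5L)

# New users per month as month-to-month growth of the running total
new_from_diff <- c(totNew[1], diff(totNew))

# Cross-check: count rows where a user is seen for the first time, per month
new_from_dup <- tapply(!duplicated(user), mth, sum)

all(new_from_diff == new_from_dup)
```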

    #merge on to data set
    merge(dt, data.frame(mth=unique(dt$mth), new=new, totNew=totNew, notLastMonth=notLastMonth), by="mth")
          mth       date user new totNew notLastMonth
    1 2010-01 2010-01-10  123   2      2            2
    2 2010-01 2010-01-12  180   2      2            2
    3 2010-01 2010-01-11  123   2      2            2
    4 2010-02 2010-02-13  129   2      4            2
    5 2010-02 2010-02-14  129   2      4            2
    6 2010-02 2010-02-14  184   2      4            2
    7 2010-03 2010-03-23  180   1      5            2
    8 2010-03 2010-03-22  145   1      5            2
    9 2010-03 2010-03-24  145   1      5            2

data

 dt <- data.frame(date,mth,user, stringsAsFactors=FALSE) 

Source: https://habr.com/ru/post/1262543/

