Here is another idea, starting with a tabulation of "user" by "mth":
table(dt[c("user", "mth")]) > 0L
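For illustration, here is a minimal sketch of that dense tabulation. The real "dt" is not shown, so the data below are made up, chosen only to match the output printed in this answer:

```r
# Hypothetical sample data; the actual `dt` is an assumption here.
dt = data.frame(
  user = c(123, 129, 145, 180, 180, 184),
  mth  = c("2010-01", "2010-02", "2010-03", "2010-01", "2010-03", "2010-02")
)

# Dense logical matrix: TRUE where a user has at least one row in a month.
dense = table(dt[c("user", "mth")]) > 0L
dense
```

Every user/month cell is stored explicitly here, including the empty ones.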
Assuming this route is likely to lead to memory problems, we could start with a sparse alternative:
library(Matrix)
tab = as(xtabs( ~ user + mth, dt, sparse = TRUE) > 0L, "TsparseMatrix")
tab
#5 x 3 sparse Matrix of class "lgTMatrix"
#    2010-01 2010-02 2010-03
#123       |       .       .
#129       .       |       .
#145       .       .       |
#180       |       .       |
#184       .       |       .
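The triplet ("TsparseMatrix") form stores only the non-empty cells as row/column index pairs. A small sketch on a toy matrix (the matrix and its dimnames are made up for illustration) shows how the zero-based "i" and "j" slots map back to names:

```r
library(Matrix)

# Hypothetical 2 x 3 toy matrix with TRUE at (a, z), (b, x) and (b, y).
m = as(sparseMatrix(i = c(1, 2, 2), j = c(3, 1, 2), x = TRUE,
                    dims = c(2, 3),
                    dimnames = list(c("a", "b"), c("x", "y", "z"))),
       "TsparseMatrix")

m@i  # zero-based row indices of the stored entries
m@j  # zero-based column indices of the stored entries

# Adding 1L turns the slots into ordinary R indices.
entries = data.frame(row = rownames(m)[m@i + 1L],
                     col = colnames(m)[m@j + 1L])
entries
```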
Then, having the "mth" (as a column index) in which each "user" first appeared (the "i" and "j" slots of a "TsparseMatrix" are zero-based, hence the "+ 1L"):
tapply( tab@j , rownames(tab)[ tab@i + 1L], min) + 1L
we can find the number of new entries per "mth":
new = setNames(tabulate(tapply( tab@j , rownames(tab)[ tab@i + 1L], min) + 1L, ncol(tab)), colnames(tab))
new
and the cumulative number of new entries:
totNew = cumsum(new)
totNew
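On small data, these steps can be cross-checked with a dense matrix in base R. This is only a sketch: the matrix below is a hand-typed copy of the printed "tab", and "which(arr.ind = TRUE)" stands in for the sparse "i"/"j" slots:

```r
# Hypothetical dense copy of the printed `tab` (rows = users, cols = months).
dense = matrix(c(TRUE,  FALSE, FALSE, TRUE,  FALSE,
                 FALSE, TRUE,  FALSE, FALSE, TRUE,
                 FALSE, FALSE, TRUE,  TRUE,  FALSE),
               nrow = 5,
               dimnames = list(c("123", "129", "145", "180", "184"),
                               c("2010-01", "2010-02", "2010-03")))

# One-based (row, col) positions of the TRUE cells.
idx = which(dense, arr.ind = TRUE)

# First month (column index) in which each user appears.
first = tapply(idx[, "col"], rownames(dense)[idx[, "row"]], min)

# Count first appearances per month, then accumulate.
new = setNames(tabulate(first, nbins = ncol(dense)), colnames(dense))
totNew = cumsum(new)
new
totNew
```

Passing the number of columns to "tabulate" keeps a zero for any month that has no new users.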
And, subtracting the number of "user"s per "mth" that exist both in that "mth" and in the previous one:
setNames(colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab), colnames(tab))
#2010-01 2010-02 2010-03 
#      0       0       0 
from the number of users per month:
colSums(tab)
we get:
notLast = colSums(tab) - colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab)
notLast
#2010-01 2010-02 2010-03 
#      2       2       2 
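The shift-and-compare step can be sketched in base R as well. As before, the dense matrix is a hand-typed copy of the printed "tab" and is assumed only for illustration:

```r
# Hypothetical dense copy of the printed `tab` (rows = users, cols = months).
dense = matrix(c(TRUE,  FALSE, FALSE, TRUE,  FALSE,
                 FALSE, TRUE,  FALSE, FALSE, TRUE,
                 FALSE, FALSE, TRUE,  TRUE,  FALSE),
               nrow = 5,
               dimnames = list(c("123", "129", "145", "180", "184"),
                               c("2010-01", "2010-02", "2010-03")))

# Shift the columns one month to the right; the first month has no predecessor.
prev = cbind(FALSE, dense[, -ncol(dense), drop = FALSE])

# Users present in a month and also in the month before it.
both = setNames(colSums(prev & dense), colnames(dense))

# Users present in a month who were absent in the month before it.
notLast = colSums(dense) - both
both
notLast
```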
One way to get the desired result could be:
merge(dt, data.frame(mth = names(new), new, totNew, notLast), by = "mth")
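To show what the merge produces, here is a self-contained sketch. Both "dt" and the per-month values are typed in by hand to match the output printed earlier, so they are assumptions, not the original data:

```r
# Hypothetical sample data consistent with the printed `tab`.
dt = data.frame(
  user = c(123, 129, 145, 180, 180, 184),
  mth  = c("2010-01", "2010-02", "2010-03", "2010-01", "2010-03", "2010-02")
)

# Per-month statistics, one row per month (values taken from the steps above).
stats = data.frame(
  mth     = c("2010-01", "2010-02", "2010-03"),
  new     = c(2L, 2L, 1L),
  totNew  = c(2L, 4L, 5L),
  notLast = c(2, 2, 2)
)

# Each row of `dt` picks up the statistics of its month.
res = merge(dt, stats, by = "mth")
res
```

Note that "merge" sorts the result by the "by" column by default, so the original row order of "dt" is not necessarily preserved.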