R - calculate the number of elements over time using start and end dates

I want to calculate the number of items over time using start and end dates.

Some sample data

    START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
    END <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
    df <- data.frame(START, END)
    df

gives

           START        END
    1 2014-01-01 2014-01-04
    2 2014-01-02 2014-01-03
    3 2014-01-03 2014-01-03
    4 2014-01-03 2014-01-04

The table I am after counts, for each date, how many of these items are active on that date (that is, the date falls between the item's start and end dates, inclusive):

    DATETIME   COUNT
    2014-01-01     1
    2014-01-02     2
    2014-01-03     4
    2014-01-04     2

Can this be done with R, especially using dplyr? Many thanks.

+6
5 answers

This should do it. You can change the column names as needed.

    as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
    #         Var1 Freq
    # 1 2014-01-01    1
    # 2 2014-01-02    2
    # 3 2014-01-03    4
    # 4 2014-01-04    2

As noted in the comments, Var1 in the above solution is now a factor, not a date. To keep the Date class in the first column, you could either post-process the result above or use plyr::count instead of as.data.frame(table(...)):

    library(plyr)
    count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
    #            x freq
    # 1 2014-01-01    1
    # 2 2014-01-02    2
    # 3 2014-01-03    4
    # 4 2014-01-04    2
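The post-processing option mentioned above could look roughly like the following. This is a minimal base R sketch, not part of the original answer; the names `days` and `res` are illustrative, and the final column names are taken from the question.

```r
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# Expand every interval into its individual days (still a Date vector)
days <- Reduce(c, Map(seq, df$START, df$END, by = 1))

# table() turns the dates into labels, so keep them as character
# (stringsAsFactors = FALSE) and convert them back to Date afterwards
res <- as.data.frame(table(days), stringsAsFactors = FALSE)
res[[1]] <- as.Date(res[[1]])
names(res) <- c("DATETIME", "COUNT")
res
```

The counts are the same as before; only the class of the first column changes.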
+6

You can use data.table:

    library(data.table)
    DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][
      , list(COUNT = .N), by = DATETIME]
    DT
    #      DATETIME COUNT
    # 1: 2014-01-01     1
    # 2: 2014-01-02     2
    # 3: 2014-01-03     4
    # 4: 2014-01-04     2

In versions 1.9.4+, you can also use the foverlaps() function to perform an "overlap join". It is more efficient, since it does not need to expand the dates for each row first and then count. Here's how:

    require(data.table) ## 1.9.4
    setDT(df) ## convert your data.frame to data.table by reference

    ## 1. Some preprocessing:
    # create a lookup - the dates for which you need the count, and set key
    dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
    lookup = data.table(START=dates, END=dates, key=c("START", "END"))

    ## 2. Now find overlapping coordinates
    # for each row in `df` get all the rows it overlaps with in `lookup`
    ans = foverlaps(df, lookup, type="any", which=TRUE)

Now we just need to group by yid (the row indices in lookup) and count:

    ## 3. count
    ans[, .N, by=yid]
    #    yid N
    # 1:   1 1
    # 2:   2 2
    # 3:   3 4
    # 4:   4 2

The first column corresponds to the row numbers in lookup. If any row numbers are missing from the result, the count for those dates is 0.
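To turn the yid counts back into a table of dates, including dates whose count is 0, something like the following sketch could be used. It is not part of the original answer: the lookup range is extended by one day (2014-01-05) purely to show a zero, and `counts` and `res` are illustrative names.

```r
library(data.table)

df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Lookup deliberately runs one day past the data, so the last date has no overlaps
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-05"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

ans    <- foverlaps(df, lookup, type = "any", which = TRUE)
counts <- ans[, .N, by = yid]

# Start every date at 0, then fill in the rows of lookup that were matched
res <- data.table(DATETIME = dates, COUNT = 0L)
res[counts$yid, COUNT := counts$N]
res
```

Because yid is a row index into lookup, it can be used directly as a row index into any table built from the same `dates` vector.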

+2

Using dplyr and grouped data:

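    library(dplyr)

    # Note: data_frame() and as_data_frame() are deprecated in current tibble
    # releases; tibble() and as_tibble() are their replacements.
    data_frame(
      START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
      END = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
    ) -> df

    # Duplicate the sample data under two groups, 'a' and 'b'
    rbind(cbind(group = 'a', df), cbind(group = 'b', df)) %>% as_data_frame -> df
    df

    # Expand and count the dates separately within each group
    df %>%
      group_by(., group) %>%
      do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))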

This is a common problem when, for example, you want to find the number of logins per page, machine, etc., given the time intervals during which each user was active.

    > df
    Source: local data frame [8 x 3]

      group      START        END
      (chr)     (date)     (date)
    1     a 2014-01-01 2014-01-04
    2     a 2014-01-02 2014-01-03
    3     a 2014-01-03 2014-01-03
    4     a 2014-01-03 2014-01-04
    5     b 2014-01-01 2014-01-04
    6     b 2014-01-02 2014-01-03
    7     b 2014-01-03 2014-01-03
    8     b 2014-01-03 2014-01-04
    >
    > df %>%
    +   group_by(.,group) %>%
    +   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
    Source: local data frame [8 x 3]
    Groups: group [2]

      group       Var1  Freq
      (chr)     (fctr) (int)
    1     a 2014-01-01     1
    2     a 2014-01-02     2
    3     a 2014-01-03     4
    4     a 2014-01-04     2
    5     b 2014-01-01     1
    6     b 2014-01-02     2
    7     b 2014-01-03     4
    8     b 2014-01-04     2
+1

Using dplyr and foreach:

    library(dplyr)
    library(foreach)

    df <- data.frame(START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
                     END = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04")))
    df

    r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
                 .combine = rbind) %do% {
      df %>%
        filter(DATETIME >= START & DATETIME <= END) %>%
        summarise(DATETIME, COUNT = n())
    }
    r
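The same per-day filtering idea can be sketched in base R without foreach. This is an illustrative variant rather than the answer's own code; `days` and `counts` are names introduced here.

```r
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Every day between the earliest start and the latest end
days <- seq(min(df$START), max(df$END), by = 1)

# For each day, count the rows whose [START, END] interval contains it
counts <- vapply(
  seq_along(days),
  function(i) sum(df$START <= days[i] & days[i] <= df$END),
  integer(1)
)

r <- data.frame(DATETIME = days, COUNT = counts)
r
```

Like the foreach version, this scans all rows once per day, so the work grows with the number of days times the number of rows.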
0

I just posted another lubridate-based solution, which is faster for large data frames with a wide date range, in a new, related SO question here
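For large inputs with a wide date range, one way to avoid expanding every interval into its individual days is to count starts and ends per day and take a running difference. The sketch below uses base R only and is not the lubridate solution referred to above; `days`, `starts`, `ends`, and `res` are illustrative names.

```r
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

days <- seq(min(df$START), max(df$END), by = 1)
n    <- length(days)

# Number of intervals starting and ending on each day (1-based day offsets)
starts <- tabulate(as.integer(df$START - days[1]) + 1L, nbins = n)
ends   <- tabulate(as.integer(df$END   - days[1]) + 1L, nbins = n)

# Active on day d = started on or before d, minus ended strictly before d
# (END is inclusive, so the running total of ends is shifted by one day)
COUNT <- cumsum(starts) - c(0L, head(cumsum(ends), -1))

res <- data.frame(DATETIME = days, COUNT = COUNT)
res
```

This touches each row a constant number of times and each day once, instead of generating one element per row per day.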

0

Source: https://habr.com/ru/post/1204386/

