Calculate the date difference between rows taking into account different groups within the data frame

Question

Calculate the date difference between rows taking into account different groups within the data frame

I work with some processing information applied to animals in R First, I would like to describe the structure of my information (in the end I will add a version of dput() ). My DF data looks like this:

  Treatment_ID Start_Date Valid 1 0031 2011-05-01 2011-05-30 2 0031 2011-05-01 2011-06-30 3 0045 2012-02-01 2012-03-01 4 0057 2012-04-01 2012-04-30 5 0057 2012-04-01 2012-05-30 6 0098 2012-10-01 2012-10-30

It contains 56 lines and three variables Treatment_ID (5 types of treatment), Start_Date (Date of start of treatment) and Valid (End date of treatment). For example, Treatment_ID 0031 has two observations because it started in May 2011 and ended in June 2011. Then a new appeal 0045 began in February 2012 and ended in March 2012 (only one observation). The same structure applies to all groups within the DF . I need to calculate the difference between the months between each treatment and each treatment, using some conditions. I will use the first two procedures to show this:

  Treatment_ID Start_Date Valid 1 0031 2011-05-01 2011-05-30 2 0031 2011-05-01 2011-06-30 3 0045 2012-02-01 2012-03-01 4 0057 2012-04-01 2012-04-30

In this example, I have two lines for the first treatment, where the variable Treatment_ID is equal. When this happens, you need to calculate the difference in months for the Valid variable. When a new processing appears, it is necessary to calculate the difference in months between Start_Date and Valid . Note that when processing more than one case, the difference is obtained using the Valid variable for cases in this group, but when Treatment_ID changed, the difference must be obtained using the Start_Date and Valid variables. To get this Break_Months variable, I used the following structure:

 DF$Break_Months=NA for(i in c(2:(length(DF$Break_Months)))) { DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0), round(as.numeric(DF$Start_Date[i]-DF$Valid[i-1])/30,0)) }

This for , when Treatment_ID equal, calculates the difference between the actual line and the previous one with the Valid variable, and when they are different, the difference is calculated using Start_Date and Valid . The first Break_Months value is NA , because there is no previous value for comparison. The problem occurred at the end of DF when I used the previous lines of code.

  Treatment_ID Start_Date Valid Break_Months 47 0098 2012-10-01 2016-07-30 1 48 0098 2012-10-01 2016-08-31 1 49 0031 2016-09-01 2016-09-30 0 50 0031 2016-09-01 2016-10-30 1 51 0031 2016-09-01 2016-11-30 1 52 0031 2016-09-01 2016-12-30 1 53 0031 2016-09-01 2017-01-30 1 54 0031 2016-09-01 2017-03-02 1 55 0031 2016-09-01 2017-03-30 1 56 0012 2017-03-01 2017-03-30 -1

Treatment_ID 0012 has only one case, because it is new, and the Valid date is the date of the last treatment observation 0031 . Since Treatment_ID 0031 used in other months, then the difference is calculated using previous observations within the treatment. In the case of 0012 this is not possible, since the Valid date coincides with the fact that the last observation 0031 and 0012 does not contain more observations because it is new. When this happens, a comparison must be made with the last observation of the group preceding 0031 , this is 0098 . Using the term 0012 not equal to 0098 Break_Months calculated by the difference between 2017-03-01 ( Start_Date ) and 2016-08-31 ( Valid ), giving the value 6 by the same mechanic in the for structure, and not -1 .

My question is how to include this consideration in for . It was very difficult to try to do this, because I do not know how to integrate the comparison related to the date (if they are equal, as in the example), and look for the previous group up to the one that contains the same date. I tried using the lag function from the dplyr package to avoid for , but the results do not match. The version of dput() DF follows:

 DF<-structure(list(Treatment_ID = c("0031", "0031", "0045", "0057", "0057", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0031", "0031", "0031", "0031", "0031", "0031", "0031", "0012"), Start_Date = structure(c(1304208000, 1304208000, 1328054400, 1333238400, 1333238400, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1488326400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Valid = structure(c(1306713600, 1309392000, 1330560000, 1335744000, 1338336000, 1351555200, 1354233600, 1356825600, 1359504000, 1362182400, 1364601600, 1367280000, 1369872000, 1372550400, 1375142400, 1377820800, 1380499200, 1383091200, 1385769600, 1388361600, 1391040000, 1393718400, 1396137600, 1398816000, 1401408000, 1404086400, 1412035200, 1414627200, 1417305600, 1419897600, 1422576000, 1425254400, 1427673600, 1432944000, 1435622400, 1440892800, 1443571200, 1446163200, 1448841600, 1451433600, 1454112000, 1456790400, 1459296000, 1461974400, 1464566400, 1467244800, 1469836800, 1472601600, 1475193600, 1477785600, 1480464000, 1483056000, 1485734400, 1488412800, 1490832000, 1490832000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Treatment_ID", "Start_Date", "Valid"), row.names = c(NA, -56L), class = "data.frame")

Thank you for your help.

Update One of these solutions worked perfectly. Now I have a little problem when I have to calculate similar variables. First, I compute the Elapsed variable, i.e. the difference between Valid and Start_Date . I am using the following code:

 DF$Elapsed=round(as.numeric(DF$Valid-DF$Start_Date)/30,0)

Then a dilemma arises. I need to calculate the following two variables Last1 and Last2 . For this, I use the following code:

 #Compute Last1 DF$Last1=NA DF$Last1[1]=0 for(j in c(2:length(DF$Last1))) { DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1], ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[j-1],0)) }

The code worked in parts because I have a similar problem with the Break_Months variable. In this case, since 0031 and 0012 have the same Valid value, the Treatment_ID comparison should not be performed with the last value 0031 , where 7 is assigned due to the loop logic ( Elapsed variable). In this case, the correct value is 48, because the comparison should be structured with the last observation of group 0098 , then we will get 48. I tried to change using last_obs_index , but I can not get the correct result.

  Treatment_ID Start_Date Valid Break_Months Elapsed Last1 47 0098 2012-10-01 2016-07-30 1 47 2 48 0098 2012-10-01 2016-08-31 1 48 2 49 0031 2016-09-01 2016-09-30 0 1 48 50 0031 2016-09-01 2016-10-30 1 2 48 51 0031 2016-09-01 2016-11-30 1 3 48 52 0031 2016-09-01 2016-12-30 1 4 48 53 0031 2016-09-01 2017-01-30 1 5 48 54 0031 2016-09-01 2017-03-02 1 6 48 55 0031 2016-09-01 2017-03-30 1 7 48 56 0012 2017-03-01 2017-03-30 6 1 7

For Last2 variable Last2 I use the following code:

 #Compute Last2 DF$Last2=NA DF$Last2[1]=0 for(k in c(2:length(DF$Last2))) { DF$Last2[k]=ifelse(DF$Treatment_ID[k]==DF$Treatment_ID[k-1],DF$Last2[k-1], ifelse(DF$Treatment_ID[k]!=DF$Treatment_ID[k-1],DF$Break_Months[k],0)) }

In this case, it seems to have worked, but it is not. Although 6 is true, the comparison is not very well defined, since 0012 and 0031 have the same Valid date, and the optimal observation uses the latest observation from group 0098 . Therefore, the value of Break_Months . Again, I could not fix the loop with the correct logic defined with last_obs_index .

  Treatment_ID Start_Date Valid Break_Months Elapsed Last1 Last2 47 0098 2012-10-01 2016-07-30 1 47 2 4 48 0098 2012-10-01 2016-08-31 1 48 2 4 49 0031 2016-09-01 2016-09-30 0 1 48 0 50 0031 2016-09-01 2016-10-30 1 2 48 0 51 0031 2016-09-01 2016-11-30 1 3 48 0 52 0031 2016-09-01 2016-12-30 1 4 48 0 53 0031 2016-09-01 2017-01-30 1 5 48 0 54 0031 2016-09-01 2017-03-02 1 6 48 0 55 0031 2016-09-01 2017-03-30 1 7 48 0 56 0012 2017-03-01 2017-03-30 6 1 7 6

Thanks for all the help this time, is it possible to get tips on how to adapt the loops to properly compare comparisons.

+5

r dplyr

Duck May 26, '17 at 21:17

source share

2 answers

It uses a method that contains several additional dplyr functions, such as lag and if_else . It provides a comparison date (stored here to verify operability), subtracts the previous Effective date from it, and then converts it into a rounded number of “months” (30-day periods).

 library(dplyr) mutate(DF, comparison_date = if_else(Treatment_ID == lag(Treatment_ID), Valid, Start_Date), Break_Months = difftime(comparison_date, lag(Valid), units = "days"), Break_Months = as.numeric(round(Break_Months / 30))) #> Treatment_ID Start_Date Valid comparison_date Break_Months #> 1 0031 2011-05-01 2011-05-30 <NA> NA #> 2 0031 2011-05-01 2011-06-30 2011-06-30 1 #> 3 0045 2012-02-01 2012-03-01 2012-02-01 7 #> 4 0057 2012-04-01 2012-04-30 2012-04-01 1 #> 5 0057 2012-04-01 2012-05-30 2012-05-30 1 #> 6 0098 2012-10-01 2012-10-30 2012-10-01 4 #> 7 0098 2012-10-01 2012-11-30 2012-11-30 1 #> 8 0098 2012-10-01 2012-12-30 2012-12-30 1 #> 9 0098 2012-10-01 2013-01-30 2013-01-30 1 #> 10 0098 2012-10-01 2013-03-02 2013-03-02 1 ...

+3

Simon jackson May 30, '17 at 23:43

source share

johnckane · Accepted Answer · 2017-06-01T20:11:04+0000

To do this in the for loop, you need to add additional criteria for your calculation when the Treatment_ID value is different from the previous observation.

If the Treatment_ID value does not match the previous Start_Date , calculate the difference between the Start_Date for this case and the Valid value for the last case of the last Treatment_ID , where the last Valid also different.

To do this, you need to know the DF indices, where Treatment_ID values change, and Valid values change. You will need the Lag function from Hmisc

 library(Hmisc) new_obs_index <- which(DF$Treatment_ID != Lag(DF$Treatment_ID,1) & DF$Valid != Lag(DF$Valid))

This provides an index of the start of new observations, we really want the index of the last observation to be the last Treatment_ID before that.

 last_obs_index <- new_obs_index - 1

These are Valid value indices that meet the criteria for the last observation for Treatment_ID , so the Valid value also changes in the next observation.

Then, in the for loop, when the Treatment_ID value changes the subtraction of the difference between Start and the most recent Valid value that matches our criteria. We achieve this by indicating that we want

 DF$Valid[last_obs_index[max(which(last_obs_index < i))]]

So, the for loop looks like this:

 for(i in c(2:(length(DF$Break_Months)))){ DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1], round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),round(as.numeric(DF$Start_Date[i]-DF$Valid[last_obs_index[max(which(last_obs_index < i))]])/30,0)) }

This gives the desired result in the last observation in DF.

  Treatment_ID Start_Date Valid Break_Months 51 0031 2016-09-01 2016-11-30 1 52 0031 2016-09-01 2016-12-30 1 53 0031 2016-09-01 2017-01-30 1 54 0031 2016-09-01 2017-03-02 1 55 0031 2016-09-01 2017-03-30 1 56 0012 2017-03-01 2017-03-30 6

Thus, the necessary code for implementation is

 library(Hmisc) new_obs_index <- which(DF$Treatment_ID != Lag(DF$Treatment_ID,1) & DF$Valid != Lag(DF$Valid)) last_obs_index <- new_obs_index - 1 for(i in c(2:(length(DF$Break_Months)))){ DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),round(as.numeric(DF$Start_Date[i]-DF$Valid[last_obs_index[max(which(last_obs_index < i))]])/30,0)) }

UPDATE For the Last1 variable Last1 you can access the desired value using the last_obs_index vector using this syntax:

 for(j in c(2:length(DF$Last1))){ DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1],ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[last_obs_index[max(which(last_obs_index < i))]],0))}

For the Last2 variable, if I understand you correctly, I think your implementation will always give the answer you want. I think that to use the Break_Months value Break_Months a new processing value Break_Months enough, and you do not need another Valid value to use it.

Calculate the date difference between rows taking into account different groups within the data frame

More articles: