I work with some processing information applied to animals in R First, I would like to describe the structure of my information (in the end I will add a version of dput() ). My DF data looks like this:
Treatment_ID Start_Date Valid 1 0031 2011-05-01 2011-05-30 2 0031 2011-05-01 2011-06-30 3 0045 2012-02-01 2012-03-01 4 0057 2012-04-01 2012-04-30 5 0057 2012-04-01 2012-05-30 6 0098 2012-10-01 2012-10-30
It contains 56 lines and three variables Treatment_ID (5 types of treatment), Start_Date (Date of start of treatment) and Valid (End date of treatment). For example, Treatment_ID 0031 has two observations because it started in May 2011 and ended in June 2011. Then a new appeal 0045 began in February 2012 and ended in March 2012 (only one observation). The same structure applies to all groups within the DF . I need to calculate the difference between the months between each treatment and each treatment, using some conditions. I will use the first two procedures to show this:
Treatment_ID Start_Date Valid 1 0031 2011-05-01 2011-05-30 2 0031 2011-05-01 2011-06-30 3 0045 2012-02-01 2012-03-01 4 0057 2012-04-01 2012-04-30
In this example, I have two lines for the first treatment, where the variable Treatment_ID is equal. When this happens, you need to calculate the difference in months for the Valid variable. When a new processing appears, it is necessary to calculate the difference in months between Start_Date and Valid . Note that when processing more than one case, the difference is obtained using the Valid variable for cases in this group, but when Treatment_ID changed, the difference must be obtained using the Start_Date and Valid variables. To get this Break_Months variable, I used the following structure:
DF$Break_Months=NA for(i in c(2:(length(DF$Break_Months)))) { DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0), round(as.numeric(DF$Start_Date[i]-DF$Valid[i-1])/30,0)) }
This for , when Treatment_ID equal, calculates the difference between the actual line and the previous one with the Valid variable, and when they are different, the difference is calculated using Start_Date and Valid . The first Break_Months value is NA , because there is no previous value for comparison. The problem occurred at the end of DF when I used the previous lines of code.
Treatment_ID Start_Date Valid Break_Months 47 0098 2012-10-01 2016-07-30 1 48 0098 2012-10-01 2016-08-31 1 49 0031 2016-09-01 2016-09-30 0 50 0031 2016-09-01 2016-10-30 1 51 0031 2016-09-01 2016-11-30 1 52 0031 2016-09-01 2016-12-30 1 53 0031 2016-09-01 2017-01-30 1 54 0031 2016-09-01 2017-03-02 1 55 0031 2016-09-01 2017-03-30 1 56 0012 2017-03-01 2017-03-30 -1
Treatment_ID 0012 has only one case, because it is new, and the Valid date is the date of the last treatment observation 0031 . Since Treatment_ID 0031 used in other months, then the difference is calculated using previous observations within the treatment. In the case of 0012 this is not possible, since the Valid date coincides with the fact that the last observation 0031 and 0012 does not contain more observations because it is new. When this happens, a comparison must be made with the last observation of the group preceding 0031 , this is 0098 . Using the term 0012 not equal to 0098 Break_Months calculated by the difference between 2017-03-01 ( Start_Date ) and 2016-08-31 ( Valid ), giving the value 6 by the same mechanic in the for structure, and not -1 .
My question is how to include this consideration in for . It was very difficult to try to do this, because I do not know how to integrate the comparison related to the date (if they are equal, as in the example), and look for the previous group up to the one that contains the same date. I tried using the lag function from the dplyr package to avoid for , but the results do not match. The version of dput() DF follows:
DF<-structure(list(Treatment_ID = c("0031", "0031", "0045", "0057", "0057", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", "0031", "0031", "0031", "0031", "0031", "0031", "0031", "0012"), Start_Date = structure(c(1304208000, 1304208000, 1328054400, 1333238400, 1333238400, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1488326400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Valid = structure(c(1306713600, 1309392000, 1330560000, 1335744000, 1338336000, 1351555200, 1354233600, 1356825600, 1359504000, 1362182400, 1364601600, 1367280000, 1369872000, 1372550400, 1375142400, 1377820800, 1380499200, 1383091200, 1385769600, 1388361600, 1391040000, 1393718400, 1396137600, 1398816000, 1401408000, 1404086400, 1412035200, 1414627200, 1417305600, 1419897600, 1422576000, 1425254400, 1427673600, 1432944000, 1435622400, 1440892800, 1443571200, 1446163200, 1448841600, 1451433600, 1454112000, 1456790400, 1459296000, 1461974400, 1464566400, 1467244800, 1469836800, 1472601600, 1475193600, 1477785600, 1480464000, 1483056000, 1485734400, 1488412800, 1490832000, 1490832000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Treatment_ID", "Start_Date", "Valid"), row.names = c(NA, -56L), class = "data.frame")
Thank you for your help.
Update One of these solutions worked perfectly. Now I have a little problem when I have to calculate similar variables. First, I compute the Elapsed variable, i.e. the difference between Valid and Start_Date . I am using the following code:
DF$Elapsed=round(as.numeric(DF$Valid-DF$Start_Date)/30,0)
Then a dilemma arises. I need to calculate the following two variables Last1 and Last2 . For this, I use the following code:
#Compute Last1 DF$Last1=NA DF$Last1[1]=0 for(j in c(2:length(DF$Last1))) { DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1], ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[j-1],0)) }
The code worked in parts because I have a similar problem with the Break_Months variable. In this case, since 0031 and 0012 have the same Valid value, the Treatment_ID comparison should not be performed with the last value 0031 , where 7 is assigned due to the loop logic ( Elapsed variable). In this case, the correct value is 48, because the comparison should be structured with the last observation of group 0098 , then we will get 48. I tried to change using last_obs_index , but I can not get the correct result.
Treatment_ID Start_Date Valid Break_Months Elapsed Last1 47 0098 2012-10-01 2016-07-30 1 47 2 48 0098 2012-10-01 2016-08-31 1 48 2 49 0031 2016-09-01 2016-09-30 0 1 48 50 0031 2016-09-01 2016-10-30 1 2 48 51 0031 2016-09-01 2016-11-30 1 3 48 52 0031 2016-09-01 2016-12-30 1 4 48 53 0031 2016-09-01 2017-01-30 1 5 48 54 0031 2016-09-01 2017-03-02 1 6 48 55 0031 2016-09-01 2017-03-30 1 7 48 56 0012 2017-03-01 2017-03-30 6 1 7
For Last2 variable Last2 I use the following code:
#Compute Last2 DF$Last2=NA DF$Last2[1]=0 for(k in c(2:length(DF$Last2))) { DF$Last2[k]=ifelse(DF$Treatment_ID[k]==DF$Treatment_ID[k-1],DF$Last2[k-1], ifelse(DF$Treatment_ID[k]!=DF$Treatment_ID[k-1],DF$Break_Months[k],0)) }
In this case, it seems to have worked, but it is not. Although 6 is true, the comparison is not very well defined, since 0012 and 0031 have the same Valid date, and the optimal observation uses the latest observation from group 0098 . Therefore, the value of Break_Months . Again, I could not fix the loop with the correct logic defined with last_obs_index .
Treatment_ID Start_Date Valid Break_Months Elapsed Last1 Last2 47 0098 2012-10-01 2016-07-30 1 47 2 4 48 0098 2012-10-01 2016-08-31 1 48 2 4 49 0031 2016-09-01 2016-09-30 0 1 48 0 50 0031 2016-09-01 2016-10-30 1 2 48 0 51 0031 2016-09-01 2016-11-30 1 3 48 0 52 0031 2016-09-01 2016-12-30 1 4 48 0 53 0031 2016-09-01 2017-01-30 1 5 48 0 54 0031 2016-09-01 2017-03-02 1 6 48 0 55 0031 2016-09-01 2017-03-30 1 7 48 0 56 0012 2017-03-01 2017-03-30 6 1 7 6
Thanks for all the help this time, is it possible to get tips on how to adapt the loops to properly compare comparisons.