How to combine data with unequal length depending on time with buffer intervals in R?

Question

How to combine data with unequal length depending on time with buffer intervals in R?

I am trying to combine two dataframes where the common denominator is time. However, time records may differ between them. I want to combine the two in time, but with an interval of 30 minutes in the buffer.

`dataframes` conceptually configured as follows:

 Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1))

This means that the first three rows in Data_sif correspond to the first row in Data_cam, and I want to add a “Cancel” column from the first row in Data_cam to the first three rows in Data_sif. Similarly, the 4th and 5th rows in Data_sif correspond to the second row in Data_cam, and I want to add “Cancel” here, etc. For all lines. The value in the Undo column should be repeated for each value displayed in the Class_Size column for a common timestamp.

The desired result will look like this:

 Data_combined <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1), Discard=c(0.28,0.28,0.28,0.96,0.96,2.92,2.92,2.92,0))

I want to add more columns to the final implementation, including positional data, but for simplicity I would like to start by combining the Discard column.

I tried old posts but could not implement it for the data that I have.

+5

merge time r dataframe intervals

ltkrisp Mar 2 '18 at 8:57

source share

3 answers

Val · Answer 1 · 2018-03-02T09:37:06+0000

Here's a solution with lubridate and dplyr . This is a bit strange, but it works:

 library(lubridate) library(dplyr) Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1)) Data_sif %>%left_join(., Data_cam, by = "VesselID",suffix=c('_sif','_cam')) %>% mutate(buff1 = dmy_hms(Start_haul_cam) - minutes(30)) %>% mutate(buff2 = dmy_hms(Start_haul_cam) + minutes(30)) %>% filter(dmy_hms(Start_haul_sif) >= buff1 & dmy_hms(Start_haul_sif) <= buff2) %>% select(-contains('_cam')) %>% select(-contains('buff')) # Start_haul_sif VesselID Species_sif Size_class Landing_kg Discard # 1 31-10-2015 07:05:00 XBBX TOR 1 10.5 0.28 # 2 31-10-2015 07:05:00 XBBX TOR 2 20.5 0.28 # 3 31-10-2015 07:05:00 XBBX TOR 3 5.6 0.28 # 4 31-10-2015 23:05:00 XBBX TOR 4 400.0 0.96 # 5 31-10-2015 23:05:00 XBBX TOR 5 2.0 0.96 # 6 01-11-2015 06:28:00 XAAX TOR 1 120.0 2.92 # 7 01-11-2015 06:28:00 XAAX TOR 2 250.0 2.92 # 8 01-11-2015 06:28:00 XAAX TOR 4 10.3 2.92 # 9 01-11-2015 16:11:00 XAAX TOR 5 2.1 0.00

Edit:

Or decreased slightly:

 Data_sif %>% left_join(., Data_cam, by = "VesselID",suffix=c('_sif','_cam')) %>% filter(dmy_hms(Start_haul_sif) >= dmy_hms(Start_haul_cam) - minutes(30) & dmy_hms(Start_haul_sif) <= dmy_hms(Start_haul_cam) + minutes(30)) %>% select(-contains('_cam'))

MKR · Answer 2 · 2018-03-02T09:42:15+0000

One solution can be achieved with sqldf .

 library(sqldf) # First convert Start_haul to Date/time Data_cam$Start_haul <- as.POSIXct(Data_cam$Start_haul, format = "%d-%m-%Y %H:%M:%S") Data_sif$Start_haul <- as.POSIXct(Data_sif$Start_haul, format = "%d-%m-%Y %H:%M:%S") # The absolute difference between Start_haul is considered as less than # 30*60 (1800 seconds) for joining. sqldf("SELECT Data_sif.Start_haul, Data_sif.VesselID, Data_sif.Species, Data_sif.Size_class, Data_sif.Landing_kg, Data_cam.Discard FROM Data_sif, Data_cam WHERE Data_sif.VesselID = Data_cam.VesselID AND Data_sif.Species = Data_cam.Species AND abs(Data_sif.Start_haul - Data_cam.Start_haul) <= 30*60 ") # Result # Start_haul VesselID Species Size_class Landing_kg Discard #1 31-10-2015 07:05:00 XBBX TOR 1 10.5 0.28 #2 31-10-2015 07:05:00 XBBX TOR 2 20.5 0.28 #3 31-10-2015 07:05:00 XBBX TOR 3 5.6 0.28 #4 31-10-2015 23:05:00 XBBX TOR 4 400.0 0.96 #5 31-10-2015 23:05:00 XBBX TOR 5 2.0 0.96 #6 01-11-2015 06:28:00 XAAX TOR 1 120.0 2.92 #7 01-11-2015 06:28:00 XAAX TOR 2 250.0 2.92 #8 01-11-2015 06:28:00 XAAX TOR 4 10.3 2.92 #9 01-11-2015 16:11:00 XAAX TOR 5 2.1 0.00

Data

 Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1))

chinsoon12 · Answer 3 · 2018-03-02T10:24:03+0000

Perhaps you should consider using non-equi connections from data.table as follows:

 library(data.table) setDT(Data_cam) setDT(Data_sif) #convert to POSIX datetime and create the 30mins buffer before and after Start_haul Data_cam[, Start_haul := as.POSIXct(Start_haul, format="%d-%m-%Y %H:%M:%S")][, c("BufferStart", "BufferEnd") := .(Start_haul - 30*60, Start_haul + 30*60)] Data_sif[, Start_haul := as.POSIXct(Start_haul, format="%d-%m-%Y %H:%M:%S")] #look up the Discard column using non-equi join from data.table package Data_sif[Data_cam, Discard:=Discard, on=.(VesselID, Species, Start_haul >= Start_haul, Start_haul <= BufferEnd)]

How to combine data with unequal length depending on time with buffer intervals in R?

dataframes conceptually configured as follows:

More articles:

`dataframes` conceptually configured as follows: