How to combine data with unequal length depending on time with buffer intervals in R?

I am trying to combine two dataframes where the common denominator is time. However, time records may differ between them. I want to combine the two in time, but with an interval of 30 minutes in the buffer.

dataframes conceptually configured as follows:

 Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1)) 

This means that the first three rows in Data_sif correspond to the first row in Data_cam, and I want to add a β€œCancel” column from the first row in Data_cam to the first three rows in Data_sif. Similarly, the 4th and 5th rows in Data_sif correspond to the second row in Data_cam, and I want to add β€œCancel” here, etc. For all lines. The value in the Undo column should be repeated for each value displayed in the Class_Size column for a common timestamp.

The desired result will look like this:

 Data_combined <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1), Discard=c(0.28,0.28,0.28,0.96,0.96,2.92,2.92,2.92,0)) 

I want to add more columns to the final implementation, including positional data, but for simplicity I would like to start by combining the Discard column.

I tried old posts but could not implement it for the data that I have.

+5
source share
3 answers

Here's a solution with lubridate and dplyr . This is a bit strange, but it works:

 library(lubridate) library(dplyr) Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1)) Data_sif %>%left_join(., Data_cam, by = "VesselID",suffix=c('_sif','_cam')) %>% mutate(buff1 = dmy_hms(Start_haul_cam) - minutes(30)) %>% mutate(buff2 = dmy_hms(Start_haul_cam) + minutes(30)) %>% filter(dmy_hms(Start_haul_sif) >= buff1 & dmy_hms(Start_haul_sif) <= buff2) %>% select(-contains('_cam')) %>% select(-contains('buff')) # Start_haul_sif VesselID Species_sif Size_class Landing_kg Discard # 1 31-10-2015 07:05:00 XBBX TOR 1 10.5 0.28 # 2 31-10-2015 07:05:00 XBBX TOR 2 20.5 0.28 # 3 31-10-2015 07:05:00 XBBX TOR 3 5.6 0.28 # 4 31-10-2015 23:05:00 XBBX TOR 4 400.0 0.96 # 5 31-10-2015 23:05:00 XBBX TOR 5 2.0 0.96 # 6 01-11-2015 06:28:00 XAAX TOR 1 120.0 2.92 # 7 01-11-2015 06:28:00 XAAX TOR 2 250.0 2.92 # 8 01-11-2015 06:28:00 XAAX TOR 4 10.3 2.92 # 9 01-11-2015 16:11:00 XAAX TOR 5 2.1 0.00 

Edit:

Or decreased slightly:

 Data_sif %>% left_join(., Data_cam, by = "VesselID",suffix=c('_sif','_cam')) %>% filter(dmy_hms(Start_haul_sif) >= dmy_hms(Start_haul_cam) - minutes(30) & dmy_hms(Start_haul_sif) <= dmy_hms(Start_haul_cam) + minutes(30)) %>% select(-contains('_cam')) 
+1
source

One solution can be achieved with sqldf .

 library(sqldf) # First convert Start_haul to Date/time Data_cam$Start_haul <- as.POSIXct(Data_cam$Start_haul, format = "%d-%m-%Y %H:%M:%S") Data_sif$Start_haul <- as.POSIXct(Data_sif$Start_haul, format = "%d-%m-%Y %H:%M:%S") # The absolute difference between Start_haul is considered as less than # 30*60 (1800 seconds) for joining. sqldf("SELECT Data_sif.Start_haul, Data_sif.VesselID, Data_sif.Species, Data_sif.Size_class, Data_sif.Landing_kg, Data_cam.Discard FROM Data_sif, Data_cam WHERE Data_sif.VesselID = Data_cam.VesselID AND Data_sif.Species = Data_cam.Species AND abs(Data_sif.Start_haul - Data_cam.Start_haul) <= 30*60 ") # Result # Start_haul VesselID Species Size_class Landing_kg Discard #1 31-10-2015 07:05:00 XBBX TOR 1 10.5 0.28 #2 31-10-2015 07:05:00 XBBX TOR 2 20.5 0.28 #3 31-10-2015 07:05:00 XBBX TOR 3 5.6 0.28 #4 31-10-2015 23:05:00 XBBX TOR 4 400.0 0.96 #5 31-10-2015 23:05:00 XBBX TOR 5 2.0 0.96 #6 01-11-2015 06:28:00 XAAX TOR 1 120.0 2.92 #7 01-11-2015 06:28:00 XAAX TOR 2 250.0 2.92 #8 01-11-2015 06:28:00 XAAX TOR 4 10.3 2.92 #9 01-11-2015 16:11:00 XAAX TOR 5 2.1 0.00 

Data

 Data_cam <- data.frame(Start_haul=c(("31-10-2015 07:13:00"),("31-10-2015 22:40:00"),("01-11-2015 06:48:00"),("01-11-2015 16:13:00")), VesselID=c('XBBX','XBBX','XAAX','XAAX'), Species=("TOR"), Discard=c(0.28,0.96,2.92,0)) Data_sif <- data.frame(Start_haul=c(("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 07:05:00"),("31-10-2015 23:05:00"),("31-10-2015 23:05:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 06:28:00"),("01-11-2015 16:11:00")), VesselID=c('XBBX','XBBX','XBBX','XBBX','XBBX','XAAX','XAAX','XAAX','XAAX'),Species=("TOR"), Size_class=c("1","2","3","4","5","1","2","4","5"), Landing_kg=c(10.5,20.5,5.6,400,2,120,250,10.3,2.1)) 
+1
source

Perhaps you should consider using non-equi connections from data.table as follows:

 library(data.table) setDT(Data_cam) setDT(Data_sif) #convert to POSIX datetime and create the 30mins buffer before and after Start_haul Data_cam[, Start_haul := as.POSIXct(Start_haul, format="%d-%m-%Y %H:%M:%S")][, c("BufferStart", "BufferEnd") := .(Start_haul - 30*60, Start_haul + 30*60)] Data_sif[, Start_haul := as.POSIXct(Start_haul, format="%d-%m-%Y %H:%M:%S")] #look up the Discard column using non-equi join from data.table package Data_sif[Data_cam, Discard:=Discard, on=.(VesselID, Species, Start_haul >= Start_haul, Start_haul <= BufferEnd)] 
0
source

Source: https://habr.com/ru/post/1275659/


All Articles