Consider the following data.table of events:
library(data.table)

breaks <- data.table(
  id      = 1:8,
  Channel = c("NP1", "NP1", "NP2", "NP2", "NP3", "NP3", "AT4", "AT4"),
  Time    = c(1000, 1100, 975, 1075, 1010, 1080, 1000, 1050),
  Day     = rep(1, 8),
  ZA      = c(15, 12, 4, 2, 1, 2, 23, 18),
  stringsAsFactors = FALSE
)
breaks
id Channel Time Day ZA
1: 1 NP1 1000 1 15
2: 2 NP1 1100 1 12
3: 3 NP2 975 1 4
4: 4 NP2 1075 1 2
5: 5 NP3 1010 1 1
6: 6 NP3 1080 1 2
7: 7 AT4 1000 1 23
8: 8 AT4 1050 1 18
For each event in breaks, I want to find the nearest event (by Time) in every other channel, considering only events on the same Day, and then sum the ZA values of those nearest events.
This is the result I want to achieve:
id Channel Time Day ZA Sum
1: 1 NP1 1000 1 15 28
2: 2 NP1 1100 1 12 22
3: 3 NP2 975 1 4 39
4: 4 NP2 1075 1 2 32
5: 5 NP3 1010 1 1 42
6: 6 NP3 1080 1 2 32
7: 7 AT4 1000 1 23 20
8: 8 AT4 1050 1 18 19
So, for the first row (channel NP1, Time = 1000): the events closest in time in each of the other channels are rows 3, 5 and 7, giving 4 + 1 + 23 = 28.
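To make the expected output concrete, here is a brute-force recomputation of the Sum column on the toy data (quadratic per Day, so only useful as a correctness check, never for real sizes):

# For every row, take the nearest-in-time event of each other channel
# on the same Day and sum their ZA values.
check <- sapply(seq_len(nrow(breaks)), function(i) {
  others <- breaks[Channel != breaks$Channel[i] & Day == breaks$Day[i]]
  sum(others[, ZA[which.min(abs(Time - breaks$Time[i]))], by = Channel]$V1)
})
check
# [1] 28 22 39 32 42 32 20 19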
I got this to work using data.table with the following code:
breaks[breaks[, c("Day", "Time", "Channel", "ZA")],
       on = "Day", allow.cartesian = TRUE][      # pair every event with every event on the same Day
  Channel != i.Channel][                         # drop pairs within the same channel
  order(id)][
  , delta := abs(Time - i.Time)][                # absolute time difference of each pair
  , .SD[delta == min(delta)],
    by = .(Channel, Time, Day, i.Channel)][      # keep only the nearest pair per other channel
  , unique(.SD, by = c("id", "i.Channel"))][     # one match per event and other channel (resolves ties)
  , .(Sum = sum(i.ZA)), by = .(id, Channel, Time, Day, ZA)]  # sum the matched ZA values
However, the first step already creates an intermediate table of 64 rows (8 × 8, since all events share the same Day), and my real data set has more than a million rows, so the cartesian join does not scale.
Can someone help me find a more efficient way to do this?
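One direction that might avoid the cartesian blow-up is a per-channel rolling join with roll = "nearest" (a minimal, unbenchmarked sketch; it assumes every channel has at least one event on each Day, and on exact ties it keeps a single match, similar to the unique() step above):

channels <- unique(breaks$Channel)

# For each channel ch, roll-join all events of the other channels onto
# their nearest ch event (Day matched exactly, Time rolled to "nearest"),
# collecting the matched ZA per event id.
nearest <- rbindlist(lapply(channels, function(ch) {
  x <- breaks[Channel == ch]   # lookup table: events of channel ch
  q <- breaks[Channel != ch]   # queries: every event outside ch
  q[, .(id, matchZA = x[q, on = .(Day, Time), roll = "nearest", x.ZA])]
}))

# One matched ZA per (event, other channel); sum them per event id.
breaks[nearest[, .(Sum = sum(matchZA)), by = id], Sum := i.Sum, on = "id"]

This performs one binary-search join per channel instead of materializing all within-Day pairs.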
Edit:
I tested the proposed solutions — the sqldf answer, Eddi's data.table answer, and MarkusN's dplyr answer — on the full data set of 1.4 million rows and 39 channels. Run times:
sqldf: 54 minutes
data.table: 11 hours
dplyr: 29 hours