I have a data.table with approximately 30 columns and 100 million rows. The data contains several blocks of consecutive rows, where every row in a block has the same value in three specific columns that interest me. Here is an example where the columns that interest me are Time, Fruit, and Colour:
library(data.table)

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                 Ref = 1:10,
                 Fruit = c(rep('banana', 2), 'apple', rep('banana', 2),
                           rep('orange', 2), 'banana', rep('apple', 2)),
                 Colour = c('green', 'yellow', 'red', rep('yellow', 2),
                            rep('blue', 2), 'red', 'green', 'red'),
                 Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))
dt
This example contains two blocks. The first, 101-banana-yellow, consists of lines 4 and 5, and the second, 102-orange-blue, consists of lines 6 and 7. Note that even though line 2 matches lines 4 and 5 in time, fruit and colour, I do not want to include it in the block, since line 3 differs from lines 2, 4 and 5 and breaks the chain of consecutive matching lines.
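I want to collapse each block into a single row that keeps Time, Fruit, and Colour. Ref should be taken from the last row of the block, and Price should be the sum over the block.
I tried data.table's by, as follows: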
byMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)), by = list(Time, Fruit, Colour)]
setcolorder(byMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
byMethod
This works for the 102-orange-blue block, but for 101-banana-yellow, line 2 is grouped together with lines 4 and 5, which is not what I want.
Is there a data.table way to group only consecutive matching rows?
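A minimal sketch of one possible approach, assuming a data.table version that provides rleid() (1.9.6 or later, as far as I know). rleid() assigns the same id to each run of consecutive rows with identical values, so adding it to by should keep line 2 apart from lines 4 and 5. The names block and blockMethod are hypothetical and used only for illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                 Ref = 1:10,
                 Fruit = c(rep('banana', 2), 'apple', rep('banana', 2),
                           rep('orange', 2), 'banana', rep('apple', 2)),
                 Colour = c('green', 'yellow', 'red', rep('yellow', 2),
                            rep('blue', 2), 'red', 'green', 'red'),
                 Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))

# rleid() gives every run of consecutive rows with identical
# (Time, Fruit, Colour) the same id and starts a new id whenever
# any of the three values changes. Grouping on that id as well
# keeps non-adjacent matches (line 2 vs. lines 4 and 5) separate.
blockMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)),
                  by = list(Time, Fruit, Colour,
                            block = rleid(Time, Fruit, Colour))]

# 'block' is only a helper column (hypothetical name); drop it again
blockMethod[, block := NULL]
setcolorder(blockMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
blockMethod
```

On the example data this gives 8 rows: lines 4 and 5 collapse into one row with Ref 5 and Price 6, lines 6 and 7 collapse into one row with Ref 7 and Price 4, and line 2 stays on its own. I have not measured how this behaves on 100 million rows.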