Collapsing only consecutive rows in data.table

I have a data table with approximately 30 columns and 100 million rows. The data contains several blocks of rows, where every row in a block has the same value in three specific columns that interest me. Here is an example where the columns of interest are Time, Fruit and Colour:

library(data.table)

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                   Ref = 1:10, 
                   Fruit = c(rep('banana', 2), 'apple', rep('banana', 2), 
                             rep('orange', 2), 'banana', rep('apple', 2)), 
                   Colour = c('green', 'yellow', 'red', rep('yellow', 2), 
                              rep('blue', 2), 'red', 'green', 'red'), 
                   Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))
dt

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   4 banana yellow     2
# 5:  101   5 banana yellow     4
# 6:  102   6 orange   blue     3
# 7:  102   7 orange   blue     1
# 8:  103   8 banana    red     5
# 9:  104   9  apple  green     5
#10:  105  10  apple    red     5

This example contains two blocks. The first consists of rows 4 and 5 (101-banana-yellow), and the second consists of rows 6 and 7 (102-orange-blue). Note that even though row 2 matches rows 4 and 5 in Time, Fruit and Colour, I do not want to include it in the block, because row 3 differs from rows 2, 4 and 5 and breaks the chain of consecutive matching rows.
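To make the notion of a "block" concrete, here is a minimal base R sketch that labels runs of consecutive identical keys in the example data. The names `key`, `r` and `run_id` are illustrative only, and pasting the three columns into one string is an assumption made for brevity, not something the question itself does.

```r
# The three columns of interest from the example, as plain vectors
Time   <- c(100, rep(101, 4), rep(102, 2), 103:105)
Fruit  <- c(rep("banana", 2), "apple", rep("banana", 2),
            rep("orange", 2), "banana", rep("apple", 2))
Colour <- c("green", "yellow", "red", rep("yellow", 2),
            rep("blue", 2), "red", "green", "red")

# One combined key per row (hypothetical helper, for illustration)
key <- paste(Time, Fruit, Colour, sep = "-")

# rle() finds runs of equal adjacent values; expand run numbers back to rows
r <- rle(key)
run_id <- rep(seq_along(r$lengths), r$lengths)
print(run_id)                  # 1 2 3 4 4 5 5 6 7 8

# Runs longer than one row are the blocks: rows 4-5 and rows 6-7
print(which(r$lengths > 1))    # 4 5
```

Row 2 gets its own run id (2) even though its key equals that of rows 4 and 5, which is exactly the behaviour wanted here.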

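Once the blocks are found, I want to collapse each one into a single row with the same Time, Fruit and Colour as the rows in the block. Ref should be the last Ref in the block, and Price should be the sum of the prices in the block. For the example above, the result should be: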

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   5 banana yellow     6
# 5:  102   7 orange   blue     4
# 6:  103   8 banana    red     5
# 7:  104   9  apple  green     5
# 8:  105  10  apple    red     5

I tried data.table's by, but it does not give the desired result:

byMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)), by = list(Time, Fruit, Colour)]
setcolorder(byMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
byMethod

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   5 banana yellow     7
# 3:  101   3  apple    red     1
# 4:  102   7 orange   blue     4
# 5:  103   8 banana    red     5
# 6:  104   9  apple  green     5
# 7:  105  10  apple    red     5

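This works for the 102-orange-blue block, but for the 101-banana-yellow block it also includes row 2, which is not wanted.

Is there a neat way to do this, preferably with data.table?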


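One way is to index the rows and then, within each Time/Fruit/Colour group, start a new subgroup wherever the indices are not adjacent: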

#create an index
dt[,i:=.I]
#group adjacent indices together
dt[, g:=cumsum(c(1, (diff(i) > 1))), by=list(Time, Fruit, Colour)]
#sum prices
dt[, list(Ref=tail(Ref, 1), Price=sum(Price)), 
   by=list(Time, Fruit, Colour, g)]

#    Time  Fruit Colour g Ref Price
# 1:  100 banana  green 1   1     1
# 2:  101 banana yellow 1   2     1
# 3:  101  apple    red 1   3     1
# 4:  101 banana yellow 2   5     6
# 5:  102 orange   blue 1   7     4
# 6:  103 banana    red 1   8     5
# 7:  104  apple  green 1   9     5
# 8:  105  apple    red 1  10     5
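The grouping step relies on `cumsum(c(1, diff(i) > 1))`. As a small worked sketch, here is that expression applied to the row indices of the 101-banana-yellow group from the example (rows 2, 4 and 5); the variable names `gap` and `g` are illustrative.

```r
# Row indices of the 101-banana-yellow rows in the example
i <- c(2, 4, 5)

# TRUE wherever neighbouring indices are more than one apart (a gap)
gap <- diff(i) > 1
print(gap)        # TRUE FALSE

# Start at 1 and bump the counter at every gap
g <- cumsum(c(1, gap))
print(g)          # 1 2 2
```

Row 2 ends up in subgroup 1 and rows 4 and 5 in subgroup 2, which matches the `g` column in the output above.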

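rleid() is implemented in data.table 1.9.5; see #686. From the NEWS file:

7) rleid(), a convenience function for generating a run-length type id column to be used in grouping operations, is now implemented. Closes #686. Check the ?rleid examples section for usage scenarios.

For this question: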

require(data.table) ## 1.9.5+
dt[, rleid := rleid(Time, Fruit, Colour)]
dt[, .(Ref = Ref[.N], Price = sum(Price)), by=.(Time, Fruit, Colour, rleid)]
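As a self-contained check, the sketch below rebuilds the example table and computes the run id directly inside `by`, so no helper column is added to `dt`. The names `res` and `run` are illustrative, and dropping `run` afterwards is a cosmetic choice to match the layout asked for in the question.

```r
library(data.table)  # rleid() requires data.table 1.9.5 or later

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
                 Ref = 1:10,
                 Fruit = c(rep("banana", 2), "apple", rep("banana", 2),
                           rep("orange", 2), "banana", rep("apple", 2)),
                 Colour = c("green", "yellow", "red", rep("yellow", 2),
                            rep("blue", 2), "red", "green", "red"),
                 Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))

# Group by the three columns plus a run id, so only adjacent rows collapse
res <- dt[, .(Ref = Ref[.N], Price = sum(Price)),
          by = .(Time, Fruit, Colour, run = rleid(Time, Fruit, Colour))]

# Drop the helper id and restore the original column order
res[, run := NULL]
setcolorder(res, c("Time", "Ref", "Fruit", "Colour", "Price"))
print(res)
```

On the example data this gives eight rows, with rows 4 and 5 collapsed to Ref 5 and Price 6, and rows 6 and 7 collapsed to Ref 7 and Price 4.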

Source: https://habr.com/ru/post/1524931/

