Aggregating duplicate rows by taking the sum

Following up on my earlier questions about:
1. Determining whether a set of variables uniquely identifies each row of the data;
2. Tagging all rows that are duplicates in terms of a given set of variables.
I would now like to aggregate/consolidate all duplicate rows in terms of a given set of variables, by taking their sum.
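
To make the goal concrete, here is a minimal sketch on a toy data frame. It uses base R's aggregate() rather than the packages discussed below, and the names toy and toyAgg are hypothetical, chosen only for illustration.

```r
# Toy data: rows 1 and 3 share the same (f1, f2) key
toy <- data.frame(f1 = c("a", "b", "a"),
                  f2 = c("x", "y", "x"),
                  data = c(1, 10, 2))

# Sum `data` within each (f1, f2) combination,
# leaving one row per distinct key
toyAgg <- aggregate(data ~ f1 + f2, data = toy, FUN = sum)
print(toyAgg)
```

The two rows with key (a, x) collapse into a single row whose data value is their sum, while the (b, y) row is carried over unchanged.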

Solution 1:

There is existing guidance on how to do this, but when the variables that form the index have a large number of levels, the recommended ddply method is slow, as it was when I tried to tag all duplicates by a given set of variables.

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))

# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])

# dummy data 
dfDup$data = rnorm(nrow(dfDup))

# aggregate the duplicate rows by taking the sum (ddply comes from plyr)
library(plyr)
dfDupAgg = ddply(dfDup, .(f1, f2, f3, f4), summarise, data = sum(data))

Solution 2:

The second solution is to use data.table:

# data.table solution
library(data.table)
indexVars = paste0('f', 1:4)  # paste0 has no sep argument
dtDup = data.table(dfDup, key = indexVars)
dtDupAgg = dtDup[, list(data = sum(data)), by = key(dtDup)]

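I have two questions:
1. Is there a way to make the ddply method faster?
2. Is the data.table solution correct? I ask because I am new to data.table.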

Regarding your data.table solution, you do not need to set a key for aggregation. You can do it directly:

indexVars = paste0('f', 1:4)
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]

data.table 1.9.2+ also provides setDT, which converts data.frames to data.tables by reference (no copy is made, so the conversion takes almost no time).

So, instead of:

dtDup <- as.data.table(dfDup)
dtDup[...]

you can do:

## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...]   ## dfDup is now a data.table, converted by reference
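
As a small sketch of what conversion by reference looks like in practice, the following uses a hypothetical toy data frame (df and agg are illustrative names, not from the original post) and assumes data.table 1.9.2 or later is installed.

```r
library(data.table)

# Hypothetical toy data with one duplicated key ("a")
df <- data.frame(f1 = c("a", "a", "b"), data = c(1, 2, 10))

# setDT modifies df in place, so no assignment is needed
setDT(df)
is.data.table(df)

# The same aggregation idiom as above, now on the converted object
agg <- df[, list(data = sum(data)), by = "f1"]
print(agg)
```

Because no copy is made, the original data.frame no longer exists as a plain data.frame afterwards; use as.data.table() instead if the untouched original is still needed.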

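On your first question: plyr is not known for its speed. See the discussion of why plyr is slow (and the comments there) for more information.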

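You may also be interested in dplyr, which is much faster than plyr but still slower than data.table, IMHO. Here is the equivalent dplyr version: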

## early dplyr used %.% for chaining; current dplyr uses %>%
dfDup %>% group_by(f1, f2, f3, f4) %>% summarise(data = sum(data))

Here is a benchmark of data.table versus dplyr on your data:

## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
#  user  system elapsed 
# 0.049   0.009   0.057 

## dplyr (commit ~1360 from github; %.% was the chaining operator then, now %>%)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
#  user  system elapsed 
# 0.374   0.013   0.389 

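I did not have the patience to run the plyr version to completion (I stopped it after 93 seconds). As you can see, dplyr is much faster than plyr, but ~7x slower than data.table here.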


To be sure, we can check that the results are equal:

all.equal(as.data.frame(ans1[order(f1,f2,f3,f4)]), 
          as.data.frame(ans2))
# [1] TRUE
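
Beyond comparing two implementations against each other, the aggregation itself can be sanity-checked. The sketch below is an illustration on a smaller, hypothetical grid (small, dup, dt and agg are made-up names, and the seed is arbitrary): after aggregating, there should be exactly one row per key combination and the grand total should be unchanged.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility

# Small full grid of keys, plus 5 resampled (duplicate) rows
small <- expand.grid(f1 = factor(1:3), f2 = factor(1:4))
dup <- rbind(small, small[sample(nrow(small), 5), ])
dup$data <- rnorm(nrow(dup))

dt <- as.data.table(dup)
agg <- dt[, list(data = sum(data)), by = c("f1", "f2")]

# One row per key combination, no duplicated keys left,
# and the overall sum is preserved (up to floating-point tolerance)
stopifnot(nrow(agg) == nrow(small))
stopifnot(anyDuplicated(as.data.frame(agg)[c("f1", "f2")]) == 0)
stopifnot(isTRUE(all.equal(sum(agg$data), sum(dup$data))))
```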


Source: https://habr.com/ru/post/1535236/
