Fast conversion from a long data frame to a wide array

I have an old problem made difficult by the size of the dataset. The problem is converting a data frame from long format to a wide matrix:

set.seed(314)
A <- data.frame(field1 = sample(letters, 10, replace=FALSE), 
    field2 = sample(toupper(letters), 10, replace=FALSE), 
    value=1:10)

B <- with(A, tapply(value, list(field1, field2), sum))

This can also be done with the old reshape in base R, or better, with plyr and reshape2. In plyr:

library(plyr)
daply(A, .(field1, field2), function(d) sum(d$value))  # sum only the value column

In reshape2:

library(reshape2)
dcast(A, field1 ~ field2, sum, value.var = "value")

The problem is that there are 30+ million rows in the data frame, with at least 5,000 unique values for field1 and 20,000 for field2. At this size, plyr crashes, reshape2 sometimes crashes, and tapply is very slow. The machine is not a limitation (48 GB, 50% utilization, 8-core Xeon). What is the best practice for this task?

N.B.: - . , . , , dcast.data.table, . data.table - .


FWIW, here is a solution using data.table.

( : dcast.data.table @BenBolker, . , , ).

Sample data (same dimensions as in the question):

require(data.table) ## >= 1.9.2
set.seed(1L)
N = 30e6L
DT <- data.table(field1 = sample(paste0("F1_", 1:5000), N, TRUE), 
                 field2 = sample(paste0("F2_", 1:20000), N, TRUE),
                 value  = sample(10))

> tables()
#      NAME       NROW  MB COLS                KEY
# [1,] DT   30,000,000 574 field1,field2,value
# Total: 574MB

Aggregate first:

system.time(ans <- DT[, list(value=sum(value)), by=list(field1, field2)])
#   user  system elapsed
# 15.097   3.357  18.454

Then fill a matrix using @BenBolker's indexing approach (instead of casting):

system.time({
    rlabs <- sort(unique(ans$field1))
    clabs <- sort(unique(ans$field2))
    fans <- matrix(NA,length(rlabs),length(clabs),
              dimnames=list(rlabs,clabs))
    fans[as.matrix(ans[,1:2, with=FALSE])] <- ans$value
})
#   user  system elapsed
# 18.630   1.524  20.154
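
The two steps above (aggregate, then fill a pre-allocated matrix) can be checked on a small example. This is only a sketch: the table size and the names `wide`, `idx` and `ref` are made up for illustration, and the use of `match()` to build integer positions is a variation on the character-matrix indexing shown above, not something taken from the answer.

```r
library(data.table)

set.seed(1L)
n <- 1000L
DT <- data.table(field1 = sample(paste0("F1_", 1:5), n, TRUE),
                 field2 = sample(paste0("F2_", 1:8), n, TRUE),
                 value  = sample(10L, n, TRUE))

# Step 1: collapse duplicates so each (field1, field2) pair occurs once.
ans <- DT[, list(value = sum(value)), by = list(field1, field2)]

# Step 2: allocate the wide matrix and fill it by (row, column) position.
rlabs <- sort(unique(ans$field1))
clabs <- sort(unique(ans$field2))
wide <- matrix(NA_integer_, length(rlabs), length(clabs),
               dimnames = list(rlabs, clabs))
# Assumed variation: integer positions via match() instead of a
# two-column character matrix.
idx <- cbind(match(ans$field1, rlabs), match(ans$field2, clabs))
wide[idx] <- ans$value

# Cross-check against the tapply() approach from the question.
ref <- with(DT, tapply(value, list(field1, field2), sum))
stopifnot(all(is.na(wide) == is.na(ref)), all(wide == ref, na.rm = TRUE))
```

Whether integer positions are actually faster than character indexing at 30 million rows would need to be measured on the real data.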

, ? ( , NA, ...)

rlabs <- sort(unique(A$field1))
clabs <- sort(unique(A$field2))
B <- matrix(NA,length(rlabs),length(clabs),
      dimnames=list(rlabs,clabs))
B[as.matrix(A[,1:2])] <- A[,3]

, , , value...
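
A small illustration of the indexing used in the code above, with made-up names (`m`, `keys`, `dup`). It shows that a two-column character matrix is matched against the dimnames, and that a repeated (row, column) pair is overwritten rather than summed, so the data must be aggregated first if pairs can repeat.

```r
m <- matrix(NA_real_, 2, 2, dimnames = list(c("a", "b"), c("X", "Y")))

# A two-column character matrix is matched against the dimnames,
# one (row name, column name) pair per row.
keys <- cbind(c("a", "b"), c("Y", "X"))
m[keys] <- c(1, 2)
# m["a", "Y"] is now 1, m["b", "X"] is 2, and the other cells stay NA.

# With a repeated pair, the later assignment overwrites the earlier one;
# nothing is summed.
dup <- cbind(c("a", "a"), c("X", "X"))
m[dup] <- c(10, 20)
# m["a", "X"] is now 20, not 30.
```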


Source: https://habr.com/ru/post/1540817/

