Is aggregate less efficient than loops?

I tried to run this operation on a large table X to count the rows for each combination of a and b:

Y <- aggregate(c ~ a+b,X,length)

It ran forever (I stopped it after 30 minutes), although RAM usage stayed within limits.

Then I tried looping manually over the values of b and aggregating only on a (technically still aggregating over b, but with a single value of b each time):

sub_agg <- list()
unique_bs <- unique(X$b)
for (b_it in unique_bs) {
  sub_agg[[length(sub_agg) + 1]] <- aggregate(c ~ a + b, subset(X, b == b_it), length)
}
Y <- do.call(rbind, sub_agg)

And it finished in 3 minutes.

I could go even further and get rid of aggregate entirely, doing the operations only on subsets.
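As a rough sketch of what such a subset-only version could look like: the toy data frame X_toy and the names X_b and Y_loop are made up for illustration, and this is not tuned for speed.

```r
# Hypothetical toy data standing in for X (far smaller than the real table)
set.seed(1)
X_toy <- data.frame(
  a = sample(letters[1:3], 20, replace = TRUE),
  b = sample(1:2, 20, replace = TRUE),
  c = 1,
  stringsAsFactors = FALSE
)

# Nested loops: subset on b first, then count rows for each value of a
res <- list()
for (b_it in unique(X_toy$b)) {
  X_b <- X_toy[X_toy$b == b_it, ]
  for (a_it in unique(X_b$a)) {
    res[[length(res) + 1]] <- data.frame(
      a = a_it,
      b = b_it,
      c = sum(X_b$a == a_it),  # number of rows with this (a, b) pair
      stringsAsFactors = FALSE
    )
  }
}
Y_loop <- do.call(rbind, res)
```

Only combinations that actually occur end up in Y_loop, which matches what aggregate returns.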

Is aggregate less efficient than nested loops and operations on subsets, or is this a special case?

Some details about the data:

X has 20 million rows

50 distinct values of b

15 000 distinct values of a


, , , :

  • , . aggregate . , aggregate , O (n) .
  • aggregate expand.grid , a b. aggregate.data.frame. , .
  • edit: , .

A better solution avoids aggregate altogether. To get Y, use table:

thecounts <- with(X, table(a,b))
Y <- as.data.frame(thecounts)
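
One difference worth knowing: table also reports combinations of a and b that never occur (with a count of 0), while aggregate only returns combinations present in the data. A small sketch on made-up toy data (X_toy and the other names here are illustrative, not from the original post) that compares the two:

```r
# Hypothetical toy data with the same column layout as X
set.seed(42)
X_toy <- data.frame(
  a = factor(sample(1:3, 30, replace = TRUE)),
  b = factor(sample(1:2, 30, replace = TRUE)),
  c = 1
)

Y_agg <- aggregate(c ~ a + b, X_toy, length)       # counts in column c
Y_tab <- as.data.frame(with(X_toy, table(a, b)))   # counts in column Freq

# Drop the zero-count combinations that only table() reports
Y_tab_nz <- Y_tab[Y_tab$Freq > 0, ]

# Line the two results up on (a, b) and compare the counts
cmp <- merge(Y_agg, Y_tab_nz, by = c("a", "b"))
same_counts <- nrow(cmp) == nrow(Y_agg) && all(cmp$c == cmp$Freq)
```

If the two approaches agree, same_counts should be TRUE.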

In the benchmark, this is about 68 times faster than the loop over aggregate.

Benchmark:

        test replications elapsed relative 
1  aggloop()            1   15.03   68.318 
2 tableway()            1    0.22    1.000 

Test setup:

nrows <- 20e5

X <- data.frame(
  a = factor(sample(seq_len(15e2), nrows, replace = TRUE)),
  b = factor(sample(seq_len(50), nrows, replace = TRUE)),
  c = 1
)

aggloop <- function(){
  sub_agg <- list()
  unique_bs <- unique(X$b)
  for (b_it in unique_bs){
    sub_agg[[length(sub_agg)+1]] <- aggregate(c ~ a + b,subset(X, b == b_it),length)
  }
  Y <- do.call(rbind, sub_agg )
}

tableway <- function(){
  thecounts <- with(X, table(a,b))
  Y <- as.data.frame(thecounts)
}

library(rbenchmark)

benchmark(aggloop(),
          tableway(),
          replications = 1
          )

@JorisMeys , , , - , .

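Given a data.table DT, the general form of data.table syntax is DT[i, j, by], which reads as "take DT, subset rows using i, then calculate j, grouped by by". So, to count the rows for each combination of a and b in X: X[, .N, by=c("a", "b")].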
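
To make the DT[i, j, by] form concrete, here is a minimal sketch on made-up data (DT_toy and the result names are illustrative only). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table; a and b play the same role as in X
set.seed(42)
DT_toy <- data.table(
  a = sample(1:3, 30, replace = TRUE),
  b = sample(1:2, 30, replace = TRUE)
)

# j = .N counts the rows in each group defined by `by`
counts <- DT_toy[, .N, by = c("a", "b")]

# Adding i restricts the rows first: counts per a, only where b == 1
counts_b1 <- DT_toy[b == 1, .N, by = a]
```

The result has one row per observed combination, with the count in a column named N.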

data.table .

Here is the data.table way, using the same X as in JorisMeys' benchmark:

library(data.table)
X2 <- copy(X) # taking a copy of X so the conversion to data.table does not impact the initial data

dtway <- function(){
  setDT(X2)[, .N, by=c("a", "b")] # setDT converts X2 to a data.table by reference
}

library(rbenchmark)
benchmark(aggloop(),
          tableway(),
          dtway(),
          replications = 1)

        # test replications elapsed relative
# 1  aggloop()            1   17.29  192.111
# 3    dtway()            1    0.09    1.000
# 2 tableway()            1    0.27    3.000

Note: for X, data.table takes roughly 1/2.5 to 1/3.5 of the time taken by base R table.


Source: https://habr.com/ru/post/1671616/

