Should data.table CJ continue to accept arguments with duplicate elements?

(EDIT: @Arun cleared and fixed the code below and replaced โ€œCJโ€ in data.table with R-Forge. Function request: # 4849; Faster CJ and updated to Why is expand.grid faster than CJ for data.table? )

In my mind, CJ designed to work with argument vectors that satisfy anyDuplicated(vector) == F

Does anyone use it with unique arguments?

If so, is it worth aligning the 100x number of hits on what I consider to be the main?

Lower bound for speed improvement by tuning for initial use (lower bound for these sample arguments, not lower bound for any arguments):

Comparison:

 Unit: milliseconds expr min lq median uq max neval dt1 <- CJ(a, b, c) 3149.15293 3166.59638 3204.95956 3472.70826 4414.919 100 dt2 <- fastCJ(a, b, c) 22.85207 23.16003 23.43691 24.04215 1208.855 100 Output identical: TRUE 

the code:

 library(microbenchmark) library(data.table) repTE <- function(x, times, each) { rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times) } fastCJ <- function(...) { arg_list <- list(...) l <- lapply(arg_list, sort.int, method="quick") seq_ct <- length(l) if (seq_ct > 1) { seq_lens <- vapply(l, length, numeric(1)) tot_len <- prod(seq_lens) l <-lapply( seq_len(seq_ct), function(i) { if (i==1) { len <- seq_lens[1] rep.int(l[[1]], times=rep.int(tot_len/len, len)) } else if (i < seq_ct) { pre_len <- prod(seq_lens[1:(i - 1)]) repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i]) } else { rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct]) } } ) } else { tot_len <- length(l[[1]]) } setattr(l, "row.names", .set_row_names(tot_len)) setattr(l, "class", c("data.table", "data.frame")) if (is.null(names <- names(arg_list))) { names <- vector("character", seq_ct) } if (any(tt <- names == "")) { names[tt] <- paste0("V", which(tt)) } setattr(l, "names", names) data.table:::settruelength(l, 0L) l <- alloc.col(l) setattr(l, "sorted", names(l)) return(l) } a <- factor(sample(1:1000, 1000)) b <- sample(letters, 26) c <- runif(100) print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c))) cat("Output identical:", identical(dt1, dt2)) 
+4
source share
1 answer

For completeness, to add an answer, NEWS :

CJ () is 90% faster on 1-6 lines (for example), # 4849. Inputs are now sorted first before merging, not after merging and use rep.int instead of rep (thanks to Sean Garborg for ideas, code and benchmark) and are only sorted, if is.unsorted (), # 2321.

Reminder: CJ = Cross Join; those. connects to all combinations of its inputs.

+1
source

Source: https://habr.com/ru/post/1494365/


All Articles