Should data.table CJ continue to accept arguments with duplicate elements?

Question

Should data.table CJ continue to accept arguments with duplicate elements?

(EDIT: @Arun cleared and fixed the code below and replaced “CJ” in data.table with R-Forge. Function request: # 4849; Faster CJ and updated to Why is expand.grid faster than CJ for data.table? )

In my mind, CJ designed to work with argument vectors that satisfy anyDuplicated(vector) == F

Does anyone use it with unique arguments?

If so, is it worth aligning the 100x number of hits on what I consider to be the main?

Lower bound for speed improvement by tuning for initial use (lower bound for these sample arguments, not lower bound for any arguments):

Comparison:

 Unit: milliseconds expr min lq median uq max neval dt1 <- CJ(a, b, c) 3149.15293 3166.59638 3204.95956 3472.70826 4414.919 100 dt2 <- fastCJ(a, b, c) 22.85207 23.16003 23.43691 24.04215 1208.855 100 Output identical: TRUE

the code:

 library(microbenchmark) library(data.table) repTE <- function(x, times, each) { rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times) } fastCJ <- function(...) { arg_list <- list(...) l <- lapply(arg_list, sort.int, method="quick") seq_ct <- length(l) if (seq_ct > 1) { seq_lens <- vapply(l, length, numeric(1)) tot_len <- prod(seq_lens) l <-lapply( seq_len(seq_ct), function(i) { if (i==1) { len <- seq_lens[1] rep.int(l[[1]], times=rep.int(tot_len/len, len)) } else if (i < seq_ct) { pre_len <- prod(seq_lens[1:(i - 1)]) repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i]) } else { rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct]) } } ) } else { tot_len <- length(l[[1]]) } setattr(l, "row.names", .set_row_names(tot_len)) setattr(l, "class", c("data.table", "data.frame")) if (is.null(names <- names(arg_list))) { names <- vector("character", seq_ct) } if (any(tt <- names == "")) { names[tt] <- paste0("V", which(tt)) } setattr(l, "names", names) data.table:::settruelength(l, 0L) l <- alloc.col(l) setattr(l, "sorted", names(l)) return(l) } a <- factor(sample(1:1000, 1000)) b <- sample(letters, 26) c <- runif(100) print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c))) cat("Output identical:", identical(dt1, dt2))

+4

r data.table

garborg Jul 30 '13 at 19:30

source share

1 answer

Matt dowle · Accepted Answer · 2013-09-03T18:02:11+0000

For completeness, to add an answer, NEWS :

CJ () is 90% faster on 1-6 lines (for example), # 4849. Inputs are now sorted first before merging, not after merging and use rep.int instead of rep (thanks to Sean Garborg for ideas, code and benchmark) and are only sorted, if is.unsorted (), # 2321.
Reminder: CJ = Cross Join; those. connects to all combinations of its inputs.

Should data.table CJ continue to accept arguments with duplicate elements?

Comparison:

the code:

More articles: