(EDIT: @Arun cleared and fixed the code below and replaced โCJโ in data.table with R-Forge. Function request: # 4849; Faster CJ and updated to Why is expand.grid faster than CJ for data.table? )
In my mind, CJ designed to work with argument vectors that satisfy anyDuplicated(vector) == F
Does anyone use it with unique arguments?
If so, is it worth aligning the 100x number of hits on what I consider to be the main?
Lower bound for speed improvement by tuning for initial use (lower bound for these sample arguments, not lower bound for any arguments):
Comparison:
Unit: milliseconds expr min lq median uq max neval dt1 <- CJ(a, b, c) 3149.15293 3166.59638 3204.95956 3472.70826 4414.919 100 dt2 <- fastCJ(a, b, c) 22.85207 23.16003 23.43691 24.04215 1208.855 100 Output identical: TRUE
the code:
library(microbenchmark) library(data.table) repTE <- function(x, times, each) { rep.int(rep.int(x, times=rep.int(each, times=length(x))), times=times) } fastCJ <- function(...) { arg_list <- list(...) l <- lapply(arg_list, sort.int, method="quick") seq_ct <- length(l) if (seq_ct > 1) { seq_lens <- vapply(l, length, numeric(1)) tot_len <- prod(seq_lens) l <-lapply( seq_len(seq_ct), function(i) { if (i==1) { len <- seq_lens[1] rep.int(l[[1]], times=rep.int(tot_len/len, len)) } else if (i < seq_ct) { pre_len <- prod(seq_lens[1:(i - 1)]) repTE(l[[i]], times=pre_len, each=tot_len/pre_len/seq_lens[i]) } else { rep.int(l[[seq_ct]], times=tot_len/seq_lens[seq_ct]) } } ) } else { tot_len <- length(l[[1]]) } setattr(l, "row.names", .set_row_names(tot_len)) setattr(l, "class", c("data.table", "data.frame")) if (is.null(names <- names(arg_list))) { names <- vector("character", seq_ct) } if (any(tt <- names == "")) { names[tt] <- paste0("V", which(tt)) } setattr(l, "names", names) data.table:::settruelength(l, 0L) l <- alloc.col(l) setattr(l, "sorted", names(l)) return(l) } a <- factor(sample(1:1000, 1000)) b <- sample(letters, 26) c <- runif(100) print(microbenchmark( dt1 <- CJ(a, b, c), dt2 <- fastCJ(a, b, c))) cat("Output identical:", identical(dt1, dt2))