I just noticed the following:
set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace = TRUE)
vec_fac <- factor(vec)
vec_int <- as.integer(factor(vec))
library(microbenchmark)
microbenchmark(vec=="b", vec_fac=="b", vec_int==2, vec_fac==2)
The result surprised me:
Unit: microseconds
           expr      min        lq      mean   median       uq       max neval
     vec == "b" 2397.150 2406.5925 2499.5715 2470.637 2532.628  2881.588   100
 vec_fac == "b" 5706.932 5765.4340 6137.5441 6032.696 6401.567  8889.446   100
   vec_int == 2  510.714  541.0935  623.8341  580.506  743.695   845.305   100
   vec_fac == 2 5703.237 5772.6185 6339.6577 5975.015 6378.577 31502.869   100
I would have expected factors to be much more efficient than a plain character vector, but that is not the case. (Of course, vec_fac and vec_int take up half as much memory as vec.)
Why are factors not as efficient as integer vectors?
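For reference, here is a minimal sketch of what I would have expected a factor comparison to amount to: look up the integer code of the level once, then compare the underlying codes. The names code_b and res_codes are my own, introduced only for this illustration. It also shows that vec_fac == 2 is not equivalent to vec_int == 2, since base R's Ops.factor compares the level labels, not the codes.

```r
# Same data as above
set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace = TRUE)
vec_fac <- factor(vec)

# Look up the integer code of level "b" once (hypothetical helper name)
code_b <- match("b", levels(vec_fac))

# Compare the underlying integer codes instead of the factor itself
res_codes <- as.integer(vec_fac) == code_b

# Both approaches select the same elements as the character comparison
stopifnot(all(res_codes == (vec == "b")))
stopifnot(all((vec_fac == "b") == (vec == "b")))

# A factor compared with a number is compared by label ("a", "b", "c" vs "2"),
# so nothing matches here
stopifnot(!any(vec_fac == 2))
```

Adding as.integer(vec_fac) == code_b to the microbenchmark() call above would show how this variant compares on a given machine; I am not quoting timings for it here.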