Why are factor vectors less efficient than integer or even character vectors?

Question

I just noticed the following:

set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace = TRUE)
vec_fac <- factor(vec)
vec_int <- as.integer(factor(vec))

library(microbenchmark)
microbenchmark(vec=="b", vec_fac=="b", vec_int==2, vec_fac==2)

The result surprised me:

Unit: microseconds
           expr      min        lq      mean   median       uq       max neval
     vec == "b" 2397.150 2406.5925 2499.5715 2470.637 2532.628  2881.588   100
 vec_fac == "b" 5706.932 5765.4340 6137.5441 6032.696 6401.567  8889.446   100
   vec_int == 2  510.714  541.0935  623.8341  580.506  743.695   845.305   100
   vec_fac == 2 5703.237 5772.6185 6339.6577 5975.015 6378.577 31502.869   100

I would have expected factors to be much more efficient than a plain character vector, but that is not the case. (Of course, vec_fac and vec_int take up about half as much memory as vec.)

Why are factors not as efficient as integer vectors?

+4

r

antoine-sac Oct 7 '15 at 12:35

1 answer

Thierry · Answer 1 · 2015-10-07T13:01:59+0000

Comparing a factor requires some conversion first. Take a look at the profiling below. Note that (levels(vec_fac) == "b")[vec_fac] is faster.

set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace = TRUE)
vec_fac <- factor(vec)
vec_int <- as.integer(factor(vec))

library(microbenchmark)
microbenchmark(
  (levels(vec_fac) == "b")[vec_fac],
  vec_int == 2, 
  vec == "b", 
  vec_fac == 2,
  vec_fac == "b"
)
Unit: microseconds
                              expr     min       lq      mean   median      uq     max neval   cld
 (levels(vec_fac) == "b")[vec_fac]  62.861  69.7030  74.20981  71.8410  73.552 131.280   100 a    
                      vec_int == 2  73.124  85.0970  89.96756  86.8070  87.877 125.721   100  b   
                      vec == "b" 129.569 133.8450 138.57510 134.7005 135.129 170.621   100   c  
                      vec_fac == 2 303.611 331.8340 348.90436 334.6135 337.820 482.783   100    d 
                  vec_fac == "b" 347.656 376.7335 393.01326 379.2990 381.224 577.715   100     e
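
The fastest expression works because the comparison is done once per level instead of once per element: levels(vec_fac) == "b" yields a logical vector with one entry per level, and indexing it with the factor picks entries by the factor's integer codes. A minimal sketch on a smaller vector (the names small, per_level, via_levels and direct are illustrative, not from the original post):

```r
set.seed(42)
small <- factor(sample(c("a", "b", "c"), 10, replace = TRUE))

# One comparison per level: a logical vector of length nlevels(small)
per_level <- levels(small) == "b"

# Indexing by a factor uses its integer codes, so this spreads the
# per-level result over every element
via_levels <- per_level[small]

# The direct comparison should give the same answer
direct <- small == "b"
stopifnot(length(per_level) == nlevels(small), all(via_levels == direct))
```

With only three levels, the expensive string comparison runs three times, and the rest is plain integer indexing.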

Profiling:

set.seed(42)
vec <- sample(c("a", "b", "c"), 1e8, replace = TRUE)
vec_fac <- factor(vec)
vec_int <- as.integer(vec_fac)

Rprof()
junk <- vec_int == 2
Rprof(NULL)
summaryRprof()

Rprof()
junk <- vec == "b"
Rprof(NULL)
summaryRprof()

Rprof()
junk <- vec_fac == "b"
Rprof(NULL)
summaryRprof()

Rprof()
junk <- vec_fac == 2
Rprof(NULL)
summaryRprof()
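
One way to read these profiles, offered as an assumption about base R rather than something stated above: == on a factor dispatches to Ops.factor, which, as far as can be told from current base R source, first expands the factor to its character labels and then runs an ordinary character comparison. Printing Ops.factor in your own session is the way to confirm this. The sketch below reproduces that expansion by hand; labels_chr is an illustrative name.

```r
set.seed(42)
vec_fac <- factor(sample(c("a", "b", "c"), 1e4, replace = TRUE))

# Manual version of the assumed conversion: one label per element
labels_chr <- levels(vec_fac)[vec_fac]

# Comparing the expanded labels matches comparing the factor directly
stopifnot(all((labels_chr == "b") == (vec_fac == "b")))

# Because labels are compared, vec_fac == 2 tests against "2",
# not against the integer code 2, so nothing matches here
stopifnot(!any(vec_fac == 2))

# To compare codes, drop to integers explicitly
stopifnot(all((as.integer(vec_fac) == 2) == (vec_fac == "b")))
```

If that reading is right, vec_fac == 2 in the benchmarks pays the same conversion cost as vec_fac == "b" while answering a different question from vec_int == 2.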
