Consider the following benchmark (R 3.4.1 on a Windows machine):
    library(rbenchmark)

    mtx <- matrix(runif(1e8), ncol = 100)
    df <- as.data.frame(mtx)
    colnames(mtx) <- colnames(df) <- paste0("V", 1:100)

    benchmark(
      mtx[5000:7000, 80],
      mtx[5000:7000, "V80"],
      mtx[, "V80"][5000:7000],
      mtx[, "V80", drop = FALSE][5000:7000, ],
      mtx[5000:7000, , drop = FALSE][, "V80"],
      #mtx$V80[5000:7000], # does not apply
      replications = 5000
    )
    ##                                      test replications elapsed relative user.self sys.self user.child sys.child
    ## 4 mtx[, "V80", drop = FALSE][5000:7000, ]         5000   64.71  588.273     47.44    16.61         NA        NA
    ## 3                 mtx[, "V80"][5000:7000]         5000   72.15  655.909     52.90    18.18         NA        NA
    ## 2                   mtx[5000:7000, "V80"]         5000    0.11    1.000      0.11     0.00         NA        NA
    ## 5 mtx[5000:7000, , drop = FALSE][, "V80"]         5000    7.47   67.909      5.89     1.47         NA        NA
    ## 1                      mtx[5000:7000, 80]         5000    0.13    1.182      0.12     0.00         NA        NA

    benchmark(
      df[5000:7000, 80],
      df[5000:7000, "V80"],
      df[, "V80"][5000:7000],
      df[, "V80", drop = FALSE][5000:7000, ],
      df[5000:7000, , drop = FALSE][, "V80"],
      df$V80[5000:7000],
      replications = 5000
    )
    ##                                     test replications elapsed relative user.self sys.self user.child sys.child
    ## 6                      df$V80[5000:7000]         5000    0.13    1.000      0.12     0.00         NA        NA
    ## 4 df[, "V80", drop = FALSE][5000:7000, ]         5000    0.33    2.538      0.33     0.00         NA        NA
    ## 3                 df[, "V80"][5000:7000]         5000    0.17    1.308      0.17     0.00         NA        NA
    ## 2                   df[5000:7000, "V80"]         5000    0.15    1.154      0.16     0.00         NA        NA
    ## 5 df[5000:7000, , drop = FALSE][, "V80"]         5000   13.63  104.846     12.91     0.39         NA        NA
    ## 1                      df[5000:7000, 80]         5000    0.19    1.462      0.17     0.00         NA        NA
The time differences are quite dramatic. Why is this? What is the recommended way to subset, and why? Judging by the benchmarks, mtx[i, colname] is the most time-efficient form for a matrix and df$colname[i] is the most time-efficient for a data.frame (although the choice seems to matter much less there). Are there any other common reasons to prefer one approach over another?
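For anyone who wants to poke at this without rbenchmark, here is a scaled-down sketch in base R. It uses 1e6 values instead of 1e8, and the names mtx_small, df_small and rows are made up for this example. It first checks that all variants return the same values, then prints the size of the object each matrix expression materialises first. My guess, which this does not prove, is that the slow variants pay for a large intermediate object; object.size only reports size, not whether a copy was actually made.

```r
# Scaled-down reproduction: 1e6 values instead of 1e8 (hypothetical names)
set.seed(1)
mtx_small <- matrix(runif(1e6), ncol = 100)
colnames(mtx_small) <- paste0("V", 1:100)
df_small <- as.data.frame(mtx_small)
rows <- 5000:7000

# Reference result: direct row/column indexing on the matrix
ref <- unname(mtx_small[rows, "V80"])

# Every variant from the benchmark should yield the same values
stopifnot(
  identical(unname(mtx_small[rows, 80]), ref),
  identical(unname(mtx_small[, "V80"][rows]), ref),
  identical(unname(mtx_small[, "V80", drop = FALSE][rows, ]), ref),
  identical(unname(mtx_small[rows, , drop = FALSE][, "V80"]), ref),
  identical(df_small[rows, 80], ref),
  identical(df_small[rows, "V80"], ref),
  identical(df_small[, "V80"][rows], ref),
  identical(df_small[, "V80", drop = FALSE][rows, ], ref),
  identical(df_small[rows, , drop = FALSE][, "V80"], ref),
  identical(df_small$V80[rows], ref)
)

# Size in bytes of the first object each matrix expression has to build
sizes <- c(
  full_column = as.numeric(object.size(mtx_small[, "V80"])),
  row_block   = as.numeric(object.size(mtx_small[rows, , drop = FALSE])),
  direct      = as.numeric(object.size(mtx_small[rows, "V80"]))
)
print(sizes)
```

In the original benchmark the full column is 1e6 doubles rather than the 1e4 here, so the gap between the full-column intermediate and the direct subset would be correspondingly larger.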