Time difference between different subset methods for data.frame and matrix objects

Consider the following benchmark (R 3.4.1 on a Windows machine):

    library(rbenchmark)

    mtx <- matrix(runif(1e8), ncol = 100)
    df <- as.data.frame(mtx)
    colnames(mtx) <- colnames(df) <- paste0("V", 1:100)

    benchmark(
      mtx[5000:7000, 80],
      mtx[5000:7000, "V80"],
      mtx[, "V80"][5000:7000],
      mtx[, "V80", drop = FALSE][5000:7000, ],
      mtx[5000:7000, , drop = FALSE][, "V80"],
      #mtx$V80[5000:7000], # does not apply
      replications = 5000
    )
    ##                                      test replications elapsed relative user.self sys.self user.child sys.child
    ## 4 mtx[, "V80", drop = FALSE][5000:7000, ]         5000   64.71  588.273     47.44    16.61         NA        NA
    ## 3                 mtx[, "V80"][5000:7000]         5000   72.15  655.909     52.90    18.18         NA        NA
    ## 2                   mtx[5000:7000, "V80"]         5000    0.11    1.000      0.11     0.00         NA        NA
    ## 5 mtx[5000:7000, , drop = FALSE][, "V80"]         5000    7.47   67.909      5.89     1.47         NA        NA
    ## 1                      mtx[5000:7000, 80]         5000    0.13    1.182      0.12     0.00         NA        NA

    benchmark(
      df[5000:7000, 80],
      df[5000:7000, "V80"],
      df[, "V80"][5000:7000],
      df[, "V80", drop = FALSE][5000:7000, ],
      df[5000:7000, , drop = FALSE][, "V80"],
      df$V80[5000:7000],
      replications = 5000
    )
    ##                                     test replications elapsed relative user.self sys.self user.child sys.child
    ## 6                      df$V80[5000:7000]         5000    0.13    1.000      0.12     0.00         NA        NA
    ## 4 df[, "V80", drop = FALSE][5000:7000, ]         5000    0.33    2.538      0.33     0.00         NA        NA
    ## 3                 df[, "V80"][5000:7000]         5000    0.17    1.308      0.17     0.00         NA        NA
    ## 2                   df[5000:7000, "V80"]         5000    0.15    1.154      0.16     0.00         NA        NA
    ## 5 df[5000:7000, , drop = FALSE][, "V80"]         5000   13.63  104.846     12.91     0.39         NA        NA
    ## 1                      df[5000:7000, 80]         5000    0.19    1.462      0.17     0.00         NA        NA

The time difference is quite dramatic. Why is this? What is the recommended way to subset, and why? Judging by the benchmarks, mtx[i, colname] for a matrix and df$colname[i] for a data.frame (where the choice seems to matter much less) appear to be the most time-efficient, but are there general reasons to prefer one approach over the others?

1 answer

The main reason lies in the data structures R uses behind matrices and data.frames. A matrix is basically an object holding number of rows x number of columns values (mostly numeric; a base R matrix is dense, not sparse, by default) together with a dimension attribute. For this reason, your first two commands

    mtx[5000:7000, 80]
    mtx[5000:7000, "V80"]

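index the matrix in a single step. Whenever the result keeps its dimensions (for example with drop = FALSE), R has to assign not only the values but also the dimension attribute, creating a new matrix object instead of a simple vector, which is R's default object type. With the default drop = TRUE, a single-column result such as these two is simplified to a plain vector.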
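A minimal sketch of the matrix layout, using a toy 4 x 3 matrix (the name `m` is illustrative and not part of the question): a matrix is an atomic vector carrying a `dim` attribute, and the `drop` argument decides whether a single-column subset keeps that attribute.

```r
# Toy matrix; `m` is a hypothetical stand-in for the question's `mtx`.
m <- matrix(1:12, ncol = 3, dimnames = list(NULL, c("V1", "V2", "V3")))

attributes(m)$dim                       # 4 3: the dimension attribute
is.matrix(m[2:3, "V2"])                 # FALSE: default drop = TRUE gives a vector
is.matrix(m[2:3, "V2", drop = FALSE])   # TRUE: the dimensions are kept
dim(m[2:3, "V2", drop = FALSE])         # 2 1
```

Checking `is.matrix()` on an intermediate result is a quick way to see which kind of object a chained subset is actually operating on.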

A data.frame in R, on the other hand, is by definition a special type of list in which every column must have the same length, while the columns themselves may hold different types of values (numeric, character, etc.). A matrix can hold only one type of value, which by default is the most general type needed to represent all of its elements. Therefore,

    df[5000:7000, 80]

retrieves the vector stored as the 80th column and then takes the values at positions 5000 to 7000 from it. A vector is much easier for R to process than a matrix object, and is therefore much faster.
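The list nature of a data.frame can be checked directly. This is a small sketch on toy data (the name `d` and its columns are made up for illustration), showing that a single-column subset is a plain vector and that a matrix, unlike a data.frame, coerces mixed input to one type.

```r
# Toy data; `d` is a hypothetical stand-in for the question's `df`.
d <- data.frame(V1 = 1:4, V2 = c(0.5, 1.5, 2.5, 3.5), V3 = letters[1:4])

is.list(d)                   # TRUE: a data.frame is a list of equal-length columns
sapply(d, class)             # each column keeps its own type

col <- d[2:3, "V2"]          # single column: the result is a plain vector
is.vector(col)               # TRUE
identical(col, d$V2[2:3])    # TRUE: same values as extracting the column first

# A matrix holds one type only, so mixed input is coerced to the most general one.
typeof(cbind(1:2, c("a", "b")))   # "character"
```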

If you specify drop = FALSE, you force R not to work with a simple vector when you select the 80th column, but to treat the result as a data.frame / list object instead. Lists are the most general and flexible type of R object, since there are no restrictions on their size or their entries, but this comes at the price that they are harder and more laborious to process, as you can see when comparing

    mtx[5000:7000, , drop = FALSE][, "V80"]
    df[5000:7000, , drop = FALSE][, "V80"]

From the data.frame you get another data.frame / list object as the intermediate result, whereas the matrix still returns a matrix, which is processed faster than a list.
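A scaled-down sketch of that comparison, assuming a 100 x 20 toy matrix (the names `m` and `d` are illustrative, and the sizes are far smaller than in the question). It confirms the class of each intermediate object and times the two chained subsets with base R only; the absolute timings depend on the machine and are not meant to reproduce the benchmark above.

```r
# Toy objects; `m` and `d` are hypothetical stand-ins for `mtx` and `df`.
m <- matrix(runif(2000), ncol = 20, dimnames = list(NULL, paste0("V", 1:20)))
d <- as.data.frame(m)

class(m[10:20, , drop = FALSE])   # "matrix" (plus "array" in R >= 4.0)
class(d[10:20, , drop = FALSE])   # "data.frame"

# Both routes yield the same values; only the intermediate object differs.
all.equal(m[10:20, , drop = FALSE][, "V8"],
          d[10:20, , drop = FALSE][, "V8"])

# Rough timing of each chained subset, repeated 1000 times.
system.time(for (i in 1:1000) m[10:20, , drop = FALSE][, "V8"])
system.time(for (i in 1:1000) d[10:20, , drop = FALSE][, "V8"])
```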


Source: https://habr.com/ru/post/1271729/

