Does it matter for computational speed whether I pass a whole data frame to a function or just a few of its columns?

Question


Suppose I have a data frame with many columns:

ncol = 40
sample_size = 300

my_matrix <- replicate(ncol, runif(sample_size, 0, 3))
my_df <- data.frame(my_matrix)
names(my_df) <- paste0("x", 1:ncol)
epsilon <- rnorm(sample_size, 0, 0.2) 
my_df$y <- 1+3*my_df$x1 + epsilon

I pass the data frame to a function that needs only three of its columns (in my real code a function may use more than three columns, but I am keeping things simple here):

library(ggplot2)

idle_plotter <- function(dataframe, x_string, y_string, color_string){
    p <- ggplot(dataframe, aes_string(x = x_string, y = y_string, color = color_string)) +
        geom_point()
    print(p)
}

Does it matter in terms of speed whether I pass all of my_df to idle_plotter or only the three columns that idle_plotter needs? If the entire data frame is copied during the call, I assume it does, but if R passes by reference, it should not. In my tests it does not seem to matter, but I need to know whether:

this is a general rule, in which case I can continue to pass whole data frames to functions.


parameter-passing pass-by-reference r

DeltaIV 28 . '17 16:50


Christoph · Accepted Answer · 2017-04-28T18:05:55+0000


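# Variant 1: pass the whole data frame plus the names of the columns to plot
idle_plotter_df <- function(dataframe, x_string, y_string, color_string){
    p <- ggplot(dataframe, aes_string(x = x_string, y = y_string, color = color_string)) +
        geom_point()
    print(p)
}

# Variant 2: pass no data frame; each string is an expression such as
# "my_df$x1" that is evaluated to fetch a single column
idle_plotter_col <- function(x_string, y_string, color_string){
  p <- ggplot(NULL) + aes_string(x = x_string, y = y_string, color = color_string) +
    geom_point()
  print(p)
}

# Time both variants; every call also builds and prints a full plot
microbenchmark::microbenchmark(
  idle_plotter_df(my_df, "x1", "x2", "x3"),
  idle_plotter_col("my_df$x1", "my_df$x2", "my_df$x3"), times = 10L)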

Unit: milliseconds
                                                 expr      min       lq     mean   median       uq      max neval
             idle_plotter_df(my_df, "x1", "x2", "x3") 168.8718 260.0504 265.3658 270.8738 272.5409 323.3371    10
 idle_plotter_col("my_df$x1", "my_df$x2", "my_df$x3") 264.6850 276.4981 293.8205 284.9820 300.3936 356.9910    10
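The timings above are consistent with R's copy-on-modify semantics: an argument is not duplicated when it is passed, only when the function first writes to it. A minimal sketch of how one might check this with base R's tracemem(), assuming an R build with memory profiling enabled (the functions read_only and modifies and the small data frame df are made up for this illustration and are not part of the original answer):

```r
# Sketch only: check whether a call copies its data frame argument.
# Assumes R was built with memory profiling (see capabilities("profmem")).
stopifnot(capabilities("profmem"))

# Hypothetical small data frame, not the my_df from the question
df <- data.frame(x1 = runif(300), x2 = runif(300))

# Hypothetical helper: only reads its argument.
# tracemem() returns a string with the address of the object it is given.
read_only <- function(d) {
  tracemem(d)
}

# Hypothetical helper: writes to its argument before reporting the address.
modifies <- function(d) {
  d$x1 <- 0        # the first write makes R duplicate the data frame
  tracemem(d)
}

addr_outside  <- tracemem(df)     # address of df in the calling environment
addr_read     <- read_only(df)    # address seen inside a read-only function
addr_modified <- modifies(df)     # also prints a tracemem[...] copy message

# Same address: passing df did not copy it
same_when_read <- identical(addr_outside, addr_read)

# Different address: the copy happened only because of the write
same_when_modified <- identical(addr_outside, addr_modified)

print(c(read = same_when_read, modified = same_when_modified))
untracemem(df)
```

If this holds on your system, a function that only reads the data frame, like idle_plotter, should pay no copying cost for the columns it never touches.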
