Lapply vs for loop - Performance R

It is often said that it is better to use lapply over for loops. There are some exceptions, as Hadley Wickham points out in his book Advanced R.

( http://adv-r.had.co.nz/Functionals.html ) (modifying in place, recursion, etc.). The following is one of these cases.

Just for the sake of learning, I tried to rewrite the perceptron algorithm in functional form to compare relative performance. Source: ( https://rpubs.com/FaiHas/197581 ).

Here is the code.

    # prepare input
    data(iris)
    irissubdf <- iris[1:100, c(1, 3, 5)]
    names(irissubdf) <- c("sepal", "petal", "species")
    head(irissubdf)
    irissubdf$y <- 1
    irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
    x <- irissubdf[, c(1, 2)]
    y <- irissubdf[, 4]

    # perceptron function with for
    perceptron <- function(x, y, eta, niter) {
        # initialize weight vector
        weight <- rep(0, dim(x)[2] + 1)
        errors <- rep(0, niter)
        # loop over number of epochs niter
        for (jj in 1:niter) {
            # loop through training data set
            for (ii in 1:length(y)) {
                # Predict binary label using Heaviside activation function
                z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
                if (z < 0) {
                    ypred <- -1
                } else {
                    ypred <- 1
                }
                # Change weight - the formula doesn't do anything
                # if the predicted value is correct
                weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
                weight <- weight + weightdiff
                # Update error function
                if ((y[ii] - ypred) != 0) {
                    errors[jj] <- errors[jj] + 1
                }
            }
        }
        # weight to decide between the two species
        return(errors)
    }

    err <- perceptron(x, y, 1, 10)

    ### my rewriting in functional form: auxiliary function
    faux <- function(x, weight, y, eta) {
        err <- 0
        z <- sum(weight[2:length(weight)] * as.numeric(x)) + weight[1]
        if (z < 0) {
            ypred <- -1
        } else {
            ypred <- 1
        }
        # Change weight - the formula doesn't do anything
        # if the predicted value is correct
        weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
        weight <<- weight + weightdiff
        # Update error function
        if ((y - ypred) != 0) {
            err <- 1
        }
        err
    }

    weight <- rep(0, 3)
    weightdiff <- rep(0, 3)

    f <- function() {
        t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), function(i) {
            faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 1)
        }))))
        weight <<- rep(0, 3)
        t
    }
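As a quick sanity check (assuming the code above has been run), both versions should report the same number of misclassifications per epoch, since they walk through the data in the same order:

    # 'weight' must be zero before calling f(); f() resets it again when done.
    weight  <- rep(0, 3)
    err_for <- perceptron(x, y, 1, 10)
    err_fun <- f()
    all.equal(err_for, as.numeric(err_fun))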

I did not expect any consistent improvement because of the issues above. Nevertheless, I was very surprised to see a sharp slowdown using lapply and replicate .

I got these results using the microbenchmark function from the microbenchmark library:

What could be the reason? Could this be a memory leak?

                                                           expr       min         lq       mean     median         uq        max neval
                                                            f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 109715.673   100
     perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)   4184.131  4437.2990  4686.7506  4532.6655  4751.4795   6513.684   100
    perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)     95.793   104.2045   123.7735   116.6065   140.5545    264.858   100

The first function is the lapply / replicate function

The second is a function with for loops

The third is the same function in C++ using Rcpp
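For reference, a sketch of how such a benchmark could be set up with microbenchmark (the exact call is not shown here, and the Rcpp version perceptronC is not defined in this post, so it is omitted):

    library(microbenchmark)
    # Compare the lapply/replicate version with the for-loop version;
    # perceptronC(...) would be added once the Rcpp function is compiled.
    microbenchmark(
        f(),
        perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10),
        times = 100
    )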

Here is the profiling of the function, as suggested by Roland. I am not sure I can interpret it correctly, but it looks like most of the time is spent in subsetting.

1 answer

First of all, it is a long-debunked myth that for loops are slower than lapply . The for loops in R have been made a lot more performant and are currently at least as fast as lapply .
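As a minimal illustration of that point (a sketch, not part of the benchmarks above): a preallocated for loop and lapply applied to the same trivial task usually end up in the same ballpark.

    library(microbenchmark)
    vals <- rnorm(1e4)
    microbenchmark(
        for_loop = { out <- numeric(length(vals))
                     for (i in seq_along(vals)) out[i] <- vals[i]^2 },
        lapply   = unlist(lapply(vals, function(v) v^2)),
        times = 50
    )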

That said, you should rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a good reason not to consider lapply .

lapply is a function you should use for its side effects (or lack of side effects). The function lapply automatically combines the results in a list and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate . See also this question:

Is R's apply family more than syntactic sugar?
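A tiny illustration of that difference (an assumed example, not taken from the question):

    # lapply hands the results back as a list and leaves the calling
    # environment alone; a for loop has to write into it instead.
    squares_list <- lapply(1:3, function(i) i^2)   # list(1, 4, 9); nothing else created

    squares_for <- numeric(3)
    for (i in 1:3) squares_for[i] <- i^2           # 'squares_for' and 'i' now live in the environment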

The reason your lapply solution is much slower is that your way of using it creates a lot more overhead.

  • replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate .
  • inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that - contrary to your for loop - e.g. the $ function has to be called every time (see the sketch after this list).
  • Because you use these higher-level functions, your lapply solution calls 49 functions, compared to your for solution, which calls only 26. These extra functions for the lapply solution include calls to functions like match , structure , [[ , names , %in% , sys.call , duplicated , ... all functions not needed by your for loop, as that one doesn't do any of these checks.
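To illustrate the second point, here is a hypothetical variant of f() (an assumption, not code from the question) that pulls the predictor columns out of the data frame once, so the anonymous function only indexes a plain numeric matrix and vector. The global assignment to weight inside faux is still needed, so the fundamental design problem remains.

    # Hypothetical f2(): avoids repeated data-frame subsetting inside lapply
    xmat <- as.matrix(irissubdf[, 1:2])
    yvec <- irissubdf$y

    f2 <- function() {
        t <- replicate(10, sum(unlist(lapply(seq_along(yvec), function(i) {
            faux(xmat[i, ], weight, yvec[i], 1)
        }))))
        weight <<- rep(0, 3)
        t
    }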

If you want to see where this extra overhead comes from, look at the internal code of replicate , unlist , sapply and simplify2array .
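For example, printing the function bodies in an R session shows the chain directly:

    body(replicate)   # a thin wrapper: sapply(integer(n), eval.parent(substitute(function(...) expr)), ...)
    body(sapply)      # calls lapply(X, FUN, ...) and then simplify2array() unless simplify is FALSE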

You can use the following code to get a better picture of where you are losing performance with lapply . Run this line by line!

    Rprof(interval = 0.0001)
    f()
    Rprof(NULL)
    fprof <- summaryRprof()$by.self

    Rprof(interval = 0.0001)
    perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
    Rprof(NULL)
    perprof <- summaryRprof()$by.self

    fprof$Fun <- rownames(fprof)
    perprof$Fun <- rownames(perprof)

    Selftime <- merge(fprof, perprof, all = TRUE, by = 'Fun',
                      suffixes = c(".lapply", ".for"))

    sum(!is.na(Selftime$self.time.lapply))
    sum(!is.na(Selftime$self.time.for))
    Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
             c("Fun", "self.time.lapply", "self.time.for")]
    Selftime[is.na(Selftime$self.time.for), ]
