As mentioned in my comment, `for` loops in R are not particularly slow. Often, however, a `for` loop is a symptom of another inefficiency in the code. In this case, the subset operation repeated for each row to compute the group mean is most likely the slowest part:
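For concreteness, the snippets below can be read against a small data frame like this one (hypothetical example data, not from the question; your `df` will differ):

```r
# Hypothetical example data: per-id observations with some missing rates.
# id 3 has only NAs, so its group mean is NaN -- the case the is.nan()
# checks further down guard against.
df <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3),
  rate = c(0.5, NA, 0.7, 0.9, NA, NA)
)
```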
```r
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = TRUE), 1)
    ...
```
If the group means are instead computed once up front, the loop only has to do a quick lookup:
```r
foo <- aggregate(df$rate, list(df$id), mean, na.rm = TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    ...
```
However, I am still doing a lookup with `df$id[i]` on every iteration. A better idea is to use one of the tools that implement the split-apply-combine strategy. In addition, let's write a function that takes the rows for a single `id` and a pre-computed group mean and does the right thing:
```r
myfun <- function(DF) {
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg
  return(DF)
}
```
The plyr version:
```r
library(plyr)
avgs <- ddply(df, .(id), summarise, rate = mean(rate, na.rm = TRUE))
result <- ddply(df, .(id), myfun)
```
And the data.table version is most likely faster:
```r
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm = TRUE), by = id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
```
Thus, we avoid all the per-row subsetting in favor of adding a pre-computed column, and the keyed lookups on groups of rows are fast and efficient. The extra column can be removed cheaply afterwards with:
```r
DT[, avg := NULL]
```
The whole shebang can be wrapped in a function or a single data.table expression. But, IMO, this often comes at the expense of clarity!
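As a sketch of that single data.table expression (using the same hypothetical `df` as above; this compact form is my assumption, not code from the original answer):

```r
library(data.table)

# Hypothetical example data
DT <- data.table(id   = c(1, 1, 2, 2, 3, 3),
                 rate = c(0.5, NA, 0.7, 0.9, NA, NA))

# One expression: impute each id's missing rates with the group mean,
# falling back to 1 when the whole group is NA (mean is NaN)
DT[, rate := {
  avg <- mean(rate, na.rm = TRUE)
  if (is.nan(avg)) avg <- 1
  ifelse(is.na(rate), avg, rate)
}, by = id]
```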