Applying mean imputation to a large subset of variables in R

Question


I have a data set with 498 variables of various kinds (numeric, logical, dates, and others), stored as a data frame in R with rows for observations and columns for variables. For a certain subset of these variables, I would like to replace the missing values with the mean of that variable.

I wrote this very simple function for mean imputation:

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

This works great if I apply it to a single variable, say dataset$variableA:

 dataset$variableA <- impute.mean(dataset$variableA)

This gives me exactly what I want for that one variable, but since I need to do this for a fairly large subset of variables, I would rather not go through each variable manually.

My first instinct was to use one of the apply functions in R to do this efficiently, but I don't seem to understand how.

My first, crude attempt used the standard apply:

 newdataset <- apply(dataset, 2, impute.mean)

This is obviously a bit crude, as it tries to apply the function to all columns, including variables that are not numeric. Still, it seemed like a reasonable starting place, even if it generated a few warnings. Alas, this approach did not work, and all my variables remained the same.
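A likely explanation, sketched on a hypothetical two-column toy data frame (the names toy, num, and chr are made up for illustration): apply() first converts the data frame to a matrix, and with mixed column types that matrix is character. mean() on a character vector returns NA with a warning, so impute.mean replaces each NA with NA, and the result is a character matrix rather than a data frame.

```r
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data: one numeric and one character column
toy <- data.frame(num = c(1, NA, 3), chr = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# apply() works on as.matrix(toy), which is a character matrix here
is.character(as.matrix(toy))

# mean() of a character vector is NA (with a warning), so nothing is imputed
result <- suppressWarnings(apply(toy, 2, impute.mean))
class(result)        # a matrix, no longer a data frame
anyNA(result[, 1])   # the missing value in the first column is still there
```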

I also experimented a bit with lapply, mapply, ddply, but without any success.

Ideally, I would like to do something like this:

 relevantVariables <- c("variableA1", "variableA2", ..., "variableA293")
 newdataset <- magical.apply(dataset, relevantVariables, impute.mean)

Is there any apply function that works this way?

Alternatively, is there another efficient way to do this?


Henrik Nordmark Jun 25 '13 at 12:53


2 answers

Max Ghenis · Answer 1 · 2014-03-29T19:51:36+0000

You can do this efficiently with the data.table package:

 SetNAsToMean <- function(dt, vars) {
   # Sets NA values of columns to the column means
   #
   # Args:
   #   dt: data.table object to work with
   #   vars: vector of column names to replace NAs
   #
   # Returns:
   #   Nothing. Alters data.table in place.
   #
   # Example:
   #   dt <- data.table(num1 = c(1, NA, 3),
   #                    num2 = c(NA, NA, 4),
   #                    char1 = rep("a", 3))
   #   SetNAsToMean(dt, c("num1", "num2"))
   #
   #   # Alternatively, set all numeric columns
   #   numerics <- which(lapply(dt, class) == "numeric")
   #   SetNAsToMean(dt, numerics)
   require(data.table)
   for (var in vars) {
     set(dt, which(is.na(dt[[var]])), var, mean(dt[[var]], na.rm = TRUE))
   }
 }

Vincent · Answer 2 · 2013-06-25T13:10:33+0000

Would this work for you?

 for (j in seq_len(ncol(dataset))) {
   if (is.numeric(dataset[, j])) {
     for (k in seq_len(nrow(dataset))) {
       if (is.na(dataset[k, j])) {
         dataset[k, j] <- mean(dataset[, j], na.rm = TRUE)
       }
     }
   }
 }
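For comparison, here is a minimal sketch of the same imputation without explicit loops, restricted to a chosen subset of columns as the question asks. It reuses impute.mean from the question; the toy data frame and its column names are hypothetical stand-ins for the real data. Because lapply() treats a data frame as a list of columns, each column keeps its own type, and assigning the result back into dataset[relevantVariables] touches only those columns.

```r
# Mean-imputation helper from the question
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data standing in for the real 498-column data frame
dataset <- data.frame(
  variableA1 = c(1, NA, 3),
  variableA2 = c(NA, 10, 20),
  label      = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# Hypothetical subset of columns to impute
relevantVariables <- c("variableA1", "variableA2")

# Impute only the selected columns; all other columns are left untouched
newdataset <- dataset
newdataset[relevantVariables] <- lapply(newdataset[relevantVariables], impute.mean)

# Optional variant: select every numeric column instead of naming them
numericVariables <- names(dataset)[sapply(dataset, is.numeric)]
```

With this pattern the selection vector plays the role of the second argument to the hypothetical magical.apply in the question.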
