Replace NA with the mean rate of rows sharing the same id

I have a data frame:

    id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
    rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
    df <- data.frame(id, rate)

and I need to replace each NA with the mean rate of its id group (falling back to 1 when the whole group is NA). My current approach is this loop:

    for (i in 1:dim(df)[1]) {
      if (is.na(df$rate[i])) {
        mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
        if (is.nan(mrate)) {
          df$rate[i] <- 1
        } else {
          df$rate[i] <- mrate
        }
      }
    }

The for loop seems too slow on a large data frame (> 200K rows). How can I do the same thing much faster, without a for loop?

Thanks!

3 answers

Here is a solution using data.table:

    library(data.table)
    dt <- data.table(df, key = "id")
    dt[, rate := ifelse(is.na(rate), round(mean(rate, na.rm = TRUE), 1), rate), by = id]
    dt[is.na(rate), rate := 1]
    dt
        id rate
     1:  1  1.0
     2:  1  1.0
     3:  1  1.0
     4:  1  1.0
     5:  2  0.5
     6:  2  0.6
     7:  2  0.6
     8:  3  0.7
     9:  3  0.7
    10:  4  1.0

I am not sure, though, whether the ifelse can or should be avoided; one way to sidestep it is sketched below.
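If you do want to avoid it, here is a minimal sketch of one alternative (my rewrite of the same approach, using base R's replace() for the per-group fill; not from the original answer):

    library(data.table)
    dt <- data.table(df, key = "id")

    # Fill only the NA positions within each group; replace() leaves the rest untouched.
    dt[, rate := replace(rate, is.na(rate), round(mean(rate, na.rm = TRUE), 1)), by = id]

    # An all-NA group yields NaN above, which is.na() still catches; fall back to 1.
    dt[is.na(rate), rate := 1]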


As mentioned in my comment, for loops in R are not particularly slow. Often, however, a for loop points to some other inefficiency in the code. In this case, the subsetting operation that is repeated for every row to compute the mean is most likely the slowest bit of code.

    for (i in 1:dim(df)[1]) {
      if (is.na(df$rate[i])) {
        mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)  ## This line!
        if (is.nan(mrate)) {
          df$rate[i] <- 1
        } else {
          df$rate[i] <- mrate
        }
      }
    }

If the group means are instead precomputed, the loop only has to do a fast lookup.

    foo <- aggregate(df$rate, list(df$id), mean, na.rm = TRUE)
    for (i in 1:dim(df)[1]) {
      if (is.na(df$rate[i])) {
        mrate <- foo$x[foo$Group.1 == df$id[i]]
        ...
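For completeness, a sketch of how the elided part of that loop might continue, keeping the NaN fallback from the original loop (my guess, not the author's code):

    foo <- aggregate(df$rate, list(df$id), mean, na.rm = TRUE)

    for (i in 1:nrow(df)) {
      if (is.na(df$rate[i])) {
        # Fast lookup in the small precomputed table instead of re-subsetting df.
        mrate <- round(foo$x[foo$Group.1 == df$id[i]], 1)
        df$rate[i] <- if (is.nan(mrate)) 1 else mrate
      }
    }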

However, this still subsets the big data.frame on df$id[i]. Instead, it is a good idea to use one of the tools implementing the split-apply-combine strategy. Also, let's write a function that takes a chunk of the data.frame, uses the precomputed group means, and does the right thing:

    myfun <- function(DF) {
      avg <- avgs$rate[avgs$id == unique(DF$id)]
      if (is.nan(avg)) {
        avg <- 1
      }
      DF$rate[is.na(DF$rate)] <- avg
      return(DF)
    }

plyr version:

    library(plyr)
    avgs <- ddply(df, .(id), summarise, rate = mean(rate, na.rm = TRUE))
    result <- ddply(df, .(id), myfun)

Note that avgs must already exist when the second ddply call runs, since myfun reads it from the enclosing environment. And the data.table version, which is most likely faster:

    library(data.table)
    DT <- data.table(df)
    setkey(DT, id)
    DT[, avg := mean(rate, na.rm = TRUE), by = id]
    DT[is.nan(avg), avg := 1]
    DT[, rate := ifelse(is.na(rate), avg, rate)]

Thus we have avoided all the subsetting in favor of adding a precomputed column, and the per-row lookups are now fast and efficient. The extra column can be dropped cheaply with:

    DT[, avg := NULL]

The whole shebang could be written as a function or as a single data.table expression (sketched below), but IMO that often comes at the expense of clarity!
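For illustration, one way that single data.table expression might look (a sketch under the same fallback rules; not from the original answer):

    library(data.table)
    DT <- data.table(df)

    # Everything in one grouped expression: group mean, all-NA fallback, NA fill.
    DT[, rate := {
      m <- round(mean(rate, na.rm = TRUE), 1)
      if (is.nan(m)) m <- 1
      ifelse(is.na(rate), m, rate)
    }, by = id]

Whether that reads better than the three explicit steps above is exactly the clarity trade-off mentioned.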


I'm not sure this answers the OP's question exactly, but for others reading later: there is a different and much faster way to do calculations on a subset of data without actually subsetting it: vector math. Engineers in the crowd will know what I'm talking about.

Instead of subsetting, use a very fast function to build an indicator (identity) vector and multiply the data by it.

Now, this is not faster in every case. There are times when vectorized functions are actually slower than their element-wise counterparts, and it all depends on your specific application. [Insert the big-O caveat of your choice here.]
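As a toy illustration of the idea (made-up numbers, not from the answer): a group mean can be computed with plain elementwise arithmetic, no subsetting required.

    id   <- c(1, 1, 2, 2, 2)
    rate <- c(1.0, 0.9, 0.5, 0.6, NA)

    # Indicator of "rows of id 2 that have data"; TRUE/FALSE coerce to 1/0.
    ind <- (id == 2) & !is.na(rate)

    # The elementwise product zeroes out every other row: (0.5 + 0.6) / 2 = 0.55.
    mean_id2 <- sum(rate * ind, na.rm = TRUE) / sum(ind)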

And here is the full vector-math implementation for the question's data:

    # Create the NA identity vector.
    na_identity <- is.na(df$rate)

    # Initialize the final data frame.
    # This is for non-destructive purposes.
    df_revised <- df

    # Replace all NA occurrences in the final
    # data frame with zero values.
    df_revised$rate[na_identity] <- 0

    # Loop through each unique [id] value in the data.
    # Create an identity vector for the current id,
    # calculate the mean rate for that id (replacing NaN
    # with 1), and insert the mean for any NA values
    # associated with that id.
    for (i in unique(df$id)) {
      id_identity <- df$id == i
      id_mean <- sum(df_revised$rate * id_identity * !na_identity) /
        sum(id_identity * !na_identity)
      if (is.nan(id_mean)) { id_mean <- 1 }
      df_revised$rate <- df_revised$rate + id_mean * id_identity * na_identity
    }

    #    id rate
    # 1   1 1.00
    # 2   1 1.00
    # 3   1 1.00
    # 4   1 1.00
    # 5   2 0.50
    # 6   2 0.60
    # 7   2 0.55
    # 8   3 0.70
    # 9   3 0.70
    # 10  4 1.00

From a vector-math perspective, this code is easy to read. On this small example it is very fast, but the loop time grows linearly with the number of unique id values. I'm not sure whether this is the right approach for the OP's larger application, but the solution works, is theoretically sound, and avoids complex, hard-to-read logical blocks.
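As an aside, base R's ave() splits by group, applies a function, and recombines internally, so the same fill can be written without an explicit loop over ids. A minimal sketch (fill_na is just an illustrative helper name; it keeps the unrounded mean, matching the 0.55 in the output above):

    # Replace each NA in a vector with the mean of the rest; all-NA vectors fall back to 1.
    fill_na <- function(x) {
      m <- mean(x, na.rm = TRUE)
      if (is.nan(m)) m <- 1
      replace(x, is.na(x), m)
    }

    df$rate <- ave(df$rate, df$id, FUN = fill_na)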

