Mutate multiple / sequential columns (with dplyr or base R)

I am trying to create “waves” of variables from repeated measures. Specifically, I want to create sequential variables that hold the average of variables 1-10, 11-20, ..., 91-100. Note that the "..." stands for the variables of waves 3 through 9; avoiding typing all of them out is exactly my goal!

Here is an example df data frame with 10 rows and 100 columns:

    mat <- matrix(runif(1000, 1, 10), ncol = 100)
    df <- data.frame(mat)
    dim(df)
    # [1]  10 100

I used the dplyr mutate function, which works after entering all the variables, but is time consuming and error prone. I could not find a way to do this without resorting to manually typing the column names, as I began to do below (note that “...” symbolizes waves 3 through 9):

    df <- df %>%
      mutate(wave_1 = (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10) / 10,
             wave_2 = (X11 + X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20) / 10,
             ...
             wave_10 = (X91 + X92 + X93 + X94 + X95 + X96 + X97 + X98 + X99 + X100) / 10)

Can you mutate multiple / consecutive columns with 'dplyr'? Other approaches are also welcome.

3 answers

Here is one way with the zoo package:

    library(zoo)
    t(rollapply(t(df), width = 10, by = 10, function(x) sum(x)/10))

Here is one way to do this with base R:

    splits <- 1:100
    dim(splits) <- c(10, 10)
    splits <- split(splits, col(splits))
    results <- do.call("cbind", lapply(splits, function(x) data.frame(rowSums(df[,x] / 10))))
    names(results) <- paste0("wave_", 1:10)
    results

Another very concise way with base R (kindly provided by G. Grothendieck):

 t(apply(df, 1, tapply, gl(10, 10), mean)) 
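To see why this one-liner works: gl(10, 10) builds a factor with ten levels, each repeated ten times, so tapply averages columns 1-10, 11-20, and so on within each row. A minimal sketch of that idea (the seed and the names grp and waves are chosen here for illustration and are not part of the original answer):

```r
set.seed(42)  # illustrative seed, not from the original post
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

# Factor with levels 1..10, each repeated 10 times: one label per column
grp <- gl(10, 10)
table(grp)

# For every row, average the values within each group of ten columns
waves <- t(apply(df, 1, tapply, grp, mean))
colnames(waves) <- paste0("wave_", 1:10)

# Compare one wave against a direct rowMeans calculation
all.equal(unname(waves[, "wave_2"]), unname(rowMeans(df[, 11:20])))
```

The result is a plain matrix with one row per observation and one column per wave; it can be bound to df with cbind if the original columns are needed alongside.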

And here is the solution with dplyr and tidyr :

    library(dplyr)
    library(tidyr)
    df$row <- 1:nrow(df)
    df2 <- df %>% gather(column, value, -row)
    df2$column <- cut(as.numeric(gsub("X", "", df2$column)), breaks = c(0:10*10))
    df2 <- df2 %>% group_by(row, column) %>% summarise(value = sum(value)/10)
    # ungroup first, otherwise select() keeps the grouping column row
    df2 %>% spread(column, value) %>% ungroup() %>% select(-row)

Another dplyr solution that is a bit closer to the syntax specified by the OP and does not require reshaping the data frame.

The four wave calculations below all do essentially the same thing, in slightly different but vectorized ways (using rowSums and rowMeans ):

    df <- df %>%
      mutate(wave_1 = rowSums(select(., num_range("X", 1:10)))/10,
             wave_2 = rowSums(select(., c(11:20)))/10,
             wave_3 = rowMeans(select(., X21:X30)),
             wave_4 = rowMeans(.[, 31:40]))

Edit: The dot ( . ) can be used as a placeholder for the current data frame df (the code has been changed accordingly). wave_4 was also added to demonstrate that the dot can be indexed like a data frame.
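The same rowMeans idea can be extended to all ten waves without typing each one, by building the column names in a loop. This is a base R sketch rather than a dplyr solution, and it assumes the default X1 ... X100 column names; n_waves and wave_size are names chosen here for illustration:

```r
set.seed(42)  # illustrative seed, not from the original post
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

n_waves   <- 10  # number of waves to create
wave_size <- 10  # number of columns averaged per wave

for (w in seq_len(n_waves)) {
  # Column names for this wave, e.g. X11 ... X20 for w = 2
  cols <- paste0("X", ((w - 1) * wave_size + 1):(w * wave_size))
  df[[paste0("wave_", w)]] <- rowMeans(df[, cols])
}

dim(df)  # the 100 original columns plus 10 wave columns
```

Selecting by name rather than by position keeps the loop correct even though new wave columns are appended to df on each iteration.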

If the function you want to apply is not vectorized (that is, unlike rowSums , it cannot be applied to the whole data frame at once), you can also use rowwise together with do to call a non-vectorized function (for example, myfun ):

    myfun <- function (x) {
      sum(x)/10
    }
    tmp <- df %>%
      rowwise() %>%
      do(data.frame(., wave_1 = myfun(unlist(.)[1:10]))) %>%
      do(data.frame(., wave_2 = myfun(unlist(.)[11:20])))

Note: The dot ( . ) seems to change its meaning: it refers to the entire data frame inside mutate but only to the current row inside do .


Another approach (and, IMO, the recommended one) using dplyr is to first reshape your data into a tidy data format before summarizing the values for each wave.

In detail, this process will include:

  1. Reshape your data to long format ( tidyr::gather )
  2. Determine which variables belong to each “wave”
  3. Summarize the values for each wave
  4. Reshape your data back to wide format ( tidyr::spread )

In your example, it will look like this:

    library(tidyverse)
    mat <- matrix(runif(1000, 1, 10), ncol = 100)
    df <- data.frame(mat)
    dim(df)
    df %>%
      dplyr::mutate(id = dplyr::row_number()) %>%
      # reshape to "tidy data" or long format
      tidyr::gather(varname, value, -id) %>%
      # identify which variables belong to which "wave"
      dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                    wave = floor((varnum-1)/10)+1) %>%
      # summarize your value for each wave
      dplyr::group_by(id, wave) %>%
      dplyr::summarise(avg = sum(value)/n()) %>%
      # reshape back to "wide" format
      tidyr::spread(wave, avg, sep='_') %>%
      dplyr::ungroup()

With the following output:

    # A tibble: 10 x 11
          id wave_1 wave_2 wave_3 wave_4 wave_5 wave_6 wave_7 wave_8 wave_9 wave_10
       <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
     1     1   6.24   4.49   5.85   5.43   5.98   6.04   4.83   6.92   5.43    5.52
     2     2   5.16   6.82   5.76   6.66   6.21   5.41   4.58   5.06   5.81    6.93
     3     3   7.23   6.28   5.40   5.70   5.13   6.27   5.55   5.84   6.74    5.94
     4     4   5.27   4.79   4.39   6.85   5.31   6.01   6.15   3.31   5.73    5.63
     5     5   6.48   5.16   5.20   4.71   5.87   4.44   6.40   5.00   5.90    3.78
     6     6   4.18   4.64   5.49   5.47   5.75   6.35   4.34   5.66   5.34    6.57
     7     7   4.97   4.09   6.17   5.78   5.87   6.47   4.96   4.39   5.99    5.35
     8     8   5.50   7.21   5.43   5.15   4.56   5.00   4.86   5.72   6.41    5.65
     9     9   5.27   5.71   5.23   5.44   5.12   5.40   5.38   6.05   5.41    5.30
    10    10   5.95   4.58   6.52   5.46   7.63   5.56   5.82   7.03   5.68    5.38
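The wave index in the pipeline above comes from floor((varnum-1)/10)+1. A quick check on the boundary columns shows how that arithmetic maps column numbers to waves (the vector of column numbers is picked here purely for illustration):

```r
# Column numbers at the edges of waves 1, 2 and 10
varnum <- c(1, 10, 11, 20, 91, 100)

# Subtracting 1 first makes 10 fall into wave 1 rather than wave 2
wave <- floor((varnum - 1) / 10) + 1

data.frame(varnum, wave)
# wave is 1, 1, 2, 2, 10, 10
```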

The wave averages can be joined back to your original data to match your example (which used mutate ) as follows:

    df %>%
      dplyr::mutate(id = dplyr::row_number()) %>%
      tidyr::gather(varname, value, -id) %>%
      dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                    wave = floor((varnum-1)/10)+1) %>%
      dplyr::group_by(id, wave) %>%
      dplyr::summarise(avg = sum(value)/n()) %>%
      tidyr::spread(wave, avg, sep='_') %>%
      dplyr::ungroup() %>%
      dplyr::right_join(df %>% # <-- join back to original data
                          dplyr::mutate(id = dplyr::row_number()),
                        by = 'id')

One of the nice aspects of this approach is that you can inspect your data to make sure that you are assigning variables to waves correctly:

    df %>%
      dplyr::mutate(id = dplyr::row_number()) %>%
      tidyr::gather(varname, value, -id) %>%
      dplyr::mutate(varnum = as.integer(stringr::str_extract(varname, pattern = '\\d+')),
                    wave = floor((varnum-1)/10)+1) %>%
      dplyr::distinct(varname, varnum, wave) %>%
      head()

which produces:

      varname varnum wave
    1      X1      1    1
    2      X2      2    1
    3      X3      3    1
    4      X4      4    1
    5      X5      5    1
    6      X6      6    1
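The gather / spread pipeline in this answer can also be written with tidyr::pivot_longer and tidyr::pivot_wider, which the tidyr documentation recommends over the superseded gather and spread. The following is a sketch of the same reshape, summarize, reshape steps and not the original author's code; it assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later, and the seed and the name waves are illustrative:

```r
library(dplyr)
library(tidyr)

set.seed(42)  # illustrative seed, not from the original post
df <- data.frame(matrix(runif(1000, 1, 10), ncol = 100))

waves <- df %>%
  mutate(id = row_number()) %>%
  # long format: one row per (id, variable)
  pivot_longer(-id, names_to = "varname", values_to = "value") %>%
  # X1..X10 -> wave 1, X11..X20 -> wave 2, and so on
  mutate(varnum = as.integer(sub("X", "", varname)),
         wave = (varnum - 1) %/% 10 + 1) %>%
  group_by(id, wave) %>%
  summarise(avg = mean(value), .groups = "drop") %>%
  # wide format again: one column per wave, named wave_1 ... wave_10
  pivot_wider(names_from = wave, values_from = avg, names_prefix = "wave_")

waves
```

Passing .groups = "drop" to summarise returns an ungrouped result, so no separate ungroup() step is needed at the end.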

Source: https://habr.com/ru/post/1238742/

