R: row sums for different groups of columns whose names start with the same string

I'm new to R, and this is the first time I dare to ask a question here.

I'm working with a dataset of scales, and I want to compute row sums over different groups of columns that share the same first characters in their names.

Below, I built a data frame with only two rows to illustrate the approach I followed, although I would like feedback on how to write this more efficiently.

    df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
    var.names <- c("emp_1", "emp_2", "emp_3", "emp_4",
                   "sat_1", "sat_2", "sat_3",
                   "res_1", "res_2", "res_3", "res_4",
                   "com_1", "com_2", "com_3", "com_4", "com_5",
                   "cap_1", "cap_2", "cap_3", "cap_4")
    names(df) <- var.names

So I used the grep function to sum the rows of the variables whose names start with a specific string and store the result in a new variable. But I have to write a new line of code for each group of variables.

    df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
    df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
    df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
    df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
    df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])

But there are a lot more variables in the real dataset, and I would like to know if there is a way to do this with just one line of code. For example, somehow group together the variables whose names start with the same string, and then apply the rowSums function to each group.

Thanks in advance!

3 answers

One possible solution is to transpose df and calculate the sums for the matching columns using the base R function rowsum (the output below was generated with set.seed(123)):

    cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
    #   emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
    # 1     2     4     5     3     1     2     4     5     3     1     2     4     5     3     1     2     4     5     3     1    13
    # 2     1     3     4     2     5     1     3     4     2     5     1     3     4     2     5     1     3     4     2     5    14
    #   com_t emp_t res_t sat_t
    # 1    15    14    11     7
    # 2    15    10    12     9
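To make the one-liner easier to follow, here is a rough step-by-step sketch of the same idea. The intermediate names (groups, sums_by_group, totals, result) are mine, purely for illustration, and the exact numbers depend on your R version's random number generator, so no output is shown.

```r
# Rebuild the example data from the question (the seed follows the answer above).
set.seed(123)
df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
names(df) <- c("emp_1", "emp_2", "emp_3", "emp_4",
               "sat_1", "sat_2", "sat_3",
               "res_1", "res_2", "res_3", "res_4",
               "com_1", "com_2", "com_3", "com_4", "com_5",
               "cap_1", "cap_2", "cap_3", "cap_4")

# Step 1: one group label per column, e.g. "emp_1" -> "emp_t".
groups <- sub("_.*", "_t", names(df))

# Step 2: transpose so each original column becomes a row, then let
# rowsum() add up the rows that share a label (groups come back sorted).
sums_by_group <- rowsum(t(df), groups)

# Step 3: transpose back so each group total is a column again,
# and bind the totals onto the original data frame.
totals <- t(sums_by_group)
result <- cbind(df, totals)
```

Note that rowsum() sorts the groups alphabetically by default, which is why the total columns appear as cap_t, com_t, emp_t, res_t, sat_t rather than in the original column order.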

I agree with MrFlick that you could put your data in a long format (see reshape2, tidyr), but to answer your question:

    cbind(
      df,
      sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
    )

Will do the trick
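If you need this in several places, the same split.default idea could be wrapped in a small helper. This is only a sketch: add_group_totals and its suffix argument are hypothetical names, not part of any package, and it assumes every column name has the form prefix_something. It uses lapply plus as.data.frame instead of sapply so the result stays a data frame even when there is a single row.

```r
# Hypothetical helper (not from any package): appends one total column per
# name prefix, where the prefix is everything before the first underscore.
add_group_totals <- function(data, suffix = "_t") {
  # "emp_1" -> "emp_t", "sat_2" -> "sat_t", ...
  groups <- sub("_.*$", suffix, names(data))
  # split.default() treats the data frame as a list of columns, so each
  # piece is a data frame holding the columns of one group.
  by_group <- split.default(data, groups)
  # rowSums() per group; each piece is still a data frame, even with one column.
  totals <- as.data.frame(lapply(by_group, rowSums))
  cbind(data, totals)
}

# Small hand-made example with fixed values.
toy <- data.frame(a_1 = c(1, 2), a_2 = c(3, 4), b_1 = c(10, 20))
out <- add_group_totals(toy)
# out$a_t is 4, 6 and out$b_t is 10, 20
```

One side benefit: because each group stays a data frame, a group with only one column still works, whereas df[, grep(...)] in the question drops a single column to a plain vector and rowSums then fails.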


You will be better off in the long run if you put your data in a tidy format. The problem is that the data is in a wide rather than a long format, and variable names such as emp_1 actually represent two separate pieces of data: the person's class and the person's identification number (or something like that). Here is a solution to your problem with dplyr and tidyr.

    library(dplyr)
    library(tidyr)

    df %>%
      gather(key, value) %>%
      extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
      group_by(class) %>%
      summarize(class_sum = sum(value))

First, we convert the data frame from wide to long format with gather(). Then we split keys such as emp_1 into separate class and id columns with extract(). Finally, we group by class and sum the values in each class. Result:

    Source: local data frame [5 x 2]

      class class_sum
    1   cap        26
    2   com        30
    3   emp        23
    4   res        22
    5   sat        19
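Note that the pipeline above gives one total per class over all rows, whereas the question asks for a total per row and class. As a rough illustration of the difference, here is a base R sketch of the long-format idea. It deliberately swaps gather/extract/summarize for rep, sub and tapply, and the toy data and the names long, class_sums and row_class_sums are made up for this example.

```r
# Tiny hand-made data set (not the data from the question).
toy <- data.frame(emp_1 = c(2, 1), emp_2 = c(4, 3), sat_1 = c(1, 5))

# Wide -> long: one line per (row, column) pair. unlist() walks the data
# frame column by column, so row ids cycle fastest and keys repeat per column.
long <- data.frame(
  row   = rep(seq_len(nrow(toy)), times = ncol(toy)),
  key   = rep(names(toy), each = nrow(toy)),
  value = unlist(toy, use.names = FALSE)
)

# Split the key into its class part (everything before the underscore).
long$class <- sub("_.*$", "", long$key)

# One total per class over all rows (what the pipeline above computes).
class_sums <- tapply(long$value, long$class, sum)
# emp = 10, sat = 6

# One total per row and class (what the question asks for).
row_class_sums <- tapply(long$value, list(row = long$row, class = long$class), sum)
# row 1: emp = 6, sat = 1; row 2: emp = 4, sat = 5
```

Keeping a row identifier in the long table is the key design point: without it, the per-row totals cannot be recovered after reshaping.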

Source: https://habr.com/ru/post/987696/

