I would like to use the R dplyr package to compute the following per-interval summaries without using loops:
- Count the observations in each interval (for both absolute and relative endpoints)
- Sum the observation values in each interval (for both absolute and relative endpoints)
The interval endpoints are taken from the df_abs$interval and df_rel$interval columns, e.g.:
- interval: (-Inf, -60]
- interval: (-60, -30]
- interval: (-30, 0]
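To make the endpoint convention concrete, each interval is open on the left and closed on the right. Here is a minimal sketch of that convention, assuming base R's `cut()` with `right = TRUE` is an acceptable way to express it. The vector `x` is made up for illustration and is not part of my data.

```r
# Breakpoints for the first three intervals listed above
breaks <- c(-Inf, -60, -30, 0)

# Illustrative values, including two that sit exactly on a breakpoint
x <- c(-81.0, -60, -25.1, 0)

# right = TRUE makes each bin (lower, upper], so a value equal to a
# breakpoint is assigned to the interval that ends at that breakpoint
bins <- cut(x, breaks = breaks, right = TRUE)

levels(bins)        # the three interval labels
as.character(bins)  # the interval each value of x falls into
```

A value of exactly -60 lands in (-Inf, -60] and not in (-60, -30], which matches the `<` / `<=` comparisons used in my loop.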
The data frames holding the data and the intervals, together with my current loop-based calculation, are as follows:
    library(dplyr)

    # ----------{ data and interval ----------
    df_data <- data.frame(varA = NA, varB = NA,
                          varC = c(-81.0, -14.3, 29.6, 42.7, 46.4, 57.7,
                                   15.3, 256.3, 20.3, -25.1, -23.1, -17.5))
    df_abs <- data.frame(interval = c(-Inf, -60, -30, 0, 30, 60, 100, 200, Inf),
                         count = NA, sum = NA)
    df_rel <- data.frame(interval = c(0, 5, 15, 50, 75, 95, 100),
                         count = NA, sum = NA)
    # ---------- data and interval }----------

    # ----------{ calculation ----------
    # absolute data frame
    # Note: `1 : nrow(df_abs)-1` parses as (1:nrow(df_abs)) - 1, i.e. 0:(n - 1).
    # For i = 0 the lower bound interval[0] is empty, so row 1 gets count 0 and sum 0.
    for (i in 1 : nrow(df_abs)-1) {
      # count observations in (interval[i], interval[i+1]]
      df_abs$count[i+1] <- summarise(
        df_data,
        sum(df_abs$interval[i] < varC & varC <= df_abs$interval[i+1])
      )[[1]]  # [[1]] extracts the number so that count stays a numeric column
      # sum of observations in (interval[i], interval[i+1]]
      df_abs$sum[i+1] <- sum(df_data$varC[df_abs$interval[i] < df_data$varC &
                                          df_data$varC <= df_abs$interval[i+1]])
    }

    # relative data frame: sort by varC and assign each row its cumulative percentage
    df_data_arranged <- df_data %>%
      arrange(varC) %>%
      mutate(observationPercent = c(1:nrow(df_data)) * 100/length(df_data$varC))

    for (i in 1 : nrow(df_rel)-1) {
      # count observations whose percentage lies in (interval[i], interval[i+1]]
      df_rel$count[i+1] <- summarise(
        df_data_arranged,
        sum(df_rel$interval[i] < observationPercent & observationPercent <= df_rel$interval[i+1])
      )[[1]]
      # sum of varC for those observations
      df_rel$sum[i+1] <- sum(df_data_arranged$varC[
        df_rel$interval[i] < df_data_arranged$observationPercent &
        df_data_arranged$observationPercent <= df_rel$interval[i+1]])
    }
    # ---------- calculation }----------
The answer should look like this:
    df_abs <- data.frame(interval = c(-Inf, -60, -30, 0, 30, 60, 100, 200, Inf),
                         count = c(0, 1, 0, 4, 3, 3, 0, 0, 1),
                         sum = c(0, -81, 0, -80, 65.2, 146.8, 0, 0, 256.3))
    df_rel <- data.frame(interval = c(0, 5, 15, 50, 75, 95, 100),
                         count = c(0, 0, 1, 4, 3, 2, 1),
                         sum = c(0, 0, -81, -39.6, 92.6, 104.1, 256.3))
As far as I understand the dplyr package, there should be a fairly short and straightforward solution for each of the two problems without the need to use loops at all.
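For reference, here is a rough sketch of the kind of loop-free approach I have in mind. It assumes that binning with base R's `cut()` and aggregating with `group_by(.drop = FALSE)` plus `summarise()` is acceptable. The helper name `summarise_bins` and the result column names are hypothetical and not part of dplyr. Unlike my loop, it returns one row per interval (labelled by its upper endpoint) and omits the leading placeholder row for the first breakpoint.

```r
library(dplyr)

varC <- c(-81.0, -14.3, 29.6, 42.7, 46.4, 57.7,
          15.3, 256.3, 20.3, -25.1, -23.1, -17.5)
abs_breaks <- c(-Inf, -60, -30, 0, 30, 60, 100, 200, Inf)
rel_breaks <- c(0, 5, 15, 50, 75, 95, 100)

# Hypothetical helper: assign each `key` to a (lower, upper] bin and
# aggregate `value` per bin. .drop = FALSE keeps bins with no observations.
summarise_bins <- function(value, key, breaks) {
  tibble(value = value, key = key) %>%
    mutate(bin = cut(key, breaks = breaks, right = TRUE)) %>%
    group_by(bin, .drop = FALSE) %>%
    summarise(count = n(), sum = sum(value), .groups = "drop") %>%
    # label each bin by its upper endpoint, like the interval column
    mutate(interval = breaks[-1][as.integer(bin)])
}

# Absolute endpoints: bin on the values themselves
res_abs <- summarise_bins(varC, varC, abs_breaks)

# Relative endpoints: bin on the cumulative percentage of the sorted data,
# which is the same quantity as observationPercent in the loop version
pct <- rank(varC, ties.method = "first") * 100 / length(varC)
res_rel <- summarise_bins(varC, pct, rel_breaks)

print(res_abs)
print(res_rel)
```

With the sample data, the relative version puts five observations (sum -64.7) into the (15, 50] bin, because the percentages 16.7, 25, 33.3, 41.7 and 50 all fall inside that interval. The loop above gives the same result for that row.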