Group data / bin / bucket in R and get the number per bucket and the sum of the values for each bucket

Question

Group data / bin / bucket in R and get the number per bucket and the sum of the values for each bucket

I need the data bucket / group / bin:

C1 C2 C3 49488.01172 0.0512 54000 268221.1563 0.0128 34399 34775.96094 0.0128 54444 13046.98047 0.07241 61000 2121699.75 0.00453 78921 71155.09375 0.0181 13794 1369809.875 0.00453 12312 750 0.2048 43451 44943.82813 0.0362 49871 85585.04688 0.0362 18947 31090.10938 0.0362 13401 68550.40625 0.0181 14345

I want to write it with C2 values, but I want to define buckets, for example. <= 0.005, <= 010, <= 014, etc. As you can see, balancing will be uneven intervals. I want to count C1 per bucket, as well as the total amount of C1 for each bucket.

I don’t know where to start since I am a fairly new user R. Is there anyone who wants to help me figure out the code or give me an example that will work for my needs?

EDIT: Added another C3 column. I need the amount of C3 per bucket, as well as the amount and amount of C1 per bucket

+5

r binning

Freewill Jan 4 '15 at 5:09

source share

1 answer

akrun · Accepted Answer · 2015-01-04T05:16:24+0000

From the comments, “C2” appears to be a “character” column with % as a suffix. Before creating a group, delete % with sub , convert to "numeric" ( as.numeric ). The variable "group" ( transform(df,...) ) is created using the cut function with breaks (group buckets / intervals) and labels (for the desired group labels). After creating a group variable, the sum value for “C1” by “group” and “counting” of elements within the “group” can be performed using aggregate from “base R”

 df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)), breaks=c(-Inf,0.005, 0.010, 0.014, Inf), labels=c('<0.005', 0.005, 0.01, 0.014))) res <- do.call(data.frame,aggregate(C1~group, df1, FUN=function(x) c(Count=length(x), Sum=sum(x)))) dNew <- data.frame(group=levels(df1$group)) merge(res, dNew, all=TRUE) # group C1.Count C1.Sum #1 <0.005 2 3491509.6 #2 0.005 NA NA #3 0.01 2 302997.1 #4 0.014 8 364609.5

or you can use data.table . setDT converts data.frame to data.table . Set the "grouping" variable with by= and sum / create the two variables "Count" and "Sum" within the list( .. .N gives the number of elements in each "group".

  library(data.table) setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or using dplyr . %>% binds LHS with RHS arguments and combines them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the total counter and sum corresponding column. summarise_each would be useful if there is more than one column.

  library(dplyr) df1 %>% group_by(group) %>% summarise_each(funs(n(), Sum=sum(.)), C1)

Update

Using the new df dataset

 df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf), labels=c('<0.005', 0.005, 0.01, 0.014))) res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, FUN=function(x) c(Count=length(x), Sum=sum(x)))) res # group C1.Count C1.Sum C3.Count C3.Sum #1 <0.005 2 3491509.6 2 91233 #2 0.01 2 302997.1 2 88843 #3 0.014 8 364609.5 8 268809

and you can do merge as described above.

The dplyr approach will be the same except specifying an additional variable

  df1%>% group_by(group) %>% summarise_each(funs(n(), Sum=sum(.)), C1, C3) #Source: local data frame [3 x 5] # group C1_n C3_n C1_Sum C3_Sum #1 <0.005 2 2 3491509.6 91233 #2 0.01 2 2 302997.1 88843 #3 0.014 8 8 364609.5 268809

data

 df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", "0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", "0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", "C2"), row.names = c(NA, -12L), class = "data.frame")

Group data / bin / bucket in R and get the number per bucket and the sum of the values ​​for each bucket

Update

data

More articles:

Group data / bin / bucket in R and get the number per bucket and the sum of the values for each bucket