A more efficient way to count response frequencies in the columns of a data frame

I have some survey data in which columns correspond to items and rows correspond to clients indicating how likely they are to buy each item. It looks like this:

    item1 = c("Likely", "Unlikely", "Very Likely", "Likely")
    item2 = c("Likely", "Unlikely", "Very Likely", "Unlikely")
    item3 = c("Very Likely", "Unlikely", "Very Likely", "Likely")
    df = data.frame(item1, item2, item3)

I want a pivot table giving the percentage of each response for each item. Right now I am calling table() on each column separately, which is a lot of code to maintain. How can I do this with plyr, apply, or something faster?

Current solution:

    d1 <- as.data.frame(table(df$item1))
    d1$item1_percent <- d1$Freq / sum(d1$Freq)
    names(d1) <- c("Response", "item1_freqs", "item1_percent")

    d2 <- as.data.frame(table(df$item2))
    d2$item2_percent <- d2$Freq / sum(d2$Freq)
    names(d2) <- c("Response", "item2_freqs", "item2_percent")

    d3 <- as.data.frame(table(df$item3))
    d3$item3_percent <- d3$Freq / sum(d3$Freq)
    names(d3) <- c("Response", "item3_freqs", "item3_percent")

    results <- cbind(d1, d2[, 2:3], d3[, 2:3])

Note: I don't really need the frequency counts, just the percentages.

Thanks in advance!

4 answers

Since you have the same set of response values in each column, you can use:

    sapply(df, function(x) prop.table(table(x)))
    #             item1 item2 item3
    # Likely       0.50  0.25  0.25
    # Unlikely     0.25  0.50  0.25
    # Very Likely  0.25  0.25  0.50
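Since the question asks for percentages rather than proportions, here is a small follow-up sketch (my addition, assuming a 0-100 scale rounded to one decimal is what's wanted), built on the same call:

    # rescale the proportion matrix to percentages (not part of the answer above)
    pcts <- round(100 * sapply(df, function(x) prop.table(table(x))), 1)
    pcts
    #             item1 item2 item3
    # Likely         50    25    25
    # Unlikely       25    50    25
    # Very Likely    25    25    50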

But if the columns had different response sets, you could first set a common set of factor levels for every column:

    df[] <- lapply(df, factor, levels = unique(unlist(df)))
    sapply(df, function(x) prop.table(table(x)))
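To illustrate what the common levels buy you, here is a sketch (my addition) with a hypothetical extra column item4 containing a response the other items never use; after fixing the levels, every item reports the full set of responses, with 0 where a response does not occur:

    # hypothetical extra column with a response level ("Very Unlikely") the others lack
    item4 <- c("Very Unlikely", "Likely", "Likely", "Likely")
    df2 <- data.frame(item1, item2, item3, item4, stringsAsFactors = FALSE)
    df2[] <- lapply(df2, factor, levels = unique(unlist(df2)))
    sapply(df2, function(x) prop.table(table(x)))
    #               item1 item2 item3 item4
    # Likely         0.50  0.25  0.25  0.75
    # Unlikely       0.25  0.50  0.25  0.00
    # Very Likely    0.25  0.25  0.50  0.00
    # Very Unlikely  0.00  0.00  0.00  0.25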

Using dplyr:

    library(dplyr)

    results = data.frame(df %>% group_by(item1) %>% summarise(no_rows = length(item1) / nrow(df)))
    results = cbind(results, data.frame(df %>% group_by(item2) %>% summarise(no_rows = length(item2) / nrow(df))))
    results = cbind(results, data.frame(df %>% group_by(item3) %>% summarise(no_rows = length(item3) / nrow(df))))

    # > results
    #         item1 no_rows       item2 no_rows       item3 no_rows
    # 1      Likely    0.50      Likely    0.25      Likely    0.25
    # 2    Unlikely    0.25    Unlikely    0.50    Unlikely    0.25
    # 3 Very Likely    0.25 Very Likely    0.25 Very Likely    0.50
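If you would rather not repeat that block for every column, here is a sketch of the same dplyr computation wrapped in a loop over the column names (my addition; it assumes group_by_at(), available in dplyr 0.7.0 and later):

    # one summary per column, bound side by side as in the answer above
    results_list <- lapply(names(df), function(col) {
        as.data.frame(df %>% group_by_at(col) %>% summarise(no_rows = n() / nrow(df)))
    })
    results_loop <- do.call(cbind, results_list)
    # results_loop matches the results printed above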

Consider a chain of merges with Reduce, where you first loop over the column indices with lapply to build a list of data frames, which is then merged on Response:

    dfList <- lapply(seq_along(df), function(i) {
        d <- as.data.frame(table(df[, i]))
        d$pct <- d$Freq / sum(d$Freq)
        # PASS COLUMN NUMBER INTO DF COLUMN NAMES
        names(d) <- c("Response", paste0("item", i, "_freqs"), paste0("item", i, "_percent"))
        return(d)
    })

    results2 <- Reduce(function(x, y) merge(x, y, by = "Response", all = TRUE), dfList)

    # EQUIVALENT TO ORIGINAL results
    all.equal(results, results2)
    # [1] TRUE
    identical(results, results2)
    # [1] TRUE
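Since the question notes that the raw counts are not needed, a small follow-up (my addition) that drops the *_freqs columns from the merged result:

    # keep only Response and the *_percent columns
    pct_only <- results2[, !grepl("_freqs$", names(results2))]
    pct_only
    #      Response item1_percent item2_percent item3_percent
    # 1      Likely          0.50          0.25          0.25
    # 2    Unlikely          0.25          0.50          0.25
    # 3 Very Likely          0.25          0.25          0.50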

I would suggest organizing the data differently, in long format with a column that identifies the item; this makes the data easier to work with. I will reshape your data with gather() and then use summarise() to calculate the percentages:

    library(tidyverse)

    results <- df %>%
        gather("item", "likelihood") %>%
        group_by(item, likelihood) %>%
        summarise(n = n()) %>%
        mutate(freq = n / sum(n))

    # > results
    # # A tibble: 9 x 4
    # # Groups:   item [3]
    #    item  likelihood     n  freq
    #   <chr>       <chr> <int> <dbl>
    # 1 item1      Likely     2  0.50
    # 2 item1    Unlikely     1  0.25
    # 3 item1 Very Likely     1  0.25
    # 4 item2      Likely     1  0.25
    # 5 item2    Unlikely     2  0.50
    # 6 item2 Very Likely     1  0.25
    # 7 item3      Likely     1  0.25
    # 8 item3    Unlikely     1  0.25
    # 9 item3 Very Likely     2  0.50

This uses dplyr and tidyr, but I prefer loading the tidyverse package, since it attaches both at once.

Edit: if you want the items as columns (one column of percentages per item), you can use spread() for that:

    col_results <- results %>%
        select(-n) %>%
        spread(item, freq)

    # > col_results
    # # A tibble: 3 x 4
    #    likelihood item1 item2 item3
    # *       <chr> <dbl> <dbl> <dbl>
    # 1      Likely  0.50  0.25  0.25
    # 2    Unlikely  0.25  0.50  0.25
    # 3 Very Likely  0.25  0.25  0.50
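If plain proportions are harder to read, a quick sketch (my addition) that rescales the item columns to a 0-100 percentage scale:

    # rescale the spread table from proportions to percentages
    col_results_pct <- col_results %>%
        mutate_at(vars(item1, item2, item3), function(x) 100 * x)
    # the item columns now read 50, 25, ... instead of 0.50, 0.25, ...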

Source: https://habr.com/ru/post/1268888/

