dplyr: Percentage of factor levels grouped by school, grouping not applied

I have a long data set with one row per person, grouped by school. Each row has an ordered factor, "cats", with levels {1, 2, 3, 4}. I want to get the percentage of 1, 2, 3 and 4 in each school. The data set looks like this:

       school_number cats
    1          10505    3
    2          10505    3
    3          10502    1
    4          10502    1
    5          10502    2
    6          10502    1
    7          10502    1
    8          10502    2
    10         10503    3
    11         10505    2

I tried something like this:

    df_pcts <- df %>%
      group_by(school_number) %>%
      mutate(total = sum(table(cats))) %>%
      summarize(cat_pct = table(cats) / total)

but the total variable created in the mutate() step puts the total number of rows in the whole data set on every row, as if the grouping were ignored. I can't even get to the final step. I'm confused.

PS In some other posts I saw lines like this:

 n = n() 

when I do this, I get a message that

 Error in n() : This function should not be called directly 

Where does that come from?

TIA

+6
3 answers

Maybe this helps a little, although I'm not 100% sure what you need.

This uses tally to count the rows for each school_number / cats combination that exists in your df. It then regroups by school_number only and calculates the percentage of each "cats" value within each school.

    df %>%
      group_by(school_number, cats) %>%
      tally() %>%
      group_by(school_number) %>%
      mutate(pct = (100 * n) / sum(n))

It gives the following:

    #   school_number cats n       pct
    # 1         10502    1 4  66.66667
    # 2         10502    2 2  33.33333
    # 3         10503    3 1 100.00000
    # 4         10505    2 1  33.33333
    # 5         10505    3 2  66.66667
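If you want to try this yourself, here is a minimal sketch that rebuilds the sample data from the question and runs the same pipeline. The construction of df is my assumption (the question only says that cats is an ordered factor with levels 1 to 4), and `check` is just a name I made up for a sanity check that each school's percentages add up to 100.

```r
library(dplyr)

# Sample data from the question. Assumption: cats is an ordered factor
# with levels 1-4, as described (level 4 never occurs in the sample).
df <- data.frame(
  school_number = c(10505, 10505, 10502, 10502, 10502,
                    10502, 10502, 10502, 10503, 10505),
  cats = factor(c(3, 3, 1, 1, 2, 1, 1, 2, 3, 2), levels = 1:4, ordered = TRUE)
)

pcts <- df %>%
  group_by(school_number, cats) %>%
  tally() %>%                        # one row per school/cats pair, count in n
  group_by(school_number) %>%
  mutate(pct = 100 * n / sum(n))     # share of each cats value within its school

# Sanity check (hypothetical helper): percentages per school should total 100
check <- pcts %>% summarise(total = sum(pct))

print(pcts)
print(check)
```

Note that only the combinations that actually occur show up here, so schools with no rows for a given cats value get no 0% row.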

EDIT:

To add the 0% rows that are missing from your sample data, you can do the following. Bind the result above to a data frame that contains 0% for every school_number / cats combination. Keep only the first row of each combination (the first row always holds the value > 0%, if one exists). Then arrange by school_number and cats for readability:

    y <- df %>%
      group_by(school_number, cats) %>%
      tally %>%
      group_by(school_number) %>%
      mutate(pct = (100 * n) / sum(n)) %>%
      select(-n)

    x <- data.frame(school_number = rep(unique(df$school_number), each = 4),
                    cats = 1:4, pct = 0)

    rbind(y, x) %>%
      group_by(school_number, cats) %>%
      filter(row_number() == 1) %>%
      arrange(school_number, cats)

which gives:

    #    school_number cats       pct
    # 1          10502    1  66.66667
    # 2          10502    2  33.33333
    # 3          10502    3   0.00000
    # 4          10502    4   0.00000
    # 5          10503    1   0.00000
    # 6          10503    2   0.00000
    # 7          10503    3 100.00000
    # 8          10503    4   0.00000
    # 9          10505    1   0.00000
    # 10         10505    2  33.33333
    # 11         10505    3  66.66667
    # 12         10505    4   0.00000
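As a cross-check that does not depend on dplyr at all, the same 12 rows can probably be obtained with base R's table() and prop.table(). This is a swapped-in technique rather than the approach above, and it is only a sketch: the df construction and the name `long` are my assumptions. It relies on table() keeping unused factor levels, which is what produces the 0% rows.

```r
# Sample data from the question. Assumption: cats is an ordered factor
# with levels 1-4, so the unused level 4 is still known to table().
df <- data.frame(
  school_number = c(10505, 10505, 10502, 10502, 10502,
                    10502, 10502, 10502, 10503, 10505),
  cats = factor(c(3, 3, 1, 1, 2, 1, 1, 2, 3, 2), levels = 1:4, ordered = TRUE)
)

# Cross-tabulate schools (rows) against cats levels (columns)
counts <- table(df$school_number, df$cats)

# Row-wise proportions, scaled to percentages
pct <- prop.table(counts, margin = 1) * 100

# Back to long format: one row per school/cats combination
long <- as.data.frame(pct, stringsAsFactors = FALSE)
names(long) <- c("school_number", "cats", "pct")
long <- long[order(long$school_number, long$cats), ]

print(long)
```

Here school_number and cats come back as character columns, so convert them if you need the original types.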
+9

Build all combinations of school_number and cats, then left-join the calculated pct onto them. Where pct is NA, set it to 0:

    expand.grid(school_number = unique(df$school_number),
                cats = levels(df$cats)) %>%
      left_join(df %>%
                  group_by(school_number, cats) %>%
                  tally %>%
                  mutate(pct = (n / sum(n) * 100))) %>%
      select(-n) %>%
      mutate(pct = ifelse(is.na(pct), 0, pct)) %>%
      arrange(school_number)

which gives

       school_number cats       pct
    1          10502    1  66.66667
    2          10502    2  33.33333
    3          10502    3   0.00000
    4          10502    4   0.00000
    5          10503    1   0.00000
    6          10503    2   0.00000
    7          10503    3 100.00000
    8          10503    4   0.00000
    9          10505    1   0.00000
    10         10505    2  33.33333
    11         10505    3  66.66667
    12         10505    4   0.00000
0

As suggested by @akrun, you probably loaded both plyr and dplyr. Since summaris(z)e exists in both packages, plyr's version may be masking dplyr's. You can say which one you mean by putting the package name before the function name, i.e. dplyr::fun(argument...).
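For example, a minimal sketch of the namespaced call might look like this. The df construction is my assumption based on the sample data in the question, and plyr is deliberately not loaded here so that the snippet runs without it; the dplyr:: prefixes are there so the call would still resolve to dplyr's verbs if plyr were attached afterwards.

```r
library(dplyr)

# Sample data from the question (assumed construction)
df <- data.frame(
  school_number = c(10505, 10505, 10502, 10502, 10502,
                    10502, 10502, 10502, 10503, 10505),
  cats = factor(c(3, 3, 1, 1, 2, 1, 1, 2, 3, 2), levels = 1:4, ordered = TRUE)
)

# Explicitly use dplyr's group_by/summarise so a masking summarise
# from another package cannot be picked up by accident
counts <- df %>%
  dplyr::group_by(school_number) %>%
  dplyr::summarise(n = n())   # n() counts the rows in each group

print(counts)
```

With the sample data this should give 6 rows for school 10502, 1 for 10503 and 3 for 10505.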

0

Source: https://habr.com/ru/post/975383/

