dplyr::mutate gives x / y = NA, summarise gives x / y = real number

I am testing a function that calculates the pass rates for a particular test in my lab. The math behind it is very simple: given the number of tests that either passed or failed, what percentage passed?

The data will be provided as a column of values that are either P1 (passed on the first test), F1 (failed on the first test), P2, or F2 (passed or failed on the second test, respectively). I wrote the passRate function to help calculate the overall pass rate (first and second attempts combined), as well as the first-test and second-test pass rates in isolation.

The quality specialist who set up the validation parameters gave me a list of passes and failures, which I convert to a vector using the test_vector function.
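For readers who want to follow along, the calculation described above can be sketched in base R. This is not the passRate or test_vector code from the question, which is not reproduced here; counts_to_codes and pass_rates are hypothetical stand-ins written purely for illustration.

```r
# Hypothetical helper (NOT the original test_vector): expand the four
# counts into a character vector of P1/F1/P2/F2 codes.
counts_to_codes <- function(p1, f1, p2, f2) {
  rep(c("P1", "F1", "P2", "F2"), times = c(p1, f1, p2, f2))
}

# Hypothetical helper (NOT the original passRate): percentage of passes
# overall, in the first test only, and in the second test only.
pass_rates <- function(codes) {
  # Fixing the factor levels keeps codes that never occur as zero counts.
  n <- table(factor(codes, levels = c("P1", "F1", "P2", "F2")))
  n <- setNames(as.numeric(n), names(n))
  c(
    overall = (n[["P1"]] + n[["P2"]]) / sum(n) * 100,
    first   = n[["P1"]] / (n[["P1"]] + n[["F1"]]) * 100,
    second  = n[["P2"]] / (n[["P2"]] + n[["F2"]]) * 100
  )
}

# Counts taken from the second row of the Pass data frame in the question.
codes <- counts_to_codes(0, 2, 3, 2)
rates <- pass_rates(codes)
print(rates)
```

When a stage has no tests at all, the formula divides 0 by 0, which R evaluates to NaN rather than raising an error.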

Everything looked fine until I got to the third row of the Pass data frame, which contains the pass/fail data from my quality specialist. Instead of returning a second-test pass rate of 100%, it returns NA, but only when I use mutate.

    library(dplyr)

    Pass <- structure(list(P1 = c(2L, 0L, 10L),
                           F1 = c(0L, 2L, 0L),
                           P2 = c(0L, 3L, 2L),
                           F2 = c(0L, 2L, 0L),
                           id = 1:3),
                      .Names = c("P1", "F1", "P2", "F2", "id"),
                      class = c("tbl_df", "data.frame"),
                      row.names = c(NA, -3L))

Here is something similar to what I did with mutate:

    Pass %>%
      group_by(id) %>%
      mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
             pass_rate1 = P1 / (P1 + F1) * 100,
             pass_rate2 = P2 / (P2 + F2) * 100)

    Source: local data frame [3 x 8]
    Groups: id [3]

         P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
      (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
    1     2     0     0     0     1 100.00000        100         NA
    2     0     2     3     2     2  42.85714          0         60
    3    10     0     2     0     3 100.00000        100         NA

Compare that with what I get when I use summarise:

    Pass %>%
      group_by(id) %>%
      summarise(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
                pass_rate1 = P1 / (P1 + F1) * 100,
                pass_rate2 = P2 / (P2 + F2) * 100)

    Source: local data frame [3 x 4]

         id pass_rate pass_rate1 pass_rate2
      (int)     (dbl)      (dbl)      (dbl)
    1     1 100.00000        100         NA
    2     2  42.85714          0         60
    3     3 100.00000        100        100

I would expect them to return the same results. My guess is that mutate trips up somewhere because it assumes that n rows in a group must produce n rows in the result (is it getting confused when working out n here?), whereas summarise knows that no matter how many rows it starts with, it ends up with only one.
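As a cross-check that does not involve dplyr at all, the same three formulas can be evaluated with base R's transform. This is only a sketch: it rebuilds the counts as a plain data.frame rather than a tbl_df, on the assumption that this does not change the arithmetic.

```r
# Same counts as the Pass data frame above, as a plain data.frame
# (an assumption made so that no package is needed).
Pass <- data.frame(
  P1 = c(2L, 0L, 10L),
  F1 = c(0L, 2L, 0L),
  P2 = c(0L, 3L, 2L),
  F2 = c(0L, 2L, 0L),
  id = 1:3
)

# Vectorised arithmetic already works row by row, so no grouping is used.
expected <- transform(
  Pass,
  pass_rate  = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
  pass_rate1 = P1 / (P1 + F1) * 100,
  pass_rate2 = P2 / (P2 + F2) * 100
)
print(expected)
```

Here the third row gets a pass_rate2 of 100, and the first row gets NaN because it divides 0 by 0.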

Does anyone have any thoughts on what mechanics are behind this behavior?

1 answer

It seems to me that there is some interference between dplyr and plyr. I had the same problem with another unbalanced dataset (which is why grouping was necessary), where the mutated variable was wrongly NA in exactly the third group! I then reproduced your example at home. First, after loading dplyr with

 library("dplyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2") 

I got exactly your results. Then I ran my own script, in which the plyr package gets loaded. Apart from the warning not to load plyr after dplyr, the NA was gone from my third group, and your example was calculated correctly as well! Here is what I did (I added another row to see whether the NA stays in the third group):

    > Pass <- structure(list(P1 = c(2L, 0L, 10L, 8L),
    +                        F1 = c(0L, 2L, 0L, 4L),
    +                        P2 = c(0L, 3L, 2L, 2L),
    +                        F2 = c(0L, 2L, 0L, 1L),
    +                        id = 1:4),
    +                   .Names = c("P1", "F1", "P2", "F2", "id"),
    +                   class = c("tbl_df", "data.frame"),
    +                   row.names = c(NA, -4L))
    > Pass %>%
    +   group_by(id) %>%
    +   mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
    +          pass_rate1 = P1 / (P1 + F1) * 100,
    +          pass_rate2 = P2 / (P2 + F2) * 100)
    Source: local data frame [4 x 8]
    Groups: id [4]

         P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
      (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
    1     2     0     0     0     1 100.00000  100.00000         NA
    2     0     2     3     2     2  42.85714    0.00000   60.00000
    3    10     0     2     0     3 100.00000  100.00000         NA
    4     8     4     2     1     4  66.66667   66.66667   66.66667

Then I did:

    > library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")
    > Pass %>%
    +   group_by(id) %>%
    +   mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
    +          pass_rate1 = P1 / (P1 + F1) * 100,
    +          pass_rate2 = P2 / (P2 + F2) * 100)
    Source: local data frame [4 x 8]
    Groups: id [4]

         P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
      (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
    1     2     0     0     0     1 100.00000  100.00000        NaN
    2     0     2     3     2     2  42.85714    0.00000   60.00000
    3    10     0     2     0     3 100.00000  100.00000  100.00000
    4     8     4     2     1     4  66.66667   66.66667   66.66667

I know this is not a satisfactory answer, because plyr should NOT be loaded after dplyr, but maybe it helps those who need group_by(id). Alternatively, use plyr::mutate(); then you can load dplyr after plyr:

    > Pass %>%
    +   group_by(id) %>%
    +   plyr::mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
    +                pass_rate1 = P1 / (P1 + F1) * 100,
    +                pass_rate2 = P2 / (P2 + F2) * 100)
    Source: local data frame [4 x 8]
    Groups: id [4]

         P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
      (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
    1     2     0     0     0     1 100.00000  100.00000        NaN
    2     0     2     3     2     2  42.85714    0.00000   60.00000
    3    10     0     2     0     3 100.00000  100.00000  100.00000
    4     8     4     2     1     4  66.66667   66.66667   66.66667
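A way to make the choice of mutate explicit, regardless of load order, is to namespace-qualify every call. The sketch below assumes dplyr is installed and treats plyr as optional; it rebuilds the four-row data as a plain data.frame, which is an assumption made for brevity. As far as I know, plyr::mutate is unaware of dplyr's grouping and simply evaluates each expression over the whole data frame, which may be why it sidesteps the problem here.

```r
# Four-row example data from above, as a plain data.frame (assumption).
Pass <- data.frame(
  P1 = c(2L, 0L, 10L, 8L),
  F1 = c(0L, 2L, 0L, 4L),
  P2 = c(0L, 3L, 2L, 2L),
  F2 = c(0L, 2L, 0L, 1L),
  id = 1:4
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Always dplyr's mutate, whatever else is attached or masking it.
  by_dplyr <- dplyr::mutate(dplyr::group_by(Pass, id),
                            pass_rate2 = P2 / (P2 + F2) * 100)
  print(by_dplyr)
}

if (requireNamespace("plyr", quietly = TRUE)) {
  # Always plyr's mutate; no grouping is applied in this call.
  by_plyr <- plyr::mutate(Pass, pass_rate2 = P2 / (P2 + F2) * 100)
  print(by_plyr)
}

# Lists names that are defined in more than one attached environment,
# which shows whether mutate is currently being masked.
print(conflicts(detail = TRUE))
```

Qualified calls like these do not depend on which package was attached last, so the result no longer changes with the order of the library() calls.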

Source: https://habr.com/ru/post/1233640/

