dplyr summarise gives variable results depending on the output variable name

I use the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summaries of grouped data (summarise), but I get inconsistent results (NaN for "sd" and an incorrect count for "n"). Changing the name of the output variable has variable effects (examples below).

Summary of results:

  • The plyr package is not loaded; as far as I know, it can cause problems with dplyr if it is loaded first.
  • The same results are obtained with or without NA data (not shown).
  • The problem can be worked around by using camelCase variable naming (not shown) or by using an output variable name without a non-alphanumeric separator.
  • Acceptable results are obtained with "." or "_" in the output column names in some cases (see the examples below).

Question. Although this problem can be worked around, am I breaking a basic rule for naming variables, or is there a problem in the package that needs to be fixed? I have seen other questions about variable summarise behaviour, but none quite like this.

Thanks, Matt

Example data:

library(dplyr)
df<-data_frame(id=c(1,1,1,2,2,2,3,3,3),
       time=rep(1:3, 3),
       glucose=c(90,150, 200,
                 100,150,200,
                 80,100,150))

Example: sd gives NaN and n is wrong

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

I wondered whether the problem came from using "." in the name, or from reusing a name that already exists in the data frame. Removing the existing df column names from the output fixes it:

df %>% group_by(time) %>%
  summarise(avg=mean(glucose, na.rm=TRUE),
        stdv=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3

Dropping glucose from the output also fixes it, even if glucose.sd is kept. Example: with glucose removed, the result is OK:

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3

Using "glucose.mean" instead of "glucose" as the output name also gives correct results:

df %>% group_by(time) %>%
  summarise(glucose.mean=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))

   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

Replacing "." with "_" while keeping "glucose" as an output name does not help:

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

With "_" as the separator and "glucose_mean" as the output name, the results are correct:

df %>% group_by(time) %>%
  summarise(glucose_mean=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

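Answer:

In summarise, as in mutate, the expressions are evaluated in order, so each one can refer to variables created by the ones before it (unlike base transform()). In your code: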

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

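glucose=mean(glucose, na.rm=TRUE) replaces glucose with a single value per group, so by the time glucose.sd=sd(glucose, na.rm=TRUE) is evaluated, sd() sees only one number and returns NaN, and n counts 1. Computing the mean last avoids the problem: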

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))
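
The overwrite can be reproduced without dplyr. The following is only a base R sketch of the mechanism for the group time == 1, not of how summarise is implemented; the helper names sd_before, n_before, sd_after and n_after are made up for illustration. Note that base sd() returns NA for a single value, whereas the dplyr version in the question printed NaN.

```r
# Glucose values for the group time == 1 (taken from the example data)
glucose <- c(90, 100, 80)

# Statistics computed on the raw values
sd_before <- sd(glucose)
n_before  <- sum(!is.na(glucose))

# Overwrite glucose with its mean, as an output column named
# "glucose" does for every later expression in the same call
glucose <- mean(glucose)

# The same statistics now see a single number
sd_after <- sd(glucose)
n_after  <- sum(!is.na(glucose))

print(c(sd_before = sd_before, n_before = n_before))
print(c(sd_after = sd_after, n_after = n_after))
```

The first pair matches the correct row for time 1 in the question (sd 10, n 3); the second pair shows the undefined sd and the count of 1.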

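This sequential evaluation is intentional: it lets a later expression build on an earlier one. For example, with mutate():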

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
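
For contrast, base transform() evaluates all of its arguments against the original data frame, so a new column cannot be used within the same call. A minimal sketch, assuming no object called glucose_sq already exists in the calling environment; the names small, one_call, step1 and step2 are hypothetical:

```r
# A small data frame using three glucose values from the example data
small <- data.frame(glucose = c(90, 150, 200))

# One call: glucose_sq does not exist yet when the second
# expression is evaluated, so transform() raises an error
one_call <- tryCatch(
  transform(small,
            glucose_sq = glucose^2,
            glucose_sq_plus2 = glucose_sq + 2),
  error = function(e) conditionMessage(e)
)
print(one_call)

# Two calls: the second one can see the column added by the first
step1 <- transform(small, glucose_sq = glucose^2)
step2 <- transform(step1, glucose_sq_plus2 = glucose_sq + 2)
print(step2)
```

With mutate() the one-call form works, which is the convenience that sequential evaluation buys, at the cost of the overwrite seen in the question.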

Source: https://habr.com/ru/post/1628371/

