Variable results with dplyr summarise, depending on output variable naming

Question


I am using the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summaries of grouped data (summarise), but I get inconsistent results (NaN for "sd" and an incorrect count for "n"). Changing the name of the output variable has varying effects (examples below).

Summary of results:

The plyr package is not loaded; I know it can cause problems with dplyr if it is loaded first.
The same results are obtained with or without NA data (not shown).
The problem can be avoided by using camelCase variable naming (not shown) or by using output variable names that contain no non-alphanumeric separator.
Acceptable results are also obtained with "." or "_" in the output column names, as long as no output column reuses the name of an existing column (see the examples below).

Question: Although this problem can be worked around, am I breaking some basic rule of variable naming, or is there a bug in the package that needs to be fixed? I have seen other issues about variable summarise behaviour, but nothing quite like this.

Thanks, Matt

Example data:

library(dplyr)
df<-data_frame(id=c(1,1,1,2,2,2,3,3,3),
       time=rep(1:3, 3),
       glucose=c(90,150, 200,
                 100,150,200,
                 80,100,150))

Example: sd gives NaN and n is incorrect

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

I wondered whether the problem was the "." in the name, or an output name that matches a column already in the data frame. Removing the existing df column names from the output fixes this:

df %>% group_by(time) %>%
  summarise(avg=mean(glucose, na.rm=TRUE),
        stdv=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3

Removing the glucose output column also resolves the problem, even if glucose.sd is kept. Example: after removing glucose, the result is OK

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3

Example: using "glucose.mean" as the output name, the result is OK

df %>% group_by(time) %>%
  summarise(glucose.mean=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))

   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

Example: replacing "." with "_" (glucose and glucose_sd) still gives NaN and an incorrect n

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

Example: using "glucose_mean" with "_" as the separator, the result is OK

df %>% group_by(time) %>%
  summarise(glucose_mean=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3


r dplyr

Matt L. 11 . '16 20:11


MrFlick · Accepted Answer · 2016-02-11T20:33:07+0000

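With summarize, each value you create is immediately available to the later expressions in the same call (unlike, say, base transform()). Take your first example: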

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

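Here, glucose=mean(glucose, na.rm=TRUE) replaces glucose with the group mean, so glucose.sd=sd(glucose, na.rm=TRUE) calls sd() on that single summarised value rather than on the original observations. That is why you get NaN for the standard deviation and a count of 1. To avoid this, put the expression that overwrites glucose last: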

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))
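
If you would rather not depend on where the overwriting expression sits, one alternative is to summarise under names that do not clash with existing columns and rename afterwards. A minimal sketch, assuming the question's data rebuilt with base data.frame(); the intermediate name glucose_mean and the object name res are just illustrative choices:

```r
library(dplyr)

# Same data as in the question, built with base data.frame()
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 time = rep(1:3, 3),
                 glucose = c(90, 150, 200,
                             100, 150, 200,
                             80, 100, 150))

res <- df %>%
  group_by(time) %>%
  # No output name matches an input column, so every expression
  # sees the original glucose values regardless of order
  summarise(glucose_mean = mean(glucose, na.rm = TRUE),
            glucose_sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose))) %>%
  # Rename only after all summaries have been computed
  rename(glucose = glucose_mean)
```

This keeps the mean in the first summary column while sd() and the count still see all three observations per group.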

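So the order of the expressions matters whenever a new value reuses an existing name. The same sequential behaviour applies to mutate(), where it is often useful, because a later expression can build on a column created earlier in the same call: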

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
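
For comparison with base transform(), here is a small sketch. The toy data frame toy and the column names x_sq and x_sq_plus2 are made up for illustration. mutate() lets the second expression see x_sq, whereas transform() evaluates every expression against the original columns, so the same call fails (assuming no object called x_sq exists in your workspace):

```r
library(dplyr)

toy <- data.frame(x = c(1, 2, 3))

# mutate(): x_sq is created first and is visible to the next expression
m <- mutate(toy, x_sq = x^2, x_sq_plus2 = x_sq + 2)

# transform(): x_sq does not exist yet when x_sq_plus2 is evaluated,
# so this raises an error, which is captured here as a message string
tr <- tryCatch(
  transform(toy, x_sq = x^2, x_sq_plus2 = x_sq + 2),
  error = function(e) conditionMessage(e)
)
```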
