I am using the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summaries of grouped data (summarise), but I am getting inconsistent results (NaN for "sd" and an incorrect count for "n"). Changing the name of the output column has variable effects (examples below).
Summary of findings so far:
- The plyr package is not loaded. I know it can cause problems with dplyr if it is loaded first.
- The same results are obtained with or without NA data (not shown).
- The problem can be resolved by using camelCase output names (not shown), or by using output names that contain no non-alphanumeric separator.
- Acceptable results are also obtained with "." or "_" in the output column names (see the glucose.mean and glucose_mean examples below).
Question: Although this problem can be worked around, am I violating a basic rule of variable naming that I am missing, or is there a bug in the package that needs to be fixed? I have seen other issues about variable behavior in summarise, but none quite like this.
Thanks, Matt
Example data:
library(dplyr)
df <- data_frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 time = rep(1:3, 3),
                 glucose = c(90, 150, 200,
                             100, 150, 200,
                             80, 100, 150))
Example: sd gives NaN and n is incorrect
df %>% group_by(time) %>%
  summarise(glucose = mean(glucose, na.rm = TRUE),
            glucose.sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1
I wondered whether the problem was the "." in the output name, or an output name that is the same as a column already in the data frame. Removing the existing df column names from the output fixes it:
df %>% group_by(time) %>%
  summarise(avg = mean(glucose, na.rm = TRUE),
            stdv = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3
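The second hypothesis (an output name that matches an existing column) can be probed without dplyr. The snippet below is only a sketch: it assumes that summarise evaluates its arguments in order and that later expressions see the result of earlier ones, which I have not confirmed for this dplyr version. It mimics that behavior in base R for the time == 1 group.

```r
# Glucose values for the time == 1 group in the example data
glucose <- c(90, 100, 80)

# Assumption: the first summary overwrites `glucose` before the
# later expressions are evaluated
glucose <- mean(glucose, na.rm = TRUE)    # now a single value
glucose_sd <- sd(glucose, na.rm = TRUE)   # sd of one value is missing
n <- sum(!is.na(glucose))                 # counts that single value

print(c(mean = glucose, sd = glucose_sd, n = n))
```

If that assumption holds, it would account for both symptoms: a missing sd and n = 1 in every group.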
Removing the glucose output column fixes it, even if glucose.sd is kept. Example: with glucose removed, the result is OK
df %>% group_by(time) %>%
  summarise(glucose.sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3
Example: with "glucose.mean" instead of "glucose" as the output name, the result is OK
df %>% group_by(time) %>%
  summarise(glucose.mean = mean(glucose, na.rm = TRUE),
            glucose.sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3
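Example: replacing "." with "_" (glucose_sd) while keeping glucose as an output name still gives NaN and n = 1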
df %>% group_by(time) %>%
  summarise(glucose = mean(glucose, na.rm = TRUE),
            glucose_sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1
Example: with "_" in both output names (glucose_mean, glucose_sd), the result is OK
df %>% group_by(time) %>%
  summarise(glucose_mean = mean(glucose, na.rm = TRUE),
            glucose_sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)))
   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3
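One further check that might separate the two explanations (separator character versus reusing the name glucose) is to keep glucose as an output name but compute it last. This is a sketch, not a confirmed fix: the name res is only for illustration, and the data frame is rebuilt with base data.frame so the snippet runs on its own. If the order of the arguments is what matters, sd and n should come out correct here.

```r
library(dplyr)

# Same example data, rebuilt with base data.frame
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 time = rep(1:3, 3),
                 glucose = c(90, 150, 200,
                             100, 150, 200,
                             80, 100, 150))

# `glucose` is still reused as an output name, but it is computed last,
# so sd() and the count are evaluated on the original column
res <- df %>%
  group_by(time) %>%
  summarise(glucose_sd = sd(glucose, na.rm = TRUE),
            n = sum(!is.na(glucose)),
            glucose = mean(glucose, na.rm = TRUE))

print(res)
```

The only difference from the failing glucose / glucose_sd example is the position of the glucose argument, so any difference in the result would point at argument order rather than at "." or "_".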