Customization
I have a dataset consisting of 3.5e6 1s, 7.5e6 0s, and 4.4e6 NAs. When I call summary() on it, I get a mean and a maximum that are wrong (they disagree with mean() and max()).
> summary(data, digits = 10)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0       0       1       1       1       1 4365239
When mean() is called separately, it returns a reasonable value:
> mean(data, na.rm = TRUE)
[1] 0.6804823
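As a cross-check, the same statistics can be computed directly and read one at a time, so that nothing depends on how summary() prints its result. This is only a minimal sketch: raw_summary() is a hypothetical helper name, not something from base R or from the session above, and the tiny vector x stands in for the real data.

```r
# Hypothetical helper (an assumption, not part of the original session):
# collect the statistics summary() reports, without its printing step.
raw_summary <- function(x) {
  c(quantile(x, na.rm = TRUE),       # 0%, 25%, 50%, 75%, 100%
    Mean   = mean(x, na.rm = TRUE),  # mean of the non-missing values
    `NA's` = sum(is.na(x)))          # number of missing values
}

# Small stand-in for the real data: one 0, two 1s, three NAs.
x <- c(0, 1, 1, NA, NA, NA)
stats <- raw_summary(x)

# Read each value individually rather than printing the whole vector.
stats[["Mean"]]
stats[["100%"]]
stats[["NA's"]]
```

Reading the elements individually keeps the large NA count from influencing how the small values are displayed.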
Characterization of the problem
This appears to happen consistently for any vector containing more than 3162277 NA values.
Just under the cutoff:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0     0.0     0.5     0.5     1.0     1.0 3162277
And with one more NA:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0       0       0       0       1       1 3162278
It doesn't seem to matter how many non-missing values there are:
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0     0.2     0.5     0.5     0.8     1.0 3162277
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0       0       0       0       1       1 3162278
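One arithmetic observation about the cutoff: the two NA counts used above sit on either side of 10^6.5. The sketch below only checks that arithmetic. The names below and above are introduced here for illustration, and reading this as a hint that some rounding step depends on the base-10 logarithm of the largest value in the summary is an assumption, not something confirmed above.

```r
# The two NA counts from the examples above.
below <- 3162277
above <- 3162278

10^6.5                # about 3162277.66, between the two counts
round(log10(below))   # 6
round(log10(above))   # 7
```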
Research
- While looking for an answer, I came across a known rounding issue, but it does not explain this behavior (see the first code snippet, which sets digits = 10).