R summary() gives incorrect values when there are too many NAs

Setup

I have a dataset consisting of 3.5e6 1s, 7.5e6 0s, and 4.4e6 NAs. When I call summary() on it, I get a mean and a maximum that are wrong (they disagree with mean() and max()).

> summary(data, digits = 10)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0       0       1       1       1       1 4365239 

When mean() is called separately, it returns a reasonable value:

> mean(data, na.rm = T)
[1] 0.6804823

Characterizing the problem

This seems to be a general problem for any vector with more than 3162277 NA values in it.

Just under the cutoff:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
    0.0     0.0     0.5     0.5     1.0     1.0 3162277 

And just over it:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
      0       0       0       0       1       1 3162278 

It doesn't seem to matter how many non-missing values there are:

> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
    0.0     0.2     0.5     0.5     0.8     1.0 3162277 
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
      0       0       0       0       1       1 3162278 

Research

  • While searching for an answer, I came across the known rounding behavior of summary(), but it does not explain this result (see the first code snippet, which sets digits = 10).


Reproducing the issue:

x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
#    0       0       0       0       1       1 4400000

mean(x, na.rm=TRUE)
#[1] 0.3181818

The cause is zapsmall(), which is applied when the summary is printed and in effect rounds the values like this:

c(out)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA 
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06

round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
#    0       0       0       0       1       1 4400000 

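The switch between 3162277 and 3162278 NAs happens because the number of digits passed to round() is itself rounded to 0 or 1, depending on which side of 0.5 it falls: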

dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876

dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896
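The two values above sit on either side of 0.5, which is where the number of decimal places flips from 1 to 0. A minimal sketch of that arithmetic follows. It assumes the default getOption("digits") of 7, and it assumes that round() treats a fractional digits argument as if it were rounded to the nearest integer, which is consistent with the output shown in this answer. decimals_kept is a hypothetical helper written for illustration, not part of R.

```r
# Assumption: the default value of getOption("digits"), hard-coded so the
# sketch does not depend on the options of the current session.
digits <- 7

# Hypothetical helper: the digits value handed to round() when the largest
# element of the summary (here the NA count) is n_na.
decimals_kept <- function(n_na, digits = 7) {
  max(0, digits - log10(n_na))
}

# The cutoff is where digits - log10(n_na) equals 0.5,
# i.e. n_na = 10^(digits - 0.5).
cutoff <- 10^(digits - 0.5)
print(cutoff)

print(decimals_kept(3162277))  # slightly above 0.5
print(decimals_kept(3162278))  # slightly below 0.5

# The mean from the example, rounded with each of the two values.
print(round(0.3181818, decimals_kept(3162277)))
print(round(0.3181818, decimals_kept(3162278)))
```

With a different digits setting the cutoff moves accordingly, so 3162277 is specific to the default of 7.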

out[7] <- 3162277
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
#    0.0     0.0     0.0     0.3     1.0     1.0 3162277 

out[7] <- 3162278
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA 
#      0       0       0       0       1       1 3162278
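Since only the printed values are affected, one possible workaround is to summarise the non-missing values and count the NAs separately, so that the large count never shares a rounding step with the statistics. This is a sketch under that assumption, not something proposed in the answer above. The vector is a scaled-down stand-in for the example data, and the names observed, stats and n_missing are illustrative.

```r
# Scaled-down stand-in for the example data (same 3.5 : 7.5 : 4.4 ratio).
x <- rep(c(1, 0, NA), c(35, 75, 44))

# Summarise only the observed values, so no NA count is attached to the result.
observed <- x[!is.na(x)]
stats <- summary(observed)

# Count the missing values on their own.
n_missing <- sum(is.na(x))

print(stats)
print(n_missing)

# The mean stored in the summary can be compared with mean() directly.
print(unclass(stats)[["Mean"]])
print(mean(x, na.rm = TRUE))
```

Calling mean(), max() or quantile() directly, as in the question, avoids the display issue in the same way.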

Source: https://habr.com/ru/post/1666566/

