R: Extreme grouping of random values ​​from runif with Mersenne-Twister seed

We encounter a strange situation in our code when using R runif and setting the seed with set.seed using kind = NULL (which allows, if I am mistaken, to kind = "default" , by default it is "Mersenne-Twister" ).

We set the seed using (8 digits) the unique identifiers generated by the upstream system before calling runif :

 seeds = c( "86548915", "86551615", "86566163", "86577411", "86584144", "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", "86876466", "86901193", "86987847", "86988080") random_values = sapply(seeds, function(x) { set.seed(x) y = runif(1, 17, 26) return(y) }) 

It gives values extremely , grouped together.

 > summary(random_values) Min. 1st Qu. Median Mean 3rd Qu. Max. 25.13 25.36 25.66 25.58 25.83 25.94 

This runif behavior disappears when we use kind = "Knuth-TAOCP-2002" and get values ​​that look much more evenly distributed.

 random_values = sapply(seeds, function(x) { set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17, 26) return(y) }) 

Exit omitted.


The most interesting thing is that this does not happen on Windows - it only happens on Ubuntu output ( sessionInfo for Ubuntu and Windows below).

Windows output:

 > seeds = c( + "86548915", "86551615", "86566163", "86577411", "86584144", + "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", + "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", + "86876466", "86901193", "86987847", "86988080") > > random_values = sapply(seeds, function(x) { + set.seed(x) + y = runif(1, 17, 26) + return(y) + }) > > summary(random_values) Min. 1st Qu. Median Mean 3rd Qu. Max. 17.32 20.14 23.00 22.17 24.07 25.90 

Can someone help to understand what is happening?

Ubuntu

 R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] RMySQL_0.10.8 DBI_0.6-1 [3] jsonlite_1.4 tidyjson_0.2.2 [5] optiRum_0.37.3 lubridate_1.6.0 [7] httr_1.2.1 gdata_2.18.0 [9] XLConnect_0.2-12 XLConnectJars_0.2-12 [11] data.table_1.10.4 stringr_1.2.0 [13] readxl_1.0.0 xlsx_0.5.7 [15] xlsxjars_0.6.1 rJava_0.9-8 [17] sqldf_0.4-10 RSQLite_1.1-2 [19] gsubfn_0.6-6 proto_1.0.0 [21] dplyr_0.5.0 purrr_0.2.4 [23] readr_1.1.1 tidyr_0.6.3 [25] tibble_1.3.0 tidyverse_1.1.1 [27] rBayesianOptimization_1.1.0 xgboost_0.6-4 [29] MLmetrics_1.1.1 caret_6.0-76 [31] ROCR_1.0-7 gplots_3.0.1 [33] effects_3.1-2 pROC_1.10.0 [35] pscl_1.4.9 lattice_0.20-35 [37] MASS_7.3-47 ggplot2_2.2.1 loaded via a namespace (and not attached): [1] splines_3.4.0 foreach_1.4.3 AUC_0.3.0 modelr_0.1.0 [5] gtools_3.5.0 assertthat_0.2.0 stats4_3.4.0 cellranger_1.1.0 [9] quantreg_5.33 chron_2.3-50 digest_0.6.10 rvest_0.3.2 [13] minqa_1.2.4 colorspace_1.3-2 Matrix_1.2-10 plyr_1.8.4 [17] psych_1.7.3.21 XML_3.98-1.7 broom_0.4.2 SparseM_1.77 [21] haven_1.0.0 scales_0.4.1 lme4_1.1-13 MatrixModels_0.4-1 [25] mgcv_1.8-17 car_2.1-5 nnet_7.3-12 lazyeval_0.2.0 [29] pbkrtest_0.4-7 mnormt_1.5-5 magrittr_1.5 memoise_1.0.0 [33] nlme_3.1-131 forcats_0.2.0 xml2_1.1.1 foreign_0.8-69 [37] tools_3.4.0 hms_0.3 munsell_0.4.3 compiler_3.4.0 [41] caTools_1.17.1 rlang_0.1.1 grid_3.4.0 nloptr_1.0.4 [45] iterators_1.0.8 bitops_1.0-6 tcltk_3.4.0 gtable_0.2.0 [49] ModelMetrics_1.1.0 codetools_0.2-15 reshape2_1.4.2 R6_2.2.0 [53] knitr_1.15.1 KernSmooth_2.23-15 stringi_1.1.5 Rcpp_0.12.11 

Window

 > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252 [4] LC_NUMERIC=C LC_TIME=English_India.1252 attached base packages: [1] graphics grDevices utils datasets grid stats methods base other attached packages: [1] bindrcpp_0.2 h2o_3.14.0.3 ggrepel_0.6.5 eulerr_1.1.0 VennDiagram_1.6.17 [6] futile.logger_1.4.3 scales_0.4.1 FinCal_0.6.3 xml2_1.0.0 httr_1.3.0 [11] wesanderson_0.3.2 wordcloud_2.5 RColorBrewer_1.1-2 htmltools_0.3.6 urltools_1.6.0 [16] timevis_0.4 dtplyr_0.0.1 magrittr_1.5 shiny_1.0.5 RODBC_1.3-14 [21] zoo_1.8-0 sqldf_0.4-10 RSQLite_1.1-2 gsubfn_0.6-6 proto_1.0.0 [26] gdata_2.17.0 stringr_1.2.0 XLConnect_0.2-12 XLConnectJars_0.2-12 data.table_1.10.4 [31] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 readxl_0.1.1 googlesheets_0.2.1 [36] jsonlite_1.5 tidyjson_0.2.1 RMySQL_0.10.9 RPostgreSQL_0.4-1 DBI_0.5-1 [41] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1 tidyr_0.7.0 tibble_1.3.3 [46] ggplot2_2.2.0 tidyverse_1.0.0 lubridate_1.6.0 loaded via a namespace (and not attached): [1] gtools_3.5.0 assertthat_0.2.0 triebeard_0.3.0 cellranger_1.1.0 yaml_2.1.14 [6] slam_0.1-40 lattice_0.20-34 glue_1.1.1 chron_2.3-48 digest_0.6.12.1 [11] colorspace_1.3-1 httpuv_1.3.5 plyr_1.8.4 pkgconfig_2.0.1 xtable_1.8-2 [16] lazyeval_0.2.0 mime_0.5 memoise_1.0.0 tools_3.3.2 hms_0.3 [21] munsell_0.4.3 lambda.r_1.1.9 rlang_0.1.1 RCurl_1.95-4.8 labeling_0.3 [26] bitops_1.0-6 tcltk_3.3.2 gtable_0.2.0 reshape2_1.4.2 R6_2.2.0 [31] bindr_0.1 futile.options_1.0.0 stringi_1.1.2 Rcpp_0.12.12.1 
+5
source share
3 answers

NOTE. . This answer summarizes the elements of the discussion on this issue that took place on the R-devel mailing list. I was just trying to capture and summarize the ideas originally formulated there.

Despite your assurances that these numbers are not a specially designed edge case, they have everything there is. Here is the source sequence plus code to check the distribution of the resulting values:

 seeds = c( 86548915, 86551615, 86566163, 86577411, 86584144, 86584272, 86620568, 86724613, 86756002, 86768593, 86772411, 86781516, 86794389, 86805854, 86814600, 86835092, 86874179, 86876466, 86901193, 86987847, 86988080) checkit <- function(seeds) { sapply(seeds, function(x) { set.seed(x) y = runif(1, 17, 26) return(y) })} 

The original sequence shows a surprisingly small change, as noted:

  summary(checkit(seeds+0)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ##25.13 25.36 25.66 25.58 25.83 25.94 

It seems that there is something special in the original sequence, since minimal modifications to it do not lead to the same unexpected result:

 summary(checkit(seeds+1)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 17.18 19.65 22.75 22.02 24.37 25.79 summary(checkit(seeds-1)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ##17.15 18.44 19.92 20.77 22.97 25.95 

Of all the seeds in the range covered by the original sequence, the expected number produces values ​​in the observed range:

 possible.seeds <- min(seeds):max(seeds) s25 <- Filter(function(s){ set.seed(s) x <- runif(1,17,26) x > 25.12 & x < 25.95}, possible.seeds) length(s25)/length(possible.seeds) ##[1] 0.09175801 

Nevertheless, all values ​​in the original sequence are in this subset (of course, we already knew that ...).

 table(seeds %in% s25) ##TRUE ## 21 

Everything indicates the probable possibility that the original sequence was actually (possibly unintentionally) a specially designed edge case.

+2
source

When you use the Mersenne Twister with one seed, it is reasonable to assume that the generated values ​​are approximately independent and equally distributed. Unfortunately, there are no guarantees regarding the generated values ​​from two threads running on different seeds. See, for example, this SC stream .

In your situation, I would recommend either using one of the seed selection strategies proposed in the SC stream, or switching to PRNG with the best guarantees for parallel flows. One option is the L'Ecuyer "RngStreams" generator:

 set.seed(0, kind = "L'Ecuyer-CMRG") 

Even with this PRNG, I don’t know if it continues that you can sow PRNG with arbitrary seeds and get rough independent streams.

Regarding the difference between Ubuntu and Windows, it is possible that one of these systems uses a 32-bit generator and the other uses a 64-bit version.

+1
source

As additional evidence that your sequence is edge-based, you can focus on a range of supposedly random values. 17 and 26 are a little distracting. Repeating your experiment uniformly at 0 and 1 generates something just as unlikely:

 f <- function(x) { set.seed(x) runif(1) } check_range <-function(seeds){ vals <- sapply(seeds,f) max(vals)-min(vals) } 

When working with seeds:

 > check_range(seeds) [1] 0.09026112 

A reasonable model for check_range(seeds) when running on 21 random seeds is that it is a sample of a random sample of size 21 drawn by U(0,1) . Its theoretical density is determined as follows:

 f <- function(x){420*x^19*(1-x)} 

We can use this to calculate the probability of observing a range of 0.09 or less:

 > integrate(f,0,0.09) 2.334272e-20 with absolute error < 2.6e-34 

As a check that it is reasonable to simulate a sampling range when sowing a Mersenne Twister, you can do the following experiment:

 ranges <- replicate(1000,check_range(sample(8548915:86988080,21))) x <- seq(0,1,0.01) y <- f(x) hist(ranges,freq = FALSE,xlim =c(0,1)) points(x,y,type = "l") abline(v=0.09) 

Output:

enter image description here

The density histogram follows well the theoretical density. A set of 21 seeds from your question is an extreme outlier. This is very unlikely due to chance, and it is unlikely to be due to some manifestation in the Mersenne Twister. The most likely explanation is that Mersenne Twister himself was involved in creating these 21 values ​​(but certainly not naive, just using sample() to draw 21 values).

+1
source

Source: https://habr.com/ru/post/1273084/


All Articles