How can I make a Monte Carlo simulation loop more efficient in R?

The goal of this exercise is to create a population distribution of nutrient intake values. Repeated measures in the original data have already been removed, so each row in the data frame is a unique person.

I have code that works well when testing on a small number of rows, but for all 7135 rows it is very slow. I tried timing it with system.time, but killed it after the elapsed time on my machine passed 15 hours; the output was Timing stopped at: 55625.08 2985.39 58673.87.

I would appreciate any suggestions for speeding up the simulation:

    Male.MC <- c()
    for (j in 1:100) {
      for (i in 1:nrow(Male.Distrib)) {
        u2 <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
        mc_bca <- Male.Distrib$FixedEff[i] + u2
        temp <- Lambda.Value * mc_bca + 1
        ginv_a <- temp^(1/Lambda.Value)
        d2ginv_a <- max(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))
        mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
        z <- data.frame(
          RespondentID = Male.Distrib$RespondentID[i],
          Subgroup     = Male.Distrib$Subgroup[i],
          mc_amount    = mc_amount,
          IndvWeight   = Male.Distrib$INDWTS[i] / 100
        )
        Male.MC <- as.data.frame(rbind(Male.MC, z))
      }
    }

For each of the 7135 observations in my dataset, 100 simulated nutrient values are created and then back-transformed to the original measurement scale (the simulation uses the results of a nonlinear mixed-effects model fitted to Box-Cox-transformed nutrient values).

I would prefer to avoid for loops, since I have read that they are inefficient in R, but I don't understand the apply-based options well enough to use them as an alternative. R runs on standalone machines, usually a standard Dell desktop running Windows 7, if that affects any recommendations for changing the code.
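[Editor's note] For reference, here is a minimal, hypothetical sketch of the general pattern: a per-row loop drawing one rnorm() value at a time can usually be replaced by one vectorized pass that draws all the values at once. The column names follow the question; the three rows of data are invented.

```r
Lambda.Value <- 0.4
df <- data.frame(FixedEff = c(6.1, 6.8, 7.1), stddev_u2 = 2.645938)

# Vectorized: draw every random effect in a single call, then do
# column arithmetic -- no per-row work at all.
set.seed(1)
z      <- rnorm(nrow(df))
mc_bca <- df$FixedEff + df$stddev_u2 * z
temp   <- Lambda.Value * mc_bca + 1
ginv_a <- temp^(1 / Lambda.Value)

# Equivalent loop form, re-seeded so it consumes the same random stream:
set.seed(1)
loop_ginv <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  u2 <- df$stddev_u2[i] * rnorm(1)
  loop_ginv[i] <- (Lambda.Value * (df$FixedEff[i] + u2) + 1)^(1 / Lambda.Value)
}
```

Both forms give identical results here; the vectorized one simply avoids 7135 separate calls per replicate.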

Update: to reproduce this for testing, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778, and Male.Distrib$stddev_u2 is a constant value across all observations.

str(Male.Distrib)

    'data.frame': 7135 obs. of 14 variables:
     $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
     $ RespondentID: num  9966 9967 9970 9972 9974 ...
     $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
     $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
     $ Replicates  : num  41067 2322 17434 21723 375 ...
     $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
     $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
     $ INDWTS      : num  41067 2322 17434 21723 375 ...
     $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
     $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
     $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
     $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
     $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
     $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...

head(Male.Distrib) gives:

        RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS FixedEff stddev_u2
    1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135 6.089918  2.645938
    2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135 6.755664  2.645938
    3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135 7.079757  2.645938
    4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135 6.089918  2.645938
    5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135 6.176319  2.645938
    6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135 6.089918  2.645938

Update 2: this line of the function is producing NaN results:

    d2ginv_a <- max(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))
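[Editor's note] The NaN arises because a negative base raised to a non-integer power is undefined over the reals, and R returns NaN rather than raising an error; with Lambda.Value = 0.4, the exponent 1/Lambda.Value - 2 equals 0.5. A quick illustration (the value -2.103819 is taken from the update further down):

```r
Lambda.Value <- 0.4
temp <- -2.103819                   # a negative temp value from the data

temp^(1 / Lambda.Value - 2)         # (-2.103819)^0.5 -> NaN
(-8)^(1/3)                          # even a "nice" cube root is NaN in R
abs(temp)^(1 / Lambda.Value - 2)    # defined once the base is non-negative
```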

Thanks to everyone for their help and comments, and for the speed of the responses.

Update: @Ben Bolker is correct that negative temp values are causing the NaN problem. I missed this in my testing (after commenting out the rest of the function so that only the temp values were returned, and calling the resulting data frame Test). This code reproduces the NaN problem:

    > min(Test)
    [1] -2.103819
    > min(Test)^(1/Lambda.Value)
    [1] NaN

But entering the value directly and doing the (apparently) same calculation gave me a result, which is why I missed this when doing manual checks:

    > -2.103819^(1/Lambda.Value)
    [1] -6.419792
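[Editor's note] The manual check was misleading because unary minus has lower precedence than ^ in R, so -2.103819^(1/Lambda.Value) is parsed as -(2.103819^(1/Lambda.Value)), a finite number. Parenthesising the negative base reproduces the NaN:

```r
Lambda.Value <- 0.4

-2.103819^(1 / Lambda.Value)    # parsed as -(2.103819^2.5): about -6.419792
(-2.103819)^(1 / Lambda.Value)  # a genuinely negative base: NaN
```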

Now I have working code that (I think) uses vectorization, and it is blazingly fast. In case anyone else runs into this problem, I am posting the working code below. I had to add a minimum to prevent the <0 problem in the calculation. Thanks to everyone who helped, and to coffee. I tried putting the rnorm results into a data frame row by row and it was really slow that way; generating them all at once and then using cbind is really fast. Male.Distrib is my full data frame of 7135 observations, but this code should work on the abbreviated version posted earlier (not tested).

    Min_bca <- ((.5 * min(Male.AddSugar$IntakeAmt))^Lambda.Value - 1) / Lambda.Value
    Test <- Male.Distrib[rep(seq.int(1, nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
    RnormOutput <- rnorm(nrow(Test), 0, 1)
    Male.Final <- cbind(Test, RnormOutput)
    Male.Final$mc_bca <- Male.Final$FixedEff + (Male.Final$stddev_u2 * Male.Final$RnormOutput)
    Male.Final$temp <- ifelse(Lambda.Value * Male.Final$mc_bca + 1 > Lambda.Value * Min_bca + 1,
                              Lambda.Value * Male.Final$mc_bca + 1,
                              Lambda.Value * Min_bca + 1)
    Male.Final$ginv_a <- Male.Final$temp^(1/Lambda.Value)
    Male.Final$d2ginv_a <- ifelse(0 > (1 - Lambda.Value) * Male.Final$temp^(1/Lambda.Value - 2),
                                  0,
                                  (1 - Lambda.Value) * Male.Final$temp^(1/Lambda.Value - 2))
    Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2

Lessons of the day:

  • The distribution function does not appear to be re-sampled inside the loop if you set things up the way I originally tried.
  • You cannot use max() the way I tried: it returns the single maximum value of the whole column, whereas I need the maximum of two values element by element. An ifelse() statement is the replacement that does this.
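[Editor's note] As a footnote to the second lesson: base R's pmax() computes exactly this elementwise maximum, so the ifelse() calls above could be written as pmax(0, ...) with the same result; pmax() is simply the idiomatic form. A small illustration:

```r
Lambda.Value <- 0.4
temp <- c(2.8, -0.5, 1.3)

# max() collapses everything to one number; pmax() compares each
# element against 0, clipping only the negatives.
max(0,  (1 - Lambda.Value) * temp)   # single value: 1.68
pmax(0, (1 - Lambda.Value) * temp)   # per-element: 1.68, 0.00, 0.78
```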
1 answer

Here's an approach that addresses the two biggest performance issues:

  • Instead of iterating over observations ( i ), we calculate them all at once.
  • Instead of iterating over the MC replicates ( j ), we use replicate , a convenience wrapper around sapply designed for exactly this purpose.

First, we load the dataset and define a function for the per-observation computation you are doing.

    Male.Distrib <- read.table('MaleDistrib.txt', check.names = FALSE)

    getMC <- function(df, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778) {
      u2 <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
      mc_bca <- df$FixedEff + u2
      temp <- Lambda.Value * mc_bca + 1
      ginv_a <- temp^(1/Lambda.Value)
      # note: as the question's "lessons" point out, max() collapses to a
      # single value here; pmax(0, ...) would give the elementwise maximum
      d2ginv_a <- max(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))
      mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
      mc_amount
    }

Then we replicate it several times.

    > replicate(10, getMC(Male.Distrib))
             [,1]      [,2]     [,3]     [,4]      [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
    [1,] 36.72374 44.491777 55.19637 23.53442 23.260609 49.56022 31.90657 25.26383 25.31197 20.58857
    [2,] 29.56115 18.593496 57.84550 22.01581 22.906528 22.15470 29.38923 51.38825 13.45865 21.47531
    [3,] 61.27075 10.140378 75.64172 28.10286  9.652907 49.25729 23.82104 31.77349 16.24840 78.02267
    [4,] 49.42798 22.326136 33.87446 14.00084 25.107143 25.75241 30.20490 33.14770 62.86563 27.33652
    [5,] 53.45546  9.673162 22.66676 38.76392 30.786100 23.42267 28.40211 35.95015 43.75506 58.83676
    [6,] 34.72440 23.786004 63.57919  8.08238 12.636745 34.11844 14.88339 21.93766 44.53451 51.12331

Then you can reshape, add identifiers, and so on, but this is the core of the computation. Good luck!
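[Editor's note] For the reshaping step, a hedged sketch (column names follow the question; the matrix here is a toy stand-in for the replicate() result): each column of the matrix is one MC replicate, so it can be stacked into a long data frame like this:

```r
# Toy stand-in: 3 respondents x 4 replicates
m   <- matrix(rnorm(12), nrow = 3)
ids <- c(9966, 9967, 9970)            # would be Male.Distrib$RespondentID

long <- data.frame(
  RespondentID = rep(ids, times = ncol(m)),       # ids repeat per replicate
  Replicate    = rep(seq_len(ncol(m)), each = nrow(m)),
  mc_amount    = as.vector(m)                     # columns stacked in order
)
```

The rep() patterns line up with as.vector(m), which stacks the matrix column by column.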


Source: https://habr.com/ru/post/1393028/
