The goal of this exercise is to create a population distribution of nutrient intake values. The repeated measures in the earlier data have already been removed, so each row in the data frame is a unique person.
I have code that works well when tested on a small number of rows, but for the full 7135 rows it is very slow. I tried timing it, but killed the run after the elapsed time on my machine passed 15 hours; system.time reported: Timing stopped at: 55625.08 2985.39 58673.87 (user, system, and elapsed seconds).
I would appreciate any suggestions for speeding up this simulation:
Male.MC <- c()
for (j in 1:100) {
  for (i in 1:nrow(Male.Distrib)) {
    u2 <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca <- Male.Distrib$FixedEff[i] + u2
    temp <- Lambda.Value * mc_bca + 1
    ginv_a <- temp^(1/Lambda.Value)
    d2ginv_a <- max(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    z <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup     = Male.Distrib$Subgroup[i],
      mc_amount    = mc_amount,
      IndvWeight   = Male.Distrib$INDWTS[i] / 100
    )
    Male.MC <- as.data.frame(rbind(Male.MC, z))
  }
}
For each of the 7135 observations in my dataset, 100 simulated nutrient values are created and then back-transformed to the original measurement scale (the simulation uses the results of a nonlinear mixed-effects model fitted to Box-Cox-transformed nutrient values).
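In case it helps to see where the time goes: the repeated rbind() copies the entire accumulated data frame on every iteration, which makes the loop roughly quadratic in the number of output rows. A minimal sketch of the same loop with that growth removed, collecting rows in a pre-allocated list and binding once at the end (same variable names as above):

# Same per-row calculation, but rows are collected in a pre-allocated
# list and bound once at the end, avoiding the quadratic rbind() growth
n   <- nrow(Male.Distrib)
out <- vector("list", 100 * n)
for (j in 1:100) {
  for (i in 1:n) {
    u2       <- Male.Distrib$stddev_u2[i] * rnorm(1)
    mc_bca   <- Male.Distrib$FixedEff[i] + u2
    temp     <- Lambda.Value * mc_bca + 1
    ginv_a   <- temp^(1/Lambda.Value)
    d2ginv_a <- max(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))
    out[[(j - 1) * n + i]] <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup     = Male.Distrib$Subgroup[i],
      mc_amount    = ginv_a + d2ginv_a * Male.Resid.Var / 2,
      IndvWeight   = Male.Distrib$INDWTS[i] / 100
    )
  }
}
Male.MC <- do.call(rbind, out)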
I would prefer not to use for loops, as I have read that they are inefficient in R, but I don't understand the apply-based options well enough to use them as an alternative. R runs on standalone machines, usually a standard Dell desktop running Windows 7, if that affects any recommendations for changing the code.
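For reference, since the per-row arithmetic here is all elementwise, the loops can be replaced outright with vectorized operations rather than an apply() call. A minimal sketch, assuming the column names shown in the str() output below, and leaving aside the negative-temp issue discussed in the later updates:

# Vectorized sketch: replicate each row 100 times, then draw all the
# normal deviates in one rnorm() call instead of one call per iteration
reps     <- Male.Distrib[rep(seq_len(nrow(Male.Distrib)), times = 100), ]
mc_bca   <- reps$FixedEff + reps$stddev_u2 * rnorm(nrow(reps))
temp     <- Lambda.Value * mc_bca + 1   # can go negative; see the NaN updates below
ginv_a   <- temp^(1/Lambda.Value)
d2ginv_a <- pmax(0, (1 - Lambda.Value) * temp^(1/Lambda.Value - 2))  # element-wise max
Male.MC  <- data.frame(
  RespondentID = reps$RespondentID,
  Subgroup     = reps$Subgroup,
  mc_amount    = ginv_a + d2ginv_a * Male.Resid.Var / 2,
  IndvWeight   = reps$INDWTS / 100
)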
Update: to reproduce this for testing, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778, and Male.Distrib$stddev_u2 is a constant value for all observations.
str(Male.Distrib)
'data.frame':   7135 obs. of  14 variables:
 $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
 $ RespondentID: num  9966 9967 9970 9972 9974 ...
 $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
 $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
 $ Replicates  : num  41067 2322 17434 21723 375 ...
 $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
 $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
 $ INDWTS      : num  41067 2322 17434 21723 375 ...
 $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
 $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
 $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
 $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
 $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
 $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...
head(Male.Distrib) gives:
    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135 6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135 6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135 7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135 6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135 6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135 6.089918  2.645938
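For anyone wanting a self-contained test, a minimal stand-in for Male.Distrib can be assembled from the constants and head() values above; this keeps only the columns the simulation actually touches and is for illustration only:

# Hypothetical minimal test data built from the values shown above
Lambda.Value   <- 0.4
Male.Resid.Var <- 12.1029420429778
Male.Distrib <- data.frame(
  RespondentID = c(9966, 9967, 9970, 9972, 9974, 9976),
  Subgroup     = c(6, 5, 4, 6, 3, 6),
  FixedEff     = c(6.089918, 6.755664, 7.079757, 6.089918, 6.176319, 6.089918),
  stddev_u2    = 2.645938,
  INDWTS       = c(41067, 2322, 17434, 21723, 375, 48889)
)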
Update 2: this line of the function is producing the NaN results:
d2ginv_a <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
Thanks to everyone for their help and comments, as well as for the speed of responses.
Update: @Ben Bolker correctly identified that negative temp values are causing the NaN problem. I had missed this in my testing (after commenting out the function so that only the temp values were returned, and calling the resulting data frame Test). This code reproduces the NaN problem:
> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN
But entering the value as a literal and then doing the (apparently) same calculation gives me a result, so I missed this when checking by hand:
> -2.103819^(1/Lambda.Value)
[1] -6.419792
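The discrepancy comes from R's operator precedence: unary minus binds more loosely than ^, so the console version negates after exponentiating a positive number, while min(Test) supplies a genuinely negative base, and a fractional power of a negative number is NaN in R. A quick demonstration:

-2.103819^(1/0.4)    # parsed as -(2.103819^(1/0.4)): -6.419792
(-2.103819)^(1/0.4)  # truly negative base to a fractional power: NaN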
Now I have working code that (I think) uses vectorization, and it is blazingly fast. In case anyone else runs into this problem, I'm posting the working code below. I had to add a minimum to prevent the less-than-zero problem in the calculation. Thanks to everyone who helped, and to coffee. One other finding: generating the rnorm results directly inside the data frame construction really slowed things down, while creating them separately and then attaching them with cbind was really fast. Male.Distrib is my full data frame of 7135 observations, but this code should work on the abbreviated version posted earlier (not tested).
Min_bca <- ((.5 * min(Male.AddSugar$IntakeAmt))^Lambda.Value - 1) / Lambda.Value
Test <- Male.Distrib[rep(seq.int(1, nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
RnormOutput <- rnorm(nrow(Test), 0, 1)
Male.Final <- cbind(Test, RnormOutput)
Male.Final$mc_bca <- Male.Final$FixedEff + (Male.Final$stddev_u2 * Male.Final$RnormOutput)
Male.Final$temp <- ifelse(Lambda.Value * Male.Final$mc_bca + 1 > Lambda.Value * Min_bca + 1,
                          Lambda.Value * Male.Final$mc_bca + 1,
                          Lambda.Value * Min_bca + 1)
Male.Final$ginv_a <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a <- ifelse(0 > (1 - Lambda.Value) * Male.Final$temp^(1/Lambda.Value - 2),
                              0,
                              (1 - Lambda.Value) * Male.Final$temp^(1/Lambda.Value - 2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2
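As a quick sanity check (my addition, not part of the original workflow), the output should contain exactly 100 rows per respondent and no non-finite amounts:

# Optional checks on the simulated output from the block above
stopifnot(nrow(Male.Final) == 100 * nrow(Male.Distrib))
stopifnot(all(is.finite(Male.Final$mc_amount)))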
Lessons of the day:
- the distribution function does not seem to be re-sampled inside the loop if you set it up the way I originally tried
- you cannot use max() the way I tried, since it returns the single maximum of a whole column, whereas I need the element-wise maximum of two values; ifelse() is the replacement that does the job (see the pmax() note below)
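For completeness, base R's pmax() computes exactly that element-wise maximum, so the two ifelse() lines in the working code above could equivalently be written as:

# pmax() is the vectorized, element-wise counterpart of max()
Male.Final$temp     <- pmax(Lambda.Value * Male.Final$mc_bca + 1,
                            Lambda.Value * Min_bca + 1)
Male.Final$d2ginv_a <- pmax(0, (1 - Lambda.Value) * Male.Final$temp^(1/Lambda.Value - 2))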