Generate random lognormal distributions using the form of observed data

Question

Generate random lognormal distributions using the form of observed data

I am trying to correlate some data with a lognormal distribution and from this generate a random lognormal distribution using optimized parameters. After some searching, I found several solutions, but not convincing:

solution1 using the fit function:

import numpy as np from scipy.stats import lognorm mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354] shape, loc, scale = lognorm.fit(mydata) rnd_log = lognorm.rvs (shape, loc=loc, scale=scale, size=100)

or solution 2 using mu and sigma from the source data:

 import numpy as np from scipy.stats import lognorm mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354] mu = np.mean([np.log(i) for i in mydata]) sigma = np.std([np.log(i) for i in mydata]) distr = lognorm(mu, sigma) rnd_log = distr.rvs (size=100)

None of these solutions work well:

 import pylab pylab.plot(sorted(mydata, reverse=True), 'ro') pylab.plot(sorted(rnd_log, reverse=True), 'bx')

I'm not sure if I understand correctly how to use distributions, or if I am missing something else ...

I found a solution here: Does anyone have sample code using scipy.stats.distributions? but I can’t get the form from my data ... Am I missing something in using the fit function?

thanks

EDIT:

this is an example to better understand my problem:

 print 'solution 1:' means = [] stdes = [] distr = lognorm(mu, sigma) for _ in xrange(1000): rnd_log = distr.rvs (size=100) means.append (np.mean([np.log(i) for i in rnd_log])) stdes.append (np.std ([np.log(i) for i in rnd_log])) print 'observed mean:',mu , 'mean simulated mean:', np.mean (means) print 'observed std :',sigma, 'mean simulated std :', np.mean (stdes) print '\nsolution 2:' means = [] stdes = [] shape, loc, scale = lognorm.fit(mydata) for _ in xrange(1000): rnd_log = lognorm.rvs (shape, loc=loc, scale=scale, size=100) means.append (np.mean([np.log(i) for i in rnd_log])) stdes.append (np.std ([np.log(i) for i in rnd_log])) print 'observed mean:',mu , 'mean simulated mean:', np.mean (means) print 'observed std :',sigma, 'mean simulated std :', np.mean (stdes)

result:

 solution 1: observed mean: 1.82562655734 mean simulated mean: 1.18929982267 observed std : 1.39003773799 mean simulated std : 0.88985924363 solution 2: observed mean: 1.82562655734 mean simulated mean: 4.50608084668 observed std : 1.39003773799 mean simulated std : 5.44206119499

, whereas if I do the same in R:

 mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354) meanlog <- mean(log(mydata)) sdlog <- sd(log(mydata)) means <- c() stdes <- c() for (i in 1:1000){ rnd.log <- rlnorm(length(mydata), meanlog, sdlog) means <- c(means, mean(log(rnd.log))) stdes <- c(stdes, sd(log(rnd.log))) } print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' ')) print (paste('observed std :',sdlog ,'mean simulated std :',mean(stdes),sep=' '))

I get:

 [1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317" [1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"

which is much closer, so I think I'm doing something wrong when using scipy ...

+4

python scipy

fransua Jan 2 '12 at 18:14

source share

1 answer

ars · Accepted Answer · 2012-01-03T20:28:21+0000

The lognormal distribution in scipy is parameterized a little differently than usual. See scipy.stats.lognorm documents, especially the "Notes" section.

Here's how to get the expected results (note that during installation we keep position 0):

 In [315]: from scipy import stats In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]) In [317]: mu, sigma = stats.norm.fit(np.log(x)) In [318]: mu, sigma Out[318]: (1.8256265573350701, 1.3900377379913127) In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0) In [320]: np.log(scale), shape Out[320]: (1.8256267737298788, 1.3900309739954713)

Now you can generate samples and confirm your expectations:

 In [321]: dist = stats.lognorm(shape, loc, scale) In [322]: means, sds = [], [] In [323]: for i in xrange(1000): .....: sample = dist.rvs(size=100) .....: logsample = np.log(sample) .....: means.append(logsample.mean()) .....: sds.append(logsample.std()) .....: In [324]: np.mean(means), np.mean(sds) Out[324]: (1.8231068508345041, 1.3816361818739145)

Generate random lognormal distributions using the form of observed data

More articles: