I am trying to correlate some data with a lognormal distribution and from this generate a random lognormal distribution using optimized parameters. After some searching, I found several solutions, but not convincing:
solution1 using the fit function:
import numpy as np from scipy.stats import lognorm mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354] shape, loc, scale = lognorm.fit(mydata) rnd_log = lognorm.rvs (shape, loc=loc, scale=scale, size=100)
or solution 2 using mu and sigma from the source data:
import numpy as np from scipy.stats import lognorm mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354] mu = np.mean([np.log(i) for i in mydata]) sigma = np.std([np.log(i) for i in mydata]) distr = lognorm(mu, sigma) rnd_log = distr.rvs (size=100)
None of these solutions work well:
import pylab pylab.plot(sorted(mydata, reverse=True), 'ro') pylab.plot(sorted(rnd_log, reverse=True), 'bx')
I'm not sure if I understand correctly how to use distributions, or if I am missing something else ...
I found a solution here: Does anyone have sample code using scipy.stats.distributions? but I canβt get the form from my data ... Am I missing something in using the fit function?
thanks
EDIT:
this is an example to better understand my problem:
print 'solution 1:' means = [] stdes = [] distr = lognorm(mu, sigma) for _ in xrange(1000): rnd_log = distr.rvs (size=100) means.append (np.mean([np.log(i) for i in rnd_log])) stdes.append (np.std ([np.log(i) for i in rnd_log])) print 'observed mean:',mu , 'mean simulated mean:', np.mean (means) print 'observed std :',sigma, 'mean simulated std :', np.mean (stdes) print '\nsolution 2:' means = [] stdes = [] shape, loc, scale = lognorm.fit(mydata) for _ in xrange(1000): rnd_log = lognorm.rvs (shape, loc=loc, scale=scale, size=100) means.append (np.mean([np.log(i) for i in rnd_log])) stdes.append (np.std ([np.log(i) for i in rnd_log])) print 'observed mean:',mu , 'mean simulated mean:', np.mean (means) print 'observed std :',sigma, 'mean simulated std :', np.mean (stdes)
result:
solution 1: observed mean: 1.82562655734 mean simulated mean: 1.18929982267 observed std : 1.39003773799 mean simulated std : 0.88985924363 solution 2: observed mean: 1.82562655734 mean simulated mean: 4.50608084668 observed std : 1.39003773799 mean simulated std : 5.44206119499
, whereas if I do the same in R:
mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354) meanlog <- mean(log(mydata)) sdlog <- sd(log(mydata)) means <- c() stdes <- c() for (i in 1:1000){ rnd.log <- rlnorm(length(mydata), meanlog, sdlog) means <- c(means, mean(log(rnd.log))) stdes <- c(stdes, sd(log(rnd.log))) } print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' ')) print (paste('observed std :',sdlog ,'mean simulated std :',mean(stdes),sep=' '))
I get:
[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317" [1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"
which is much closer, so I think I'm doing something wrong when using scipy ...