Determining the lognormal distribution using Scipy vs Matlab

I am trying to fit a lognormal distribution using Scipy. I have already done this in Matlab, but because I need to extend the application beyond statistical analysis, I am trying to reproduce the fitted values in Scipy.

The following is the Matlab code that I used for my data:

    % Read input data (one value per line)
    x = [];
    fid = fopen(file_path, 'r');  % reading is the default action for fopen
    disp('Reading network degree data...');
    if fid == -1
        disp('[ERROR] Unable to open data file.')
    else
        while ~feof(fid)
            x = [x fscanf(fid, '%f', [1])];
        end
        c = fclose(fid);
        if c == 0
            disp('File closed successfully.');
        else
            disp('[ERROR] There was a problem with closing the file.');
        end
    end

    [f, xx] = ecdf(x);
    y = 1 - f;             % empirical CCDF
    parmhat = lognfit(x);  % MLE estimate
    mu = parmhat(1);
    sigma = parmhat(2);

And here is the resulting plot:

[plot: empirical CCDF with the Matlab lognormal fit]

Now here is my Python code with the goal of achieving the same:

    import math
    from scipy import stats
    from statsmodels.distributions.empirical_distribution import ECDF

    # The same input is read into the list `degrees` in Python
    ecdf_func = ECDF(degrees)
    x = ecdf_func.x
    ccdf = 1 - ecdf_func.y

    # Fit the data (location fixed at 0)
    shape, loc, scale = stats.lognorm.fit(degrees, floc=0)

    # Parameters
    sigma = shape         # standard deviation of log(x)
    mu = math.log(scale)  # mean of log(x)

    fit_ccdf = stats.lognorm.sf(x, sigma, loc=0, scale=scale)
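As a sanity check of the parameter mapping used above, here is a sketch (on synthetic data with known parameters, standing in for the original degree list) showing that with `floc=0`, scipy's `shape` is sigma and `log(scale)` is mu:

```python
import numpy as np
from scipy import stats

# Synthetic lognormal sample with known parameters
rng = np.random.default_rng(0)
true_mu, true_sigma = 1.5, 0.8
data = rng.lognormal(mean=true_mu, sigma=true_sigma, size=100_000)

# With floc=0, scipy's shape parameter is sigma and log(scale) is mu
shape, loc, scale = stats.lognorm.fit(data, floc=0)
mu, sigma = np.log(scale), shape

print(mu, sigma)  # should be close to 1.5 and 0.8
```

If the recovered values are far from the true ones on data like this, the parameter conversion (not the fit itself) is the first thing to suspect.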

And here is the fit produced by the Python code:

[plot: empirical CCDF with the Python lognormal fit]

As you can see, both pieces of code produce good fits, at least visually.

The problem is that there is a huge difference in the estimated parameters mu and sigma.

From Matlab: mu = 1.62, sigma = 1.29. From Python: mu = 2.78, sigma = 1.74.

Why is there such a difference?

Note: I double-checked that both datasets are read in the same way: the same number of points, the same distribution.

Your help is much appreciated! Thanks in advance.

Additional Information:

    import scipy
    import numpy
    import statsmodels

    scipy.__version__        # '0.9.0'
    numpy.__version__        # '1.6.1'
    statsmodels.__version__  # '0.5.0.dev-1bbd4ca'

Matlab version is R2011b.

Edit:

As shown in the answer below, the error lies with Scipy 0.9. I can reproduce the mu and sigma results from Matlab using Scipy 0.11.0.

An easy way to update your Scipy:

    pip install --upgrade scipy

If you don't have pip (you should!):

    sudo apt-get install python-pip
1 answer

The fit method has a bug in scipy 0.9.0, which is fixed in later versions of scipy.

The output of the script below should be:

    Explicit formula:   mu = 4.99203450, sig = 0.81691086
    Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
    Fit x to lognorm:   mu = 4.99203468, sig = 0.81691081

but with scipy 0.9.0, it is:

    Explicit formula:   mu = 4.99203450, sig = 0.81691086
    Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
    Fit x to lognorm:   mu = 4.23197270, sig = 1.11581240

The following test script shows three ways to get the same result:

    import numpy as np
    from scipy import stats

    def lognfit(x, ddof=0):
        x = np.asarray(x)
        logx = np.log(x)
        mu = logx.mean()
        sig = logx.std(ddof=ddof)
        return mu, sig

    # A simple data set for easy reproducibility
    x = np.array([50., 50, 100, 200, 200, 300, 500])

    # Explicit formula
    my_mu, my_sig = lognfit(x)

    # Fit a normal distribution to log(x)
    norm_mu, norm_sig = stats.norm.fit(np.log(x))

    # Fit the lognormal distribution
    lognorm_sig, _, lognorm_expmu = stats.lognorm.fit(x, floc=0)

    print "Explicit formula:   mu = %10.8f, sig = %10.8f" % (my_mu, my_sig)
    print "Fit log(x) to norm: mu = %10.8f, sig = %10.8f" % (norm_mu, norm_sig)
    print "Fit x to lognorm:   mu = %10.8f, sig = %10.8f" % (np.log(lognorm_expmu), lognorm_sig)
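The script above dates from the scipy 0.9/Python 2 era. For readers on Python 3 with a current SciPy, the same three-way comparison can be sketched as follows (same data, behavior unchanged, only the print syntax updated):

```python
import numpy as np
from scipy import stats

def lognfit(x, ddof=0):
    """MLE of (mu, sigma) for a lognormal: fit a normal to log(x)."""
    logx = np.log(np.asarray(x))
    return logx.mean(), logx.std(ddof=ddof)

# The same seven-point data set as in the original script
x = np.array([50., 50, 100, 200, 200, 300, 500])

# 1. Explicit formula
my_mu, my_sig = lognfit(x)

# 2. Fit a normal distribution to log(x)
norm_mu, norm_sig = stats.norm.fit(np.log(x))

# 3. Fit the lognormal distribution with the location fixed at 0
shape, loc, scale = stats.lognorm.fit(x, floc=0)

print(f"Explicit formula:   mu = {my_mu:.8f}, sig = {my_sig:.8f}")
print(f"Fit log(x) to norm: mu = {norm_mu:.8f}, sig = {norm_sig:.8f}")
print(f"Fit x to lognorm:   mu = {np.log(scale):.8f}, sig = {shape:.8f}")
```

On any scipy newer than 0.9.0, all three lines should agree to several decimal places.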

With ddof=1 in the standard deviation calculation, so that the unbiased estimator is used:

    In [104]: x
    Out[104]: array([  50.,   50.,  100.,  200.,  200.,  300.,  500.])

    In [105]: lognfit(x, ddof=1)
    Out[105]: (4.9920345004312647, 0.88236457185021866)

The Matlab lognfit documentation notes that when censoring is not used, lognfit computes sigma using the square root of the unbiased estimate of the variance. This corresponds to using ddof=1 in the code above.
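The two conventions differ only by a factor of sqrt(n/(n-1)), which fully explains the gap between 0.8169 and 0.8824 on the seven-point data set; a quick sketch:

```python
import numpy as np

x = np.array([50., 50, 100, 200, 200, 300, 500])
logx = np.log(x)
n = len(x)

sig_mle = logx.std(ddof=0)       # scipy / MLE convention (divide by n)
sig_unbiased = logx.std(ddof=1)  # Matlab lognfit convention (divide by n - 1)

# The two estimates differ only by the factor sqrt(n / (n - 1))
print(sig_mle, sig_unbiased, sig_mle * np.sqrt(n / (n - 1)))
```

So once mu matches, any remaining discrepancy in sigma of exactly this factor points to the ddof convention, not a bug.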


Source: https://habr.com/ru/post/1204828/

