Fitting a lognormal distribution in Python using curve_fit

I have a hypothetical function y of x and am trying to find/fit the lognormal distribution curve that best shapes the data. I use the curve_fit function and can fit a normal distribution, but the fit is not good.

The data points y and x are shown below, where y = f(x).

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05] 

The y axis is the probability of the event occurring in the bins along the x axis:

 x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0] 

I was able to get a better fit to my data in Excel using a lognormal approach. When I try to fit a lognormal in Python, the fit does not work, so I must be doing something wrong.

Below is the code I use for the normal distribution, which seems to be the only distribution I can fit in Python (hard to believe):

```python
# fitting a distribution on top of the savitzky-golay output
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

# def gamma_f(x, a, loc, scale):
#     return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    # print('loc: ', loc, 'scale: ', scale, "\n")
    return norm.pdf(x, loc, scale)

fitting = norm_f

# param_bounds = ([-np.inf, 0, -np.inf], [np.inf, 2, np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result
# mod scale
# results_adj = [result_mod[0][0] * .75, result_mod[0][1] * .85]

plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0, 35, 0, .1])

# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_ * sum_ys for _ in y_norm_fit]
print(list(y_fit))

plt.show()
```

I am trying to get answers to two questions:

  1. Is this the best fit I can get from a normal distribution curve? How can I improve it?

Normal distribution result: [plot of the normal fit over the data]

  2. How can I fit a lognormal distribution to this data, or is there a better distribution I could use?

I played with a lognormal distribution curve, tuning mu and sigma, and it seems to suit the data better. I do not understand what I am doing wrong that I cannot get similar results in Python.
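For reference, the lognormal fit I am attempting looks like the sketch below; on synthetic data it only converges when an explicit p0 is passed (curve_fit otherwise starts every parameter at 1.0). The shape 0.5 and scale 10.0 here are made-up values, not derived from the data above:

```python
import numpy as np
from scipy.stats import lognorm
from scipy.optimize import curve_fit

def lognorm_f(x, s, scale):
    # scipy's parameterisation: s = sigma, scale = exp(mu), loc fixed at 0
    return lognorm.pdf(x, s, loc=0, scale=scale)

x = np.arange(1.0, 35.0)
# synthetic stand-in for the normalised y_axis values above
rng = np.random.default_rng(0)
y = lognorm_f(x, 0.5, 10.0) + rng.normal(0.0, 1e-3, x.size)

# without p0, curve_fit starts at (1.0, 1.0) and the lognormal fit
# usually fails to converge; a rough initial guess fixes that
popt, pcov = curve_fit(lognorm_f, x, y, p0=[1.0, 10.0])
print(popt)  # close to the true (0.5, 10.0)
```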

3 answers

In fact, a Gamma distribution might be appropriate, as suggested by @Glen_b. I use the second parameterisation, with shape α and rate β.

NB: the trick I use for quick fitting is to compute the mean and variance; for a typical two-parameter distribution this is enough to recover the parameters and get a quick idea of whether they fit or not.

[plot: gamma fit overlaid on the data]

code

```python
import math
import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

# method of moments: weighted mean and variance of the distribution
m = 0.0
for k in range(len(x_axis)):
    m += y_axis[k] * x_axis[k]
v = 0.0
for k in range(len(x_axis)):
    t = x_axis[k] - m
    v += y_axis[k] * t * t
print(m, v)

# for a gamma, mean = a/b and variance = a/b**2, so:
b = m / v
a = m * b
print(a, b)

# gamma density with shape a and rate b
z = []
for k in range(len(x_axis)):
    q = b**a * x_axis[k]**(a - 1.0) * math.exp(-b * x_axis[k]) / math.gamma(a)
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
```
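The hand-rolled density in the code uses the shape/rate form; it can be cross-checked against `scipy.stats.gamma`, which uses a shape/scale parameterisation (scale = 1/β). A small sketch with illustrative parameter values, not values fitted to the data:

```python
import math
from scipy.stats import gamma

# illustrative shape/rate values; the code above computes its own a and b
# from the data's mean and variance
a, b = 4.5, 0.5
x = 9.0

manual = b**a * x**(a - 1.0) * math.exp(-b * x) / math.gamma(a)
# scipy's gamma takes shape and scale, so the rate b becomes scale = 1/b
scipy_val = gamma.pdf(x, a, loc=0, scale=1.0 / b)
print(manual, scipy_val)  # the two values agree
```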

Note that if a lognormal curve is correct and you take logs of both variables, you should see a quadratic relationship; even if that is not a suitable scale for the final model (because of variance effects: if your variance is nearly constant on the original scale, it will over-weight the small values), it should at least give good starting values for a nonlinear fit.

Indeed, apart from the first two points, it looks pretty good:

[log-log plot showing a nearly quadratic relationship]

A quadratic fit to the solid points would describe this data well enough and should give suitable starting values if you want to carry out a nonlinear fit.

(Unless error in x is impossible, the lack of fit at the lowest x values may be as much a problem of error in x as of error in y.)
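A minimal sketch of that quadratic log-log fit, on synthetic data generated from an assumed unnormalised lognormal shape (`np.polyfit` stands in for whatever regression routine you prefer):

```python
import numpy as np

# synthetic stand-in for the (x_axis, y_axis) data, generated from an
# unnormalised lognormal shape y = c * exp(-(log x - mu)^2 / (2 sigma^2)) / x
x = np.arange(1.0, 35.0)
mu, sigma, c = np.log(10.0), 0.5, 1.0
y = c * np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma ** 2)) / x

# quadratic fit on the log-log scale: log y = b2 (log x)^2 + b1 log x + b0
b2, b1, b0 = np.polyfit(np.log(x), np.log(y), 2)

# matching coefficients against the lognormal shape gives
# b2 = -1/(2 sigma^2) and b1 = mu/sigma^2 - 1
sigma_hat = np.sqrt(-1.0 / (2.0 * b2))
mu_hat = (b1 + 1.0) * sigma_hat ** 2
print(sigma_hat, mu_hat)  # recovers sigma = 0.5 and mu = log(10)
```

These recovered values are the starting point for the nonlinear fit mentioned above.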

By the way, this plot suggests the gamma curve may fit a little better than the lognormal (in particular if you do not want to down-weight the influence of the first two points relative to points 4-6). A good initial fit for it can be obtained by regressing log(y) on x and log(x):

[plot: fitted gamma curve shown on a log scale]

The scaled gamma density is g = c x^(a-1) exp(-bx); taking logs, log(g) = log(c) + (a-1) log(x) - bx = b0 + b1 log(x) + b2 x, so supplying log(x) and x to a linear regression routine will fit it. The same caveats about variance effects apply (so this may be best used as a starting point for a nonlinear least-squares fit, unless your relative error in y is nearly constant).
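A sketch of that regression, again on synthetic data with assumed parameter values: regress log y on the columns [1, log x, x] and read a, b, c back off the coefficients.

```python
import numpy as np

# synthetic stand-in generated from the scaled gamma g = c * x^(a-1) * exp(-b*x)
x = np.arange(1.0, 35.0)
a_true, b_true, c_true = 3.0, 0.3, 1.0
y = c_true * x ** (a_true - 1.0) * np.exp(-b_true * x)

# log g = b0 + b1 log(x) + b2 x, so regress log y on [1, log x, x]
A = np.column_stack([np.ones_like(x), np.log(x), x])
b0, b1, b2 = np.linalg.lstsq(A, np.log(y), rcond=None)[0]

a_hat = b1 + 1.0      # shape: b1 = a - 1
b_hat = -b2           # rate:  b2 = -b
c_hat = np.exp(b0)    # scale factor
print(a_hat, b_hat, c_hat)  # recovers 3.0, 0.3, 1.0
```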


A discrete distribution might be a better fit: your x values are all integers. Your variance is about 3 times the mean and the distribution is skewed, so something like a Negative Binomial is likely to work quite well. Here is a quick fit:

[plot: negative binomial fit overlaid on the data]

r comes out slightly above 6, so you could move on to a distribution with real-valued r: the Pólya distribution.

code

```python
from scipy.special import comb  # scipy.misc.comb was removed in newer scipy
import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

s = 1.0  # shift x by 1 so the support starts at 0

# weighted mean and variance on the shifted support
m = 0.0
for k in range(len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)
v = 0.0
for k in range(len(x_axis)):
    t = x_axis[k] - s - m
    v += y_axis[k] * t * t
print(m, v)

# for a negative binomial, p = 1 - mean/variance and r = mean*(1-p)/p
p = 1.0 - m / v
r = int(m * (1.0 - p) / p)
print(p, r)

# pmf on the shifted support k = x - 1 = 0, 1, 2, ...
z = []
for k in range(len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
```
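For the real-valued r mentioned above, `comb(k + r - 1, k)` can be replaced by the gamma-function form Γ(k + r) / (k! Γ(r)); a sketch using `scipy.special.gammaln` for numerical stability (the r and p values here are illustrative, not fitted to the data):

```python
import math
from scipy.special import gammaln

def polya_pmf(k, r, p):
    # negative binomial pmf generalised to real-valued r:
    # Gamma(k + r) / (k! * Gamma(r)) * (1 - p)^r * p^k
    log_coef = gammaln(k + r) - gammaln(k + 1.0) - gammaln(r)
    return math.exp(log_coef + r * math.log(1.0 - p) + k * math.log(p))

# illustrative real-valued parameters (not fitted to the data above)
r, p = 6.3, 0.68
probs = [polya_pmf(k, r, p) for k in range(34)]
print(sum(probs))  # most of the mass falls inside k = 0..33
```

With integer r this reduces exactly to the comb-based formula in the code above.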

Source: https://habr.com/ru/post/1266391/

